How we found who was “poisoning” our memcached server

Memcached is a straightforward memory cache system: you send it a key/value pair and it stores them in memory.

A while ago, one of our memcached servers started being populated with outdated data and, of course, returning that outdated data.

We had to find out which server was doing that. But how? We decided it was time for some strace. We took the memcached server out of production so it would be easier to trace the system calls. After that, we started:

$ ssh myserver
$ pgrep -l memcached
13708 memcached
$ sudo strace -f -t -s1024 -p 13708

Then we found this very useful line:

[pid 13714] 10:32:57 read(38, "set 357933550488859b4caae308d73f2df7 2 6 252\r\n{\"status\": 401, \"container_count\": null, \"storage_policies\": {\"0\": {\"object_count\": 0, \"container_count\": 0, \"bytes\": 0}, \"1\": {\"object_count\": 0, \"container_count\": 0, \"bytes\": 0}}, \"bytes\": null, \"total_object_count\": null, \"meta\": {}, \"sysmeta\": {}}\r\n", 2048) = 300

This was the outdated data. And the first number after “read” is the file descriptor from which memcached was receiving that data.

OK, but who is “38”? Well, this is a good moment to call your good friend “lsof”. lsof can list all open files on your system, including TCP connections!

$ sudo lsof -n -p 13708 | grep 38
memcached 13708 memcached 38u IPv4 2140063769 0t0 TCP <client>:<port>-><memcached-server>:<port> (ESTABLISHED)


So that client server was connected to our memcached and sending outdated data!
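On Linux you can also resolve a file descriptor straight through procfs; for a socket, the inode in the socket:[...] link should match the number lsof prints in its DEVICE column (2140063769 above). A quick sketch, demonstrated here on the current shell since the memcached PID only exists in the story:

```shell
# Every process exposes its open fds under /proc/<pid>/fd (Linux-only).
ls -l /proc/$$/fd
# For the memcached case it would be: ls -l /proc/13708/fd/38
# which, for a socket, prints something like: 38 -> socket:[2140063769]
```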

After that discovery, we logged into the offending server and found out it hadn’t been restarted after its last configuration change. We restarted it and everything went back to normal. Problem solved!

Next steps: we are working on improving our deployment process. We know Puppet, our configuration management tool, can automatically restart services after a configuration file change, but we have some worries about this automation. Since we receive 1 billion req/day, this service is quite important and we can’t risk an outage due to bad automation.

That is all, folks. I hope you enjoyed the post and please feel free to comment with any tip or question you have about it!

See you all on the next post 🙂

TCP connection (usually) is not a good test

Yesterday we were setting up a master/follower cluster.

The point: to know whether the master is alive, the follower only opens a TCP connection to it.

Why is that a problem?

Due to an unsuccessful execution of a “poweroff” command, the master could still accept TCP connections but couldn’t answer real requests. Since the follower only checks whether it can establish a TCP connection with the master, and it could, it never promoted itself to master as expected, making our cluster unavailable.

What did we learn from that?

If you want to test a service, make a real request. Don’t use “ping”, “telnet” or anything like that. Do a REAL request.

IMPORTANT: make sure your requests are real but also light. You don’t want your monitors/tests to overload your service.
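To make the difference concrete, here is a small Python sketch; the throwaway HTTP server stands in for the master. The TCP check only proves something is listening, while the real check requires an actual answer:

```python
import http.server
import socket
import threading
import urllib.request

# Toy HTTP service standing in for the "master".
server = http.server.HTTPServer(("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def tcp_check(host, port):
    """Shallow check: only proves something is listening on the port."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

def real_check(host, port):
    """Real (and light) request: the service must actually answer."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

tcp_ok = tcp_check("127.0.0.1", port)
real_ok = real_check("127.0.0.1", port)
print(tcp_ok, real_ok)
server.shutdown()
```

In the outage described above, tcp_check would have kept returning True while real_check would have failed, and the follower would have promoted itself.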

Thanks for reading 🙂


OSI model helps in the real world

Our recently created worker, which runs once a day, stopped working. The exception was quite straightforward: “OperationalError: (2006, ‘MySQL server has gone away’)”.

So, let’s tackle the issue. Is the MySQL server up? Can the servers reach each other?

$ telnet <mysql-server> 3306
Connected to <mysql-server>.
Escape character is '^]'.

All right, the server is up and I have access to it. So what is happening?

Me: – Dev, how do you connect to the server?
Dev: – I’m using the Django ORM. Usually it works quite well.

OK. Let’s see if the worker server has an open connection with the MySQL server:

$ netstat -na|grep 3306

So, the server has a connection but the app CAN’T use it? (The server runs only this worker, so I know this connection belongs to it.)

And this is the moment that it’s useful to understand a little bit about the OSI model.

While the servers had a valid TCP connection (layer 4) between each other, since the application only runs once a day for a few minutes and then sits idle, MySQL dropped the session (layer 5) but the TCP connection stayed established. As the application wasn’t aware that its session was invalid, it kept trying to use it.

How to solve that? We close our connections after running the worker tasks and make sure the app opens a new one when necessary. In Django it can be achieved with this code:

from django.db import connections

for conn in connections.all():
    conn.close()
Good! Problem solved.
Last question: why have we never seen this error in other Django applications?
Answer: probably because those applications never stay idle for long, so their sessions never get closed by MySQL. And that is a good thing to keep in mind when you’re coding a worker instead of a daemon.
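As a side note, Django also ships a built-in knob for recycling connections: CONN_MAX_AGE. A settings.py sketch (database name and credentials are placeholders):

```python
# settings.py fragment: let Django itself discard old connections
# instead of reusing them forever.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mydb",  # placeholder
        # Drop connections older than 60 seconds. Pick a value below
        # MySQL's wait_timeout so Django gives up the connection
        # before MySQL does.
        "CONN_MAX_AGE": 60,
    }
}
```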


How to test network access without telnet or netcat

If you want to check TCP connectivity to a specific host, instead of using telnet or netcat you can use echo with bash’s /dev/tcp pseudo-device. It’s really simple. Try this:

$ echo "" > /dev/tcp/<host>/<port> && echo "Success"

If the command prints Success, you have access. If the command hangs, you can press Ctrl+C to stop it, and now you know you don’t have connectivity.

You can try this using a “wrong” port and then press Ctrl + C:

$ echo "" > /dev/tcp/<host>/<wrong-port> && echo "Success"

This is quite useful when you’re dealing with containers or really small VMs.
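If you’d rather not babysit the hang, GNU coreutils’ timeout can bound the wait. A sketch with example host/port values:

```shell
# host/port are just example values; substitute the service you want to test.
host=127.0.0.1
port=4242
if timeout 3 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
  echo "Success"
else
  echo "Failure"
fi
```

Note that /dev/tcp is a bash feature, hence the explicit `bash -c` (plain sh would try to create a literal file).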


Enforce versions

This is something I’ve learned in my 16 years running IT Operations:

If you want to truly know your environment, control your dependencies’ versions. Enforce versions. Do not use “latest” without testing it.

Even if you have automated tests, even if you have CI/CD, even if you have 100% coverage: you can’t ensure something won’t break.

You can test your entire code base and monitor all your services, but you can’t test the entire realm of possibilities in the world!

Things can break in an unimaginable number of ways. That is why even Google and Facebook have outages once in a while.
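In a Python project, for instance, that means pinning exact versions in requirements.txt instead of floating on the latest release (package names and versions below are just illustrative):

```text
# requirements.txt: exact pins, not "Django" or "Django>=3.2"
Django==3.2.18
requests==2.28.2
```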


Variables == ‘single quote’ and logs

Yesterday our team was struggling with an API call to one of our services. The client we developed for our API was getting an “Unauthorized” message for a request that worked fine in the dev environment. At first we thought our token was not valid, so we started investigating.

After some unsuccessful attempts to solve it, we decided to re-read the error message on our API server. It turned out to be pretty straightforward:

Identity server rejected authorization necessary to fetch token data

The problem was not our client token: our API itself didn’t have the right credentials to do the token validation.

Checking our configs, we found out that one of our passwords had only 6 chars. What? That is not what we do! What happened? This is what happened:

$ export API_PASSWORD=123$abc#de

Would you like to try this in your terminal? Type this:

$ export API_PASSWORD_ABC=123$abc#de

So, what happened? YES! Your shell thought $abc was a variable reference. And since you probably don’t have a variable named abc, it replaced $abc with an empty string, ending up with a result like this: 123#de.

So, what did we do?

$ export API_PASSWORD_ABC='123$abc#de'

Single quotes! Inside single quotes everything is literal; the only character you can’t put inside them is the single quote itself.
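You can watch the expansion happen in any Bourne-like shell:

```shell
unset abc                       # make sure $abc really is unset
export PW_UNQUOTED=123$abc#de   # $abc expands to empty: value becomes 123#de
export PW_QUOTED='123$abc#de'   # single quotes keep it literal
echo "$PW_UNQUOTED"
echo "$PW_QUOTED"
```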

Lessons learned:

  1. Be careful when exporting strings. If possible, always use single quotes.
  2. LOGS MATTER! Good log messages can save hours of work!

If you read this far, thanks for your time, and if it helped you somehow, or if you would like to point something out, please feel free to drop me a comment.

Using curl to test 1 server in a cluster


Today I was checking some SSL certificates and I realized that I couldn’t test a specific server in my web cluster using this:

$ curl -H 'Host: <hostname>' https://<specific-server>/

The reason is SNI. I won’t get into details, but you can read more about it here.

So, to do that request, you can use:

$ curl -v --connect-to <hostname>:443:<specific-server>:443 https://<hostname>/

--connect-to seems to be a very useful option for that kind of issue.

Generating passwords on command line

Navigating through one of those awesome pages on GitHub, I found this command to generate passwords on Linux/BSD/macOS:

$ LC_ALL=C tr -d -c "[:alpha:][:alnum:]" < /dev/urandom | head -c 20

Pretty nice, huh? You run it and it generates a 20-character string of numbers and letters. Yes. But how??? tr is one of those simple commands that take some time to understand.

I decided I would understand what it does. It took some time, but I (almost) managed it:

LC_ALL=C
=> sets tr to use the C locale. (I’ll have to write a post about locales.)

tr
=> command to translate or remove characters.

-d
=> delete the following chars.

-c "[:alpha:][:alnum:]"
=> the complement: everything but these chars. So, joined with -d, tr will NOT remove letters and numbers.

< /dev/urandom
=> feeds random bytes to tr. Many of them are special characters, which are removed by the switches just described.

| head -c 20
=> reads tr’s output until it reaches 20 chars and prints them.

If you want to generate a password with special characters too, you can use this:

$ LC_ALL=C tr -d -c "[:print:]" < /dev/urandom | head -c 20
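If you have Python at hand, its standard library can do the same job; the secrets module is designed for passwords and tokens. A sketch mirroring the letters-and-digits version above:

```python
import secrets
import string

# Letters + digits, mirroring tr's [:alpha:]/[:alnum:] classes.
alphabet = string.ascii_letters + string.digits
password = "".join(secrets.choice(alphabet) for _ in range(20))
print(password)
```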


RPM useful commands

If you want to start building your own RPMs, I recommend this video:

Here I document some useful information from the video.

Package: rpmdevtools

a series of useful tools to help you build your RPMs.

Package: mock

a tool to help you build RPMs for different architectures.

Command: $ echo '%_topdir /your/directory/to/rpmbuild' > ~/.rpmmacros

sets your default rpmbuild directory

Command: $ rpmdev-setuptree

builds the tree of directories you will use to build your RPMs.

Command: $ rpm --eval '%{__python}'

shows the value of the %{__python} macro, something like '/usr/bin/python'

Command: $ cd ~/rpmbuild/SOURCES && spectool -g ../SPECS/python-dateutil.spec

downloads all the sources listed in the spec file

Command: $ rpmbuild -bp SPECS/python-dateutil.spec

runs just your %prep step. %prep can include %setup, %patch and any command that runs before the %build step.

Command: $ rpmbuild [--short-circuit] -bc SPECS/python-dateutil.spec

with --short-circuit => runs just your %build step, skipping %prep
without --short-circuit => runs %prep and then %build

Command: $ rpmbuild [--short-circuit] -bi SPECS/python-dateutil.spec

with --short-circuit => runs just your %install step, skipping %prep and %build
without --short-circuit => runs %prep, %build and then %install

Command: $ rpmbuild -bb SPECS/python-dateutil.spec

builds the binary RPM of your package

Command: $ rpmbuild -bs SPECS/python-dateutil.spec

builds the source RPM (SRPM) of your package

Command: $ rpmbuild -ba SPECS/python-dateutil.spec

builds both the binary RPM and the source RPM of your package



Initiate the mock repository
$ mock -r centos-7-x86_64 --init
Install dependencies
$ mock -r centos-7-x86_64 --install …
Build your package without cleaning your recently installed dependencies
$ mock -r centos-7-x86_64 --no-clean --rebuild …

Command: $ mock -r centos-7-x86_64 --shell

opens a shell inside the given mock chroot.

PATH: /var/lib/mock/<architecture>/result

holds the logs and the built RPMs.
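For context, here is a minimal, hypothetical spec skeleton that the rpmbuild steps above operate on (package name and file paths are placeholders):

```spec
Name:           hello-tool
Version:        1.0
Release:        1%{?dist}
Summary:        Example package
License:        MIT
Source0:        %{name}-%{version}.tar.gz

%description
Toy package to illustrate the %prep/%build/%install steps above.

%prep
%setup -q

%build
# nothing to compile in this toy example

%install
mkdir -p %{buildroot}%{_bindir}
install -m 0755 hello %{buildroot}%{_bindir}/hello

%files
%{_bindir}/hello

%changelog
```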