Dead process can't restart, complains address_in_use

We sometimes have hosts with processes that stop working, and “status details” shows this. Usually when this happens we can kill the Linux PID for that process and fdbmonitor respawns it.

In other cases, the PID isn’t running and won’t restart because of this error in the trace log:

<Event Severity="10" Time="1576017300.448982" Type="BindError" Machine="10.23.5.122:4506" ID="0000000000000000" Error="address_in_use" ErrorDescription="Local address in use" ErrorCode="2105" logGroup="default"/>

But tools like lsof or netstat don’t show any process with that port open.

Short of restarting fdb on the host, we’re unclear on how to repair this.

We’re running a variant of 5.2.

This may just be a kernel thing. Stopping the service ended up crashing the host, and it rebooted.

This sounds like you’re stuck in TIME_WAIT, which is perhaps reasonable because you’ve hard killed the process, but inconvenient. (Here’s a nice blog post for an overview of TIME_WAIT.)
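
If that’s the case, the stuck sockets won’t show up in lsof (a TIME_WAIT socket has no owning process), but they should be visible to ss or netstat. A quick check, sketched with the 4506 port from your trace event (adjust to whichever port is refusing to bind):

# List any TCP sockets still in TIME_WAIT whose local port is 4506
ss -tan state time-wait '( sport = :4506 )'

# Or, with netstat, look for TIME_WAIT entries on that port
netstat -ant | grep -w 4506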

However, apparently I don’t understand how sockets work anymore. We don’t set SO_REUSEADDR or SO_REUSEPORT, and yet I can repeatedly kill an fdbserver that’s part of a database, and it comes back without complaining about the port being in use. My memory from doing raw BSD sockets API work is that SO_REUSEADDR was required to be able to ctrl-c and re-launch without waiting, so I’m confused about why we’re not seeing that. I’ve scraped through as much of our prod logs as I can before my queries get timed out, and I can’t find a single Type=BindError Error=address_in_use TraceEvent that was logged.
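
For anyone who wants to double-check what their build actually does at bind time, strace on a test instance would settle it. A sketch (not something to run against production; the fdbserver arguments below are just placeholders for whatever a throwaway test invocation looks like):

# Trace the socket setup syscalls of a scratch fdbserver.
# If the binary sets SO_REUSEADDR, it shows up as a setsockopt() call before bind().
strace -f -e trace=socket,setsockopt,bind /usr/sbin/fdbserver -p auto:4506 -C /etc/foundationdb/fdb.cluster -d /tmp/fdbtest 2>&1 | grep -E 'setsockopt|bind'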

We aren’t, that I know of, actively killing these processes.

We do have a cronjob that renices things.

I chatted with Evan about my inability to comprehend how this isn’t affecting us, and he pointed out that since the error would happen before the TraceEvent files are opened, searching TraceEvent files obviously wouldn’t help. Paging @ajbeamon, because I’m unsure of the proper way to hunt down reproductions of this.

But as far as I’m aware, processes disappearing indefinitely with a BindError isn’t something we’ve ever seen.

And just for transparency, I was wrong about the renicing; it’s actually a cronjob that adjusts OOM scores:

*/5 * * * * /opt/wavefront/repo/tools/oomAdjust.sh -f 'grep [m]nt/fdb/46' -s -1000

oomAdjust.sh is a 116-line shell script that just runs:

sudo echo $OOM_SCORE_ADJ > /proc/$pid/oom_score_adj

$OOM_SCORE_ADJ is “-1000” in this example.

Mentioning because this may or may not be a useful tip for others.
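
One caveat on that snippet for anyone who copies it: with sudo echo … > file, the redirection is performed by the calling shell rather than by sudo, so the write to /proc only works if the script is already running as root (which it presumably is here, via cron). A more defensive variant would be something like:

# Hypothetical variant: have the privileged process perform the write itself
echo "$OOM_SCORE_ADJ" | sudo tee /proc/$pid/oom_score_adj > /dev/null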

I’d be surprised if we were seeing this, since we rely on processes dying and coming back up with the same ports very quickly. If we were seeing even a small period where the processes couldn’t bind, I would expect that the processes would fail, and fdbmonitor would then wait 60 seconds before trying again.

fdbserver should print to stderr when this happens, and if you are using fdbmonitor you could check its logs for this error message.
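
For example, something along these lines, assuming fdbmonitor’s output ends up in syslog on your hosts and adjusting the path for your distro (the strings are taken from the trace event above; the exact stderr wording may differ):

# Search whatever log captures fdbmonitor's output for the bind failure
grep -i -e 'address_in_use' -e 'Local address in use' /var/log/syslog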

This is, more or less, what happens, except in a few cases where syslog and the trace logs are littered with restart failures.

Short of a reboot, I’m not sure how to clear the socket.