Dead process can't restart, complains address_in_use

mrz · December 10, 2019, 10:44pm

We sometimes have hosts with processes that stop working. “status details” shows this. Usually when this happens we can kill the Linux PID for that process and fdbmonitor respawns the process.

In other cases, the PID isn’t running and won’t restart because of this error in the trace log:

<Event Severity="10" Time="1576017300.448982" Type="BindError" Machine="10.23.5.122:4506" ID="0000000000000000" Error="address_in_use" ErrorDescription="Local address in use" ErrorCode="2105" logGroup="default"/>

But tools like lsof or netstat don’t show any process with that port open.

Short of restarting fdb on the host, it’s unclear how to repair this.

We’re running a variant of 5.2.
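One way to cross-check what lsof and netstat report is to attempt the bind directly. The following is a minimal sketch, not anything from fdbserver itself: `probe_bind` is a hypothetical helper name, and it assumes an IPv4 TCP socket with no socket options set. It reports the symbolic errno the kernel returns for the address.

```python
import errno
import socket


def probe_bind(host: str, port: int) -> str:
    """Try to bind a plain TCP socket to (host, port).

    Returns "ok" if the bind succeeds, otherwise the symbolic errno
    name (for example "EADDRINUSE"). The socket is closed either way,
    so a successful probe does not hold the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            # Map the numeric errno to its name; fall back to the number.
            return errno.errorcode.get(e.errno, str(e.errno))
    return "ok"
```

Calling `probe_bind("10.23.5.122", 4506)` on the affected host would show whether the kernel still refuses the address even though no owning process is listed.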

mrz · December 10, 2019, 10:54pm

This may just be a kernel thing. A service stop crashed the host and it rebooted.

alexmiller · December 11, 2019, 1:10am

This sounds like you’re stuck in TIME_WAIT, which is perhaps reasonable because you’ve hard killed the process, but inconvenient. (Here’s a nice blog post for an overview of TIME_WAIT.)

However, apparently I don’t understand how sockets work anymore. We don’t set SO_REUSEADDR or SO_REUSEPORT, and yet I can repeatedly kill an fdbserver that’s a part of a database, and it comes back without complaining about the port being in use. My memory of doing raw BSD sockets API things before is that SO_REUSEADDR was required to be able to ctrl-c and re-launch without waiting, so I’m confused why we’re not seeing that. I’ve scraped through as much of our prod logs as I can before my queries get timed out, and I can’t find a Type=BindError Error=address_in_use TraceEvent that was logged.
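For reference, this is roughly what setting SO_REUSEADDR looks like with the raw sockets API. It is a generic sketch in Python, not how fdbserver opens its listener; `make_listener` is a hypothetical helper. The option has to be set before `bind`, and on Linux it allows binding to a local address that still has connections lingering in TIME_WAIT.

```python
import socket


def make_listener(host: str, port: int, reuse_addr: bool) -> socket.socket:
    """Create a listening TCP socket, optionally with SO_REUSEADDR set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_addr:
            # Must be set before bind() to have any effect on the bind.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        s.listen()
    except OSError:
        # Do not leak the descriptor if bind/listen fails.
        s.close()
        raise
    return s
```

With `reuse_addr=False`, a restart shortly after the previous listener closed active connections can fail with "address in use" until the TIME_WAIT entries expire.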

mrz · December 11, 2019, 2:29am

We aren’t, that I know of, actively killing these processes.

We do have a cronjob that renices things.

alexmiller · December 11, 2019, 2:32am

I chatted with Evan about my inability to comprehend how this isn’t affecting us, and he pointed out that as the error would happen before TraceEvent files are opened, then searching traceevent files obviously wouldn’t help. Paging @ajbeamon, because I’m unsure of the proper way to hunt down reproductions for this.

But I’m not aware that processes disappearing off indefinitely with BindError is a thing we’ve ever seen.

mrz · December 11, 2019, 4:35am

And just for transparency, I was wrong about renice. It’s a cronjob:

*/5 * * * * /opt/wavefront/repo/tools/oomAdjust.sh -f 'grep [m]nt/fdb/46' -s -1000

Which is a 116 line shell script that just runs:

sudo echo $OOM_SCORE_ADJ > /proc/$pid/oom_score_adj

$OOM_SCORE_ADJ is “-1000” in this example.

Mentioning because this may or may not be a useful tip for others.
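For anyone adapting this tip, here is a small sketch of the same write in Python. The function name and the `proc_root` parameter are hypothetical (the latter exists only so the function can be exercised against a scratch directory). The kernel accepts values from -1000 to 1000, and -1000 exempts the process from the OOM killer. Note that in the shell form `sudo echo ... > file`, the redirection is performed by the calling shell rather than by sudo, so it only works when the caller can already write the file, for example when cron runs it as root.

```python
from pathlib import Path


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for the given PID.

    value must be in [-1000, 1000]; lowering it typically requires root.
    proc_root is an illustrative parameter for testing, not a kernel concept.
    """
    if not -1000 <= value <= 1000:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{value}\n")
```

Run as root, `set_oom_score_adj(pid, -1000)` corresponds to the `-s -1000` invocation in the cron entry above.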

john_brownlee · December 11, 2019, 3:48pm

I’d be surprised if we were seeing this, since we rely on processes dying and coming back up with the same ports very quickly. If we were seeing even a small period where the processes couldn’t bind, I would expect that the processes would fail, and fdbmonitor would then wait 60 seconds before trying again.

ajbeamon · December 11, 2019, 4:06pm

fdbserver should print to stderr when this happens, and if you are using fdbmonitor you could check its logs for this error message.

mrz · December 11, 2019, 4:49pm

This is, more or less, what happens, except in a few cases where syslog and the trace logs are littered with restart failures.

Short of a reboot, I’m not sure how to clear the socket.
