We sometimes have hosts with processes that stop working, which shows up in “status details”. Usually when this happens we can kill the Linux PID for that process and fdbmonitor respawns it.
In other cases, the process isn’t running and won’t restart because of this error in the trace log:
This sounds like you’re stuck in TIME_WAIT, which is perhaps reasonable because you’ve hard killed the process, but inconvenient. (Here’s a nice blog post for an overview of TIME_WAIT.)
However, apparently I don’t understand how sockets work anymore. We don’t set SO_REUSEADDR or SO_REUSEPORT, and yet I can repeatedly kill an fdbserver that’s part of a database, and it comes back without complaining about the port being in use. My memory of working with the raw BSD sockets API is that SO_REUSEADDR was required to be able to Ctrl-C and re-launch without waiting, so I’m confused why we’re not seeing that. I’ve scraped through as much of our prod logs as I can before my queries time out, and I can’t find any logged TraceEvent with Type=BindError Error=address_in_use.
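For anyone who wants to poke at the bind behaviour outside of fdbserver, here is a minimal Python sketch. It is not FoundationDB code; `try_bind` and `demo` are hypothetical helper names, and it assumes a POSIX host. It only shows the uncontroversial case: a plain `bind()` without SO_REUSEADDR fails with EADDRINUSE while another socket is still listening on the port, and succeeds again once that listener is closed and no connections were ever accepted. It does not reproduce the TIME_WAIT case.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1", reuse_addr=False):
    """Try to bind a TCP socket; return None on success, else the errno name."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_addr:
            # Optional: mirror what a server that sets SO_REUSEADDR would do.
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind((host, port))
        return None
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    finally:
        s.close()


def demo():
    """Bind the same port while a listener holds it, then after it is closed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    while_held = try_bind(port)   # the listener is still alive here
    listener.close()
    after_close = try_bind(port)  # nothing was ever connected, so no TIME_WAIT
    return while_held, after_close


if __name__ == "__main__":
    print(demo())
```

Running `try_bind(<port>)` against a port that a killed process used to own is one cheap way to check whether the kernel still considers the address in use, without going through trace logs.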
I chatted with Evan about my inability to comprehend how this isn’t affecting us, and he pointed out that, since the error would happen before TraceEvent files are opened, searching TraceEvent files obviously wouldn’t help. Paging @ajbeamon, because I’m unsure of the proper way to hunt down reproductions for this.
But I’m not aware of us ever having seen processes disappear indefinitely with a BindError.
I’d be surprised if we were seeing this, since we rely on processes dying and coming back up with the same ports very quickly. If we were seeing even a small period where the processes couldn’t bind, I would expect that the processes would fail, and fdbmonitor would then wait 60 seconds before trying again.
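For reference, the 60-second wait corresponds to fdbmonitor's restart delay setting. The fragment below is a sketch of how that typically appears in `foundationdb.conf`; it assumes a stock configuration, so check your own file rather than relying on these values.

```ini
[general]
# Seconds fdbmonitor waits before restarting a process that has exited.
restart_delay = 60
```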