All Coordinators Crashed At Same Time

We recently ran into a cluster-down situation where all of our coordinators had issues at the same time, making the database unavailable.

For reference, we are running FDB version 6.0.15 in three-datacenter mode with 54 nodes (18 nodes in each AZ) and 5 coordinators in a 2, 2, 1 configuration. The nodes are c5d.2xlarge instances, and on each machine process 4500 is storage, 4501 is storage, 4502 is transaction, and 4503 is stateless.

The following was being spammed over and over until I restarted foundationdb on every coordinator:

Jul 25 16:50:30 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:50:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 3797 exited 1, restarting in 0 seconds
Jul 25 16:50:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21795) for fdbserver.4500
Jul 25 16:50:32 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": FDBD joined cluster.
Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21795 exited 1, restarting in 57 seconds
Jul 25 16:51:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21800) for fdbserver.4500
Jul 25 16:51:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:51:32 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21800 exited 1, restarting in 56 seconds
Jul 25 16:51:54 ip-10-49-38-6.ec2.internal dhclient[2694]: XMT: Solicit on eth0, interval 117660ms.
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21805) for fdbserver.4500
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21805 exited 1, restarting in 65 seconds
Jul 25 16:53:33 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21811) for fdbserver.4500
Jul 25 16:53:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:53:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21811 exited 1, restarting in 62 seconds
Jul 25 16:53:52 ip-10-49-38-6.ec2.internal dhclient[2694]: XMT: Solicit on eth0, interval 126620ms.
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21816) for fdbserver.4500
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21816 exited 1, restarting in 57 seconds

On a hunch, I tried recreating this behavior in another environment (which runs 6.1.8) by blocking port 4500 input and output using iptables, but I was only able to get the following logs, which are not the same (and quite frankly are expected):

Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (2218) for fdbserver.4500
Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/backup_agent/backup_agent (2217) for backup_agent.1
Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503": Launching /usr/sbin/fdbserver (2221) for fdbserver.4503
Jul 25 20:11:53 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4502": FDBD joined cluster.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500":   Check configuration and availability using the 'status' command with the fdbcli
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503":   Check configuration and availability using the 'status' command with the fdbcli
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4501": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4501":   Check configuration and availability using the 'status' command with the fdbcli

I was curious whether anybody had any ideas about what could have happened, as the journald and dmesg log output has nothing useful.

To be specific: what could cause Fatal Error: Network connection failed from a storage-class process over and over on a coordinator (if that is even relevant here)? This particular error (1026 in the code) is not easily reproducible for me by closing ports via a REJECT or DROP iptables rule.

That is… very strange.

Can you pastebin your tracelogs from one of the coordinators during this time? The output from fdbmonitor is too limited to divine what might have happened.

Do you mind if I tar up the logs for each of these processes and attach them via a file sharing link?

Went ahead and uploaded them. It is a tar.gz containing two tar.gz files of the different logs from one machine. The file is 59 MB, and I apologize in advance for the spammy-looking link, as it was the first site I found for sharing files without registering (also it apparently deletes itself in 7 days): https://we.tl/t-yVNUFnnp4F

If that doesn’t work, just let me know what you’d prefer and I’ll make that work.

Worked for me!

<Event Severity="20" Time="1564073857.327253" Type="N2_AcceptError" ID="0000000000000000" SuppressedEventCount="0" Message="24" Machine="10.49.38.6:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1564073857.327253" Type="ListenError" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x134e1f3 0x134e5f7 0x135b4d9 0x1440d20 0x1440f84 0x465599 0x143fa45 0x143fda7 0x1437abe 0x1438543 0x43a4a7 0x7fa2f3e6202a" Machine="10.49.38.6:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1564073857.327253" Type="StopAfterError" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x94cde7 0x94d0f7 0x45cd09 0x52db2d 0x45cd09 0x134e25a 0x134e5f7 0x135b4d9 0x1440d20 0x1440f84 0x465599 0x143fa45 0x143fda7 0x1437abe 0x1438543 0x43a4a7 0x7fa2f3e6202a" Machine="10.49.38.6:4500" LogGroup="default" Roles="SS" />

This shows that calling accept() returned error 24, which according to errno.h is:

#define EMFILE    24  /* Too many open files */

And man 2 accept says:

EMFILE The per-process limit on the number of open file descriptors has been reached.

… which is a bit strange. Are you sure that whatever environment you’re running in hasn’t rlimit()'d you down in the number of open file descriptors you can have?

If you have a coordinator that’s running right now, and you ls -l /proc/<pid>/fd/, do you see anything strange?
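For what it's worth, the EMFILE behavior is easy to demonstrate outside of FDB. Below is a minimal, standalone C sketch (purely illustrative, not FoundationDB code; the limit value is arbitrary) that lowers its own RLIMIT_NOFILE, queues one pending connection, exhausts its descriptor table, and then shows accept() failing with errno 24:

// emfile_demo.c - standalone illustration, not FoundationDB code.
// Lowers its own fd limit, queues one pending connection, exhausts the
// descriptor table, and shows accept() failing with EMFILE (errno 24).
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/socket.h>

int main(void) {
    struct rlimit rl = { .rlim_cur = 32, .rlim_max = 32 };
    setrlimit(RLIMIT_NOFILE, &rl);                 // mimic a very low ulimit -n

    // Listen on a loopback port chosen by the kernel.
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 16);
    socklen_t len = sizeof(addr);
    getsockname(listener, (struct sockaddr *)&addr, &len);   // learn the port

    // Queue one pending connection so accept() has something to hand out.
    int client = socket(AF_INET, SOCK_STREAM, 0);
    connect(client, (struct sockaddr *)&addr, sizeof(addr));

    // Burn every remaining descriptor slot.
    while (socket(AF_INET, SOCK_STREAM, 0) != -1) {}

    // With the fd table full, accepting the queued connection fails.
    if (accept(listener, NULL, NULL) == -1)
        printf("accept failed: errno=%d (%s)\n", errno, strerror(errno));
    return 0;
}

Compiling and running this (e.g. gcc emfile_demo.c -o emfile_demo && ./emfile_demo) should print errno=24 (Too many open files), the same value that shows up as Message="24" in the N2_AcceptError event above.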

Thinking about your logs slightly more, it does greatly concern me that you were crash-looping with this error.

Having too many open files in a relatively new process would suggest that you were inheriting file descriptors from fdbmonitor, which strongly shouldn’t be happening. If that is the case, you should be able to see a large number of file descriptors pointing to fdbmonitor log files… or something… from an fdbserver process.

However, it’s also odd that this spontaneously happened after your log had rolled 628 times, implying your server had been alive for quite some time, and then happened immediately upon restart many times after that. This really kind of feels like something was walking along and suddenly applied a rather forceful limit to the number of files each process on this system is allowed to have open…

Hey Alex,

Thanks for taking a look and letting me know it had to do with too many open files, as that is a pretty big clue. I have to admit the trace logs look a bit daunting at first.

I see that the limits of the fdbmonitor show the following:

Max open files            1024                 4096                 files

And for 4500, since fdbmonitor started the process, it inherits the limits from it:

Max open files            1024                 4096                 files

Looking in the fd directories, nothing really stands out: a few pipes and a lot of sockets, which is kind of expected.

Here you can download an ls of the /proc/<pid>/fd directory for fdbmonitor and the 4500/4501/4502/4503 processes it started, along with the limits file of each of those processes: https://we.tl/t-e5n6kF2seY

FDB is really the only thing running on this node, and it has been running for at least a few months without issue, so I doubt something suddenly changed the rlimits. That makes me wonder whether something went awry and a bunch of new sockets were spawned, causing the process to exceed its file limit. I’m just really perplexed at how that then bubbled into all 5 coordinators in different AZs hitting the same Too many open files at the same timestamp. Confusion all around!

However, there is some credence to the thought that something was modified, given the failures on restarts. And then, magically, after another restart (and after the last coordinator in the list was restarted), things returned to normal.

So this definitely is related to the number of open file handles, as I tried the following:

  1. prlimit --pid <fdb_monitor_pid> --nofile=50:50
  2. kill <pid_of_4500>

Then, when fdbmonitor tries to restart 4500, it dies with Fatal Error: Network connection failed.
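As an aside, the prlimit command above is essentially a thin wrapper around the prlimit(2) syscall. A small, purely illustrative C sketch of the same clamp (assuming Linux with glibc 2.13 or newer, and run as root or as the same user as the target process) might look like:

// clamp_nofile.c - illustrative only; mirrors `prlimit --pid <pid> --nofile=50:50`.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    struct rlimit new_limit = { .rlim_cur = 50, .rlim_max = 50 };
    struct rlimit old_limit;

    // Apply the new RLIMIT_NOFILE to the target process and report what it was.
    if (prlimit(pid, RLIMIT_NOFILE, &new_limit, &old_limit) != 0) {
        perror("prlimit");
        return 1;
    }
    printf("pid %d: RLIMIT_NOFILE was %llu/%llu, now 50/50\n", (int)pid,
           (unsigned long long)old_limit.rlim_cur,
           (unsigned long long)old_limit.rlim_max);
    return 0;
}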

But the logs prior to the actual incident did not show any restart attempts of the child processes by fdbmonitor; they just started with:

Jul 25 16:50:30 ip-10-49-38-40.ec2.internal fdbmonitor[3779]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed

Still really perplexed.

In a cough cough hypothetical environment where FDB stably runs large workloads, I have:

$ ulimit -n
262144

and an arbitrarily chosen process shows:

$ sudo ls -l /proc/3611/fd | wc -l
1116

So this would be why I haven’t seen this happen, personally. People with other large FDB deployments, @mrz / @markus.pilman, I assume you’ve had to adjust the allowable number of file descriptors upwards as well? If so, we need to get this loudly added to our docs, and probably as a warning in fdbcli status.

All clients and fdbserver processes can (and will, at some point) connect to a coordinator. I’m going to assume that you’ve grown your usage to the point where having ~1000 processes connected to a coordinator is possible, and thus you hit your limit? If so, this could make a lot of sense; otherwise, I need to keep pondering…

The behavior here of coordinators (or what would soon also be the cluster controller) crash-looping is clearly suboptimal. I had never mentally connected before that the Linux default for the number of file descriptors is relatively low, and that keeping a connection open to every client or process would quickly burn through it on a large cluster. This could be changed to reject connections instead of crashing, but that’s not… always going to be a better solution. (If clients take up the first 1024 connection slots, then no fdbserver processes will be able to join the cluster for there to actually be an FDB cluster.)
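For illustration only, the usual way a generic server survives this is the classic "reserve a spare fd" trick: on EMFILE, temporarily release a reserved descriptor, accept the pending connection, and immediately close it, shedding the client instead of dying. This is a sketch of that general technique, not FoundationDB's actual network code:

// Sketch of the classic "reserve an fd" trick for surviving EMFILE in an
// accept loop. Generic single-threaded example only; not how fdbserver works.
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int spare_fd = -1;

void init_spare_fd(void) {
    spare_fd = open("/dev/null", O_RDONLY);   // reserve one descriptor slot
}

// Returns the accepted fd, or -1 if the connection was shed or another error hit.
int accept_or_shed(int listener) {
    int fd = accept(listener, NULL, NULL);
    if (fd >= 0)
        return fd;

    if (errno == EMFILE && spare_fd >= 0) {
        // Out of descriptors: free the spare, accept the pending connection,
        // close it immediately (rejecting the client), then re-reserve the slot.
        close(spare_fd);
        fd = accept(listener, NULL, NULL);
        if (fd >= 0)
            close(fd);
        spare_fd = open("/dev/null", O_RDONLY);
        fprintf(stderr, "fd limit reached; shed one connection instead of dying\n");
        return -1;
    }
    return -1;   // other errors: let the caller decide
}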

I’ll ask around with more ops-related folks to see if there’s something wiser to do here.

We do have “max” file descriptors and no default ulimit settings. And a lot of TCP sysctl settings.

Slide 17 in 2018’s summit talk has some of those.

YMMV - my environment isn’t hypothetical :slight_smile:

Ah yes, running out of file descriptors…

It took us quite a while to run into this issue (also because we do set the ulimit numbers higher by default). In our case only some random coordinator would occasionally run out of file descriptors (and it was also a storage node, which made debugging even weirder). I think the reason it behaved less deterministically is that on AWS ping latencies over 1s happen relatively often, and therefore not all clients are connected at all times.

As you said, setting the ulimit is probably something we should document in a very prominent place. One thing that also makes this weird is that fdbmonitor by default runs as root while the fdbserver processes run as another user (or at least should run as a user other than root), and the limits are different for each of them.

So my biggest takeaway going forward is that we will set the file descriptor limit to at least 32768 or so (something a lot larger than 1024/4096) for fdbmonitor so that the child processes can inherit it.

However, I’m still really curious how all 5 coordinators tipped over at the same time, and why restarting them didn’t seem to clear anything up until they had all been restarted. It just seemed very erratic, as I’d expect to lose a random coordinator here or there with such a small ulimit.

But I will definitely second adding a note to the docs about increasing ulimits, and maybe a note about this particular error, because it is misleading at first glance.

Also maybe ship a larger LimitNOFILE setting in the systemd service file for foundationdb? Or mention that a /etc/systemd/system/servicename.d/override.conf can be made containing the following to survive upgrades?

[Service]
LimitNOFILE=65536
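(If you go the override.conf route, keep in mind that systemd drop-in changes only take effect after a systemctl daemon-reload and a restart of the foundationdb service; the exact unit name will depend on how the package installs it.)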

Clients read from the coordinators to figure out how to connect to the rest of the cluster. They do a quorum read, which involves connecting to all the coordinators. So as soon as your number of fdbserver processes and client processes exceeded 1024, the 1025th connection caused all your coordinators to crash. When they came back up, those 1025 processes would try to reconnect and thus trigger the issue again. I assume that whatever you did to fully restart your FDB servers also reduced the number of clients or servers?
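As a rough back-of-the-envelope check using the numbers from earlier in the thread (approximate figures, not measurements): 54 machines × 4 fdbserver processes = 216 server processes, each holding a connection to every coordinator, and every client process adds one more connection per coordinator during its quorum read. With a soft limit of 1024 descriptors, minus a handful for logs, pipes, and data files, somewhere around 800 concurrent client processes would be enough to tip every coordinator over at essentially the same moment.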

We’re about to cut a 6.2 release, so I will probably not try to backport the warnings to 6.1, but I’ll work on getting something merged to loudly warn about file descriptor limits. 6.2 already has some more changes to connection management that should lessen how many connections coordinators have at any time.

This is a great idea, that I had not thought of. I’ll do that. Thanks!

Hey Alex,

Thanks for clearing that up – that actually makes a lot of sense, and I just wasn’t putting the pieces together that the quorum read is what caused all the coordinators to crash in this cascading way once 1024 connections were exceeded.

I restarted every FDB server in that cluster, which, along with the applications reading from and writing to FDB giving up on retries, probably helped lower the connection count. Going forward I’ll increase the limits for FDB and hopefully never see this again (at least not in the same way across all the coordinators :sweat_smile:).

Looking forward to the changes in 6.2. Some pre-flight checks would also be useful; thinking of how k8s won’t even start if swap is on, FDB could refuse to start with a warning that the FD limit is too low, or at least spit out some loud warnings as you mentioned.

Anyway thanks again!

I had forgotten about this for a bit, but I’ve finally filed an issue for it, #2077, to make sure that it actually gets addressed. Coincidentally, 6.2 addressed some connection scaling issues: clients will now drop connections to coordinators after bootstrapping, which should help reduce the chance of this being a problem, but it is definitely something we still need to highlight as an availability concern.