We recently ran into a cluster-down situation in which all of our coordinators had issues at the same time, making the database unavailable.
For reference, we are running FDB version 6.0.15 in three-datacenter mode with 54 nodes (18 in each AZ) and 5 coordinators in a 2, 2, 1 configuration. The nodes are c5d.2xlarge, and each machine runs four processes: 4500 and 4501 are storage, 4502 is transaction, and 4503 is stateless.
The following was being spammed over and over until I restarted foundationdb on every coordination server:
Jul 25 16:50:30 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:50:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 3797 exited 1, restarting in 0 seconds
Jul 25 16:50:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21795) for fdbserver.4500
Jul 25 16:50:32 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": FDBD joined cluster.
Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21795 exited 1, restarting in 57 seconds
Jul 25 16:51:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21800) for fdbserver.4500
Jul 25 16:51:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:51:32 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21800 exited 1, restarting in 56 seconds
Jul 25 16:51:54 ip-10-49-38-6.ec2.internal dhclient[2694]: XMT: Solicit on eth0, interval 117660ms.
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21805) for fdbserver.4500
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:52:28 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21805 exited 1, restarting in 65 seconds
Jul 25 16:53:33 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21811) for fdbserver.4500
Jul 25 16:53:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:53:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21811 exited 1, restarting in 62 seconds
Jul 25 16:53:52 ip-10-49-38-6.ec2.internal dhclient[2694]: XMT: Solicit on eth0, interval 126620ms.
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (21816) for fdbserver.4500
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed
Jul 25 16:54:36 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21816 exited 1, restarting in 57 seconds
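In case it helps anyone reading the excerpt above, the restart pattern can be tallied with a small script. This is only a minimal sketch, assuming the journald lines keep exactly the format shown; `summarize_restarts`, the two regexes, and `SAMPLE` are hypothetical names of mine, not part of any FDB tooling.

```python
import re
from collections import Counter

# Pulls the process name and message out of an fdbmonitor journald line.
# Assumes the 'Process="...": message' layout seen in the excerpt above.
LINE_RE = re.compile(r'Process="(?P<proc>[^"]+)": (?P<msg>.*)$')
# Matches the "exited N, restarting in S seconds" message from fdbmonitor.
EXIT_RE = re.compile(
    r"Process (?P<pid>\d+) exited (?P<code>\d+), "
    r"restarting in (?P<delay>\d+) seconds"
)


def summarize_restarts(lines):
    """Count fatal errors per process and collect the restart delays."""
    fatal = Counter()
    delays = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not an fdbmonitor process line (e.g. dhclient)
        proc, msg = m.group("proc"), m.group("msg")
        if msg.startswith("Fatal Error:"):
            fatal[(proc, msg[len("Fatal Error:"):].strip())] += 1
        e = EXIT_RE.match(msg)
        if e is not None:
            delays.setdefault(proc, []).append(int(e.group("delay")))
    return fatal, delays


# Four lines copied from the excerpt above, plus one unrelated dhclient line.
SAMPLE = [
    'Jul 25 16:50:30 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed',
    'Jul 25 16:50:31 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 3797 exited 1, restarting in 0 seconds',
    'Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Fatal Error: Network connection failed',
    'Jul 25 16:50:34 ip-10-49-38-6.ec2.internal fdbmonitor[3794]: LogGroup="default" Process="fdbserver.4500": Process 21795 exited 1, restarting in 57 seconds',
    'Jul 25 16:51:54 ip-10-49-38-6.ec2.internal dhclient[2694]: XMT: Solicit on eth0, interval 117660ms.',
]

fatal, delays = summarize_restarts(SAMPLE)
for (proc, error), count in fatal.items():
    print(f"{proc}: {count} x {error}; restart delays (s): {delays.get(proc, [])}")
```

Feeding it the full journald output for a coordinator would show whether only the 4500 process is crash-looping and how the restart delay changes over time.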
On a hunch, I tried recreating this behavior in another environment (which runs 6.1.8) by blocking input and output on port 4500 with iptables, but I was only able to get the following logs, which are not the same (and, quite frankly, are expected):
Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (2218) for fdbserver.4500
Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/backup_agent/backup_agent (2217) for backup_agent.1
Jul 25 20:11:51 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503": Launching /usr/sbin/fdbserver (2221) for fdbserver.4503
Jul 25 20:11:53 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4502": FDBD joined cluster.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4500": Check configuration and availability using the 'status' command with the fdbcli
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4503": Check configuration and availability using the 'status' command with the fdbcli
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4501": Warning: FDBD has not joined the cluster after 5 seconds.
Jul 25 20:11:56 ip-10-49-58-13.ec2.internal fdbmonitor[2216]: LogGroup="default" Process="fdbserver.4501": Check configuration and availability using the 'status' command with the fdbcli
I was curious whether anybody has any ideas about what could have happened, as the journald and dmesg output has nothing useful.
To be specific, what could cause "Fatal Error: Network connection failed" over and over from a storage-class process on a coordinator (if that is even relevant here)? This particular error (1026 in the code) is not easily reproducible for me by closing ports with a REJECT or DROP iptables rule.