Cluster unavailable after power outage

Hello,

After a power outage at our datacentre, where all five nodes got caught in a reboot loop after coming online, our FoundationDB cluster is now offline.

The overall configuration is as follows:

  • FoundationDB 6.2.8 (yes, I am aware - really old version)
  • 5 node cluster
  • Running in triple redundancy (ssd)
  • 5 proxies and 15 logs

All nodes are identical: 64 cores, 512GB RAM, 17 NVMe drives and a dedicated boot drive.

Unfortunately the person who originally set up the cluster is no longer with the company, and sadly no one here currently has much knowledge of or experience with FDB. I am hoping that by sharing the data below, someone can point us in the right direction. I have hopes that we might get this back online, but I am also facing the dreadful reality that we’re in a stalemate from which there is no recovery.

I suspect that additional information will be required, and I am more than willing to share whatever logs are needed to facilitate this.

Here is the overview of the different classes and their data directories.

ps aux | awk '
{
    class=""; data=""
    for(i=1; i<=NF; i++) {
        if($i == "--class") class=$(i+1)
        if($i == "--datadir") data=$(i+1)
    }
    if(class || data) print "Class:", class, " | Datadir:", data
}'

Class: storage  | Datadir: /var/lib/foundationdb/data/nvme0n1/4500
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme1n1/4501
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme2n1/4502
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme3n1/4503
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme4n1/4504
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme5n1/4505
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme6n1/4506
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme7n1/4507
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme8n1/4508
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme9n1/4509
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme10n1/4510
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme12n1/4511
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme12n1/4512
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme12n1/4513
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme13n1/4514
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme13n1/4515
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme13n1/4516
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme14n1/4517
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme14n1/4518
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme14n1/4519
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme15n1/4520
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme15n1/4521
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme15n1/4522
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4523
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4524
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4525
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4526
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4527
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4528
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4529
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4530
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4531
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4532
Class: transaction  | Datadir: /var/lib/foundationdb/data/nvme11n1/4533
Class: transaction  | Datadir: /var/lib/foundationdb/data/nvme11n1/4534
Class: transaction  | Datadir: /var/lib/foundationdb/data/nvme11n1/4535
Class: stateless  | Datadir: /var/lib/foundationdb/data/4536
Class: stateless  | Datadir: /var/lib/foundationdb/data/4537
Class: stateless  | Datadir: /var/lib/foundationdb/data/4538
Class: stateless  | Datadir: /var/lib/foundationdb/data/4539
Class: test  | Datadir: /var/lib/foundationdb/data/4540
Class: test  | Datadir: /var/lib/foundationdb/data/4541
Class: test  | Datadir: /var/lib/foundationdb/data/4542
Class: test  | Datadir: /var/lib/foundationdb/data/4543
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4544
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4545
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4546
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4547
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4548
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4549
Class: storage  | Datadir: /var/lib/foundationdb/data/nvme16n1/4550

Running 'fdbcli --exec "status details"' returns the following:

fdbcli --exec "status details"

WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/etc/foundationdb/fdb.cluster'.

Recovering transaction server state. Verify that the transaction server
processes are active.

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

Currently active roles in the cluster as of the morning of 2026-07-01:

fdbcli --exec "status json" | jq -r '
  [.cluster.processes[] | {ip: (.address | split(":")[0]), port: (.address | split(":")[1]), roles: ([.roles[].role] | join(","))}]
  | group_by(.ip)[]
  | (.[0].ip + ":") as $ip
  | $ip, map("  - Port " + .port + " -> [" + (if .roles == "" then "none" else .roles end) + "]")[]
' | grep -v none
10.0.0.181:
  - Port 4502 -> [coordinator]
10.0.0.182:
  - Port 4536 -> [master]
  - Port 4505 -> [coordinator]
10.0.0.183:
  - Port 4527 -> [coordinator]
10.0.0.184:
  - Port 4505 -> [coordinator]
  - Port 4537 -> [cluster_controller]
10.0.0.185:
  - Port 4516 -> [coordinator]

As we can clearly see, the cluster is NOT in good shape: only the coordinator, master and cluster_controller roles are reported, and many others are missing.
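To make the gap explicit, a minimal sketch along these lines could list which roles are absent from a role listing like the one above. The function name missing_roles and the expected_roles list are my own assumptions (the role names are what I would expect a healthy 6.2 cluster to report in "status json"), so adjust them as needed.

```shell
# Assumption: role names a healthy 6.2 cluster would normally report.
expected_roles="cluster_controller master proxy resolver log storage data_distributor ratekeeper coordinator"

# Hypothetical helper: reads lines such as "  - Port 4536 -> [master]"
# on stdin and prints every expected role that never appears.
missing_roles() {
    # Pull out the bracketed role lists and split them one role per line.
    present=$(grep -o '\[[^]]*]' | tr -d '][' | tr ',' '\n' | sort -u)
    for role in $expected_roles; do
        printf '%s\n' "$present" | grep -qx "$role" || echo "$role"
    done
}
```

Piping the role listing above into missing_roles prints one missing role per line. For that listing it should report proxy, resolver, log, storage, data_distributor and ratekeeper.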

And here is an overview of the messages with a severity of 20 or higher in the trace logs, per node:

NODE 1

grep -E 'Severity="(20|30|40|50)"' trace*.xml | grep -Eo 'Type="[^"]*?"' | sort | uniq -c | sort -rn
grep: trace.10.0.0.181.56974.1613474518.4S7hxD.4.10966.xml: binary file matches
 171616 Type="SlowSSLoopx100"
   8723 Type="FetchKeysTooLong"
   6099 Type="N2_ConnectError"
   6067 Type="N2_ReadError"
   1681 Type="N2_ReadProbeError"
    751 Type="Net2SlowTaskTrace"
     22 Type="TooManyConnectionsClosed"
     18 Type="RemovedDeadBackupLayerStatus"
     12 Type="LoadBalanceTooLongEndpoint"
      4 Type="LoadBalanceTooLong"
      4 Type="FetchPast"
      2 Type="SlowTask"
      1 Type="StatFailed"

NODE 2

grep -E 'Severity="(20|30|40|50)"' trace*.xml | grep -Eo 'Type="[^"]*?"' | sort | uniq -c | sort -rn
grep: trace.10.0.0.182.26581.1741965497.oLKrDv.3.4489.xml: binary file matches
 134101 Type="SlowSSLoopx100"
 117917 Type="N2_ConnectError"
   8309 Type="FetchKeysTooLong"
   6045 Type="N2_ReadError"
   1712 Type="N2_ReadProbeError"
    691 Type="Net2SlowTaskTrace"
     78 Type="StatFailed"
     33 Type="RemovedDeadBackupLayerStatus"
     22 Type="TooManyConnectionsClosed"
     21 Type="FetchPast"
      3 Type="LoadBalanceTooLongEndpoint"
      1 Type="Rollback"
      1 Type="N2_WriteError"
      1 Type="LoadBalanceTooLong"
      1 Type="IncorrectClusterFileContents"

NODE 3

grep -E 'Severity="(20|30|40|50)"' trace*.xml | grep -Eo 'Type="[^"]*?"' | sort | uniq -c | sort -rn
 160083 Type="SlowSSLoopx100"
  22637 Type="N2_ConnectError"
   8674 Type="FetchKeysTooLong"
   5876 Type="N2_ReadError"
   1215 Type="N2_ReadProbeError"
    669 Type="Net2SlowTaskTrace"
     20 Type="TooManyConnectionsClosed"
     18 Type="RemovedDeadBackupLayerStatus"
      7 Type="SlowTask"
      6 Type="LoadBalanceTooLongEndpoint"
      2 Type="LoadBalanceTooLong"

NODE 4

grep -E 'Severity="(20|30|40|50)"' trace*.xml | grep -Eo 'Type="[^"]*?"' | sort | uniq -c | sort -rn
 150681 Type="SlowSSLoopx100"
  10735 Type="N2_ReadError"
   6849 Type="N2_ConnectError"
   4300 Type="FetchKeysTooLong"
   1272 Type="N2_ReadProbeError"
    706 Type="Net2SlowTaskTrace"
     36 Type="TooManyConnectionsClosed"
     30 Type="RemovedDeadBackupLayerStatus"
     10 Type="StatFailed"
      1 Type="SlowTask"
      1 Type="FailureMonitorClientSlow"

NODE 5

grep -E 'Severity="(20|30|40|50)"' trace*.xml | grep -Eo 'Type="[^"]*?"' | sort | uniq -c | sort -rn
grep: trace.10.0.0.185.3130.1664879586.1vGWAN.0.1.xml: binary file matches
grep: trace.10.0.0.185.3183.1664880237.ULnrJC.0.1.xml: binary file matches
grep: trace.10.0.0.185.63838.1613474744.2MkLDK.3.5774.xml: binary file matches
 146002 Type="SlowSSLoopx100"
  26800 Type="N2_ConnectError"
   7743 Type="N2_ReadError"
   1433 Type="N2_ReadProbeError"
    711 Type="Net2SlowTaskTrace"
    275 Type="FetchKeysTooLong"
     30 Type="RemovedDeadBackupLayerStatus"
     15 Type="StatFailed"
      2 Type="SlowTask"
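
One caveat on the counts above: the "binary file matches" lines suggest that grep treated some trace files as binary, in which case matching lines from those files may be missing from the totals. Below is a minimal sketch of a variant that forces text mode with -a. The function name count_trace_types and its directory argument are hypothetical, and it assumes a grep that supports -a and -h.

```shell
# Hypothetical helper: count trace events by Type for severity 20 and up.
# -a treats every file as text, so files containing stray binary bytes
# are counted instead of being reported as "binary file matches".
# -h drops the file name prefix from the matched lines.
count_trace_types() {
    dir=${1:-.}
    grep -ahE 'Severity="(20|30|40|50)"' "$dir"/trace*.xml \
        | grep -aoE 'Type="[^"]*"' \
        | sort | uniq -c | sort -rn
}
```

Calling count_trace_types with the trace log directory of a node as its argument should produce the same kind of table as above, including events from the files that were previously skipped.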

I did notice the N2_ConnectError events. I also noticed that once the servers came out of the reboot loop, the Broadcom NetXtreme-E cards had trouble keeping up with the volume of traffic between the nodes: on a few of the nodes they dropped out or were reset half a dozen times, which I think might have made matters worse.
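
To check whether the connect errors line up with the NIC resets, one option is to bucket the N2_ConnectError events per hour and compare the busy hours against the kernel log. This is only a sketch: the function name connect_errors_per_hour is hypothetical, and it assumes each trace event carries a Time attribute in epoch seconds, which I have not verified for every file.

```shell
# Hypothetical helper: count N2_ConnectError events per hour.
# Assumption: each event has a Time="<epoch seconds>" attribute.
# Output: one line per hour, "<count> <epoch second at start of hour>".
connect_errors_per_hour() {
    dir=${1:-.}
    grep -ah 'Type="N2_ConnectError"' "$dir"/trace*.xml \
        | sed -n 's/.* Time="\([0-9]*\)[^"]*".*/\1/p' \
        | awk '{ printf "%d\n", $1 - ($1 % 3600) }' \
        | sort -n | uniq -c
}
```

Hours with an unusually high count can then be compared against the times at which the network cards dropped out or were reset.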

Any pointers as to what might be wrong or what to do next would be greatly appreciated.