OK, the dashboard was helpful and I’ve got a more detailed timeline of what was happening. It turns out we had 3 recoveries from this single pod deletion, but some things about it are still puzzling. Here’s the timeline, along with some logs I found:
- 00:35:37.422 - I manually deleted one of the log process pods
- 00:35:37.571 - other FDB machines start getting N2_ConnectError and N2_ReadError with the machine I killed as the peer
- 00:35:39 - First downtime period starts - our load testing service stops being able to read / commit transactions
- 00:35:39.250 - The operator starts to recreate the pod
- 00:35:42 - First downtime period stops - our load testing service is again able to make transactions.
- 00:35:42.550 - We get our first WaitFailureClient from the cluster controller, accompanied by a ClusterRecoveryRetrying. What’s a bit strange about this log is that there is no earlier log indicating a recovery has started; the first recovery I see only begins after this line.
{
    "DateTime": "2025-09-18T16:35:42Z",
    "Error": "tlog_failed",
    "ErrorCode": "1205",
    "ErrorDescription": "Cluster recovery terminating because a TLog failed",
    "ID": "3459f5d1f72f910d",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.208.42:4500",
    "Roles": "CC",
    "Severity": "20",
    "ThreadID": "16997965542817237771",
    "Time": "1758213342.550678",
    "Type": "ClusterRecoveryRetrying"
}
- 00:35:42.587 - MasterRecoveryState with reading_coordinated_state - I think this is when the first recovery starts
- 00:35:42.598 - We get a “Not enough physical servers available” error. It shows up under several trace event types (RecruitStorageNotAvailable, CCWDB, and ClusterRecoveryRetrying) and keeps recurring until the cluster fully recovers ~20 seconds later
{
    "DateTime": "2025-09-18T16:35:44Z",
    "Error": "no_more_servers",
    "ErrorCode": "1008",
    "ErrorDescription": "Not enough physical servers available",
    "ID": "35163ad87d48095f",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.208.42:4500",
    "Roles": "CC",
    "Severity": "20",
    "ThreadID": "16997965542817237771",
    "Time": "1758213344.776349",
    "Type": "ClusterRecoveryRetrying"
}
- 00:35:42.735 - MasterRecoveryDuration - shows the recovery finishes in 148ms
- 00:35:44.620 - RestartingTxnSubsystem - with stage AwaitCommit. Maybe I am reading this line wrong, but we just did a recovery, so why is the transaction subsystem restarting?
- 00:35:44.812 - Another recovery starts
- 00:35:45.212 - Not sure if this is an error, but it was in the Splunk dashboard:
{
    "DateTime": "2025-09-18T16:35:45Z",
    "Error": "operation_failed",
    "ErrorCode": "1000",
    "ErrorDescription": "Operation failed",
    "GoodRecruitmentTimeReady": "0",
    "ID": "41168f4fb8efbfb1",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.120.245:4500",
    "Roles": "CC,DD",
    "Severity": "10",
    "ThreadID": "17276620476305664854",
    "Time": "1758213345.212184",
    "Type": "RecruitFromConfigurationRetry"
}
- 00:35:46 - Second period of downtime starts - this one is 14 seconds
- 00:35:46.915 - The second recovery finishes. This one takes 2 seconds
- 00:35:49.705 - Sometime around here, the recreated pod starts up and rejoins the cluster. I see the Net2Starting and other log lines from the node. I see the WorkerRegister log at 35:53.369, after the next recovery starts
- 00:35:51.258 - There seems to be another recovery here – several different nodes report a MasterRecoveryState, but I never get a MasterRecoveryDuration log from this (I’m going to double-check the recovery count with the sketch after this timeline)
- 00:36:00 - Second period of downtime ends
- 00:36:00.191 - The third recovery finishes. This one takes 357ms
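To double-check how many distinct recoveries there really were (and to catch ones like the 00:35:51 attempt that never log a MasterRecoveryDuration), I want to pull the relevant events straight out of the trace files instead of the dashboard. Below is a minimal sketch of what I have in mind, assuming JSON-format trace logs; Type, Time, and Machine match the excerpts above, but the Status and RecoveryDuration fields are my guess at the per-event detail and the glob path is a placeholder.

import glob
import json

# Trace event types that mark failure detection and recovery progress
# (the Type values from the log excerpts above).
INTERESTING = {
    "WaitFailureClient",
    "ClusterRecoveryRetrying",
    "MasterRecoveryState",
    "MasterRecoveryDuration",
}

def load_events(pattern):
    """Read JSON-format FDB trace files (one JSON object per line) and
    return the interesting events sorted by their Time field."""
    events = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    ev = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or garbled lines
                if ev.get("Type") in INTERESTING:
                    events.append(ev)
    return sorted(events, key=lambda ev: float(ev["Time"]))

# The glob pattern is a placeholder; point it at wherever the trace files land.
for ev in load_events("traces/trace.*.json"):
    # Status / RecoveryDuration only exist on some event types, hence .get().
    print(
        f'{float(ev["Time"]):.3f}  {ev.get("Machine", "?"):<20}  {ev["Type"]:<24}'
        f'  {ev.get("Status", "")} {ev.get("RecoveryDuration", "")}'
    )

If each recovery shows up as a run of MasterRecoveryState events ending in a MasterRecoveryDuration, grouping those runs by ID should settle whether the 00:35:51 attempt was abandoned or simply completed under a different ID.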
In total, there was 17 seconds of downtime (the deltas are worked out in the snippet after this list):
- 3 seconds initially, while the log process was dead but the cluster apparently hadn’t yet detected that the pod was gone.
- 14 seconds around the second recovery. I’m not sure why this one was so long
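To keep the numbers straight, here is the arithmetic on the gaps, using only the timestamps from the timeline above (a throwaway snippet, values copied by hand):

# Key timestamps from the timeline, as seconds after 00:35:00.
pod_deleted  = 37.422          # manual deletion of the log process pod
wait_failure = 42.550          # first WaitFailureClient from the cluster controller
downtime_1   = (39.0, 42.0)    # load tester unable to read/commit
downtime_2   = (46.0, 60.0)

print("detection lag:", round(wait_failure - pod_deleted, 3), "s")  # ~5.1 s vs TLOG_TIMEOUT = 0.4
print("downtime 1:", downtime_1[1] - downtime_1[0], "s")            # 3 s
print("downtime 2:", downtime_2[1] - downtime_2[0], "s")            # 14 s
print("total:", (downtime_1[1] - downtime_1[0]) + (downtime_2[1] - downtime_2[0]), "s")  # 17 s

That ~5.1 second detection lag is what the first question below is about.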
Some questions that I have from my investigations:
- I notice that there is a knob for TLOG_TIMEOUT and the default value is 0.4. So why did it take almost 5 seconds to get the WaitFailureClient for the killed log process?
- Why are there so many recoveries, and maybe a couple of failed recoveries?
- Does the “Not enough physical servers available” error indicate a configuration issue in our cluster topology? (I’m going to check the process classes with the sketch below.)
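For that last question, my plan is to count how many processes of each class the cluster actually has: if the pod I deleted was the only process eligible for that TLog slot, then no_more_servers during recruitment would be expected rather than a sign of misconfiguration. A rough sketch of the check, with several assumptions baked in: fdbcli on the PATH, the default cluster file location, and status json exposing class_type and a roles list per process (which is what I remember, but I haven’t verified it against our version):

import json
import subprocess
from collections import Counter

# Default cluster file location on a stock install; the operator-managed pods keep it elsewhere.
raw = subprocess.run(
    ["fdbcli", "-C", "/etc/foundationdb/fdb.cluster", "--exec", "status json"],
    capture_output=True, text=True, check=True,
).stdout

# fdbcli may print a short preamble before the JSON, so parse from the first brace.
status = json.loads(raw[raw.index("{"):])

classes, roles = Counter(), Counter()
for proc in status["cluster"]["processes"].values():
    classes[proc.get("class_type", "unset")] += 1
    for r in proc.get("roles", []):
        roles[r.get("role", "?")] += 1

print("process classes:", dict(classes))  # e.g. how many log vs storage vs stateless processes
print("recruited roles:", dict(roles))    # how many TLogs are actually running right now

If the class counts show only as many log-class processes as the configured TLog count, the recruitment errors during the outage are probably just the cluster waiting for the replacement pod rather than a topology problem.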
I will continue to dig at this.