OK, the dashboard was helpful and I’ve got a more detailed timeline of what was happening. It turns out that we had 3 recoveries in this one incident, but there are still some things about it that are puzzling. Here’s a timeline, and some logs that I found:
- 00:35:37.422 - I manually deleted one of the log process pods
- 00:35:37.571 - other FDB machines start getting N2_ConnectError and N2_ReadError with the machine I killed as the peer
- 00:35:39 - First downtime period starts - our load testing service stops being able to read / commit transactions
- 00:35:39.250 - The operator starts to recreate the pod
- 00:35:42 - First downtime period ends - our load testing service is again able to make transactions.
- 00:35:42.550 - We get our first WaitFailureClient from the cluster controller. This is accompanied by a ClusterRecoveryRetrying. What’s a bit strange about this log is that there is no log indicating a recovery has started. The first recovery I see only happens after this log line.
{
"DateTime": "2025-09-18T16:35:42Z",
"Error": "tlog_failed",
"ErrorCode": "1205",
"ErrorDescription": "Cluster recovery terminating because a TLog failed",
"ID": "3459f5d1f72f910d",
"LogGroup": "foundationdb-cluster-main",
"Machine": "10.0.208.42:4500",
"Roles": "CC",
"Severity": "20",
"ThreadID": "16997965542817237771",
"Time": "1758213342.550678",
"Type": "ClusterRecoveryRetrying"
}
- 00:35:42.587 - MasterRecoveryState with reading_coordinated_state - I think this is when the first recovery starts
- 00:35:42.598 - We get a “Not enough physical servers available”. This shows up under several different trace event types (RecruitStorageNotAvailable, CCWDB, and ClusterRecoveryRetrying), and it occurs quite a lot until the cluster fully recovers ~20 seconds later
{
"DateTime": "2025-09-18T16:35:44Z",
"Error": "no_more_servers",
"ErrorCode": "1008",
"ErrorDescription": "Not enough physical servers available",
"ID": "35163ad87d48095f",
"LogGroup": "foundationdb-cluster-main",
"Machine": "10.0.208.42:4500",
"Roles": "CC",
"Severity": "20",
"ThreadID": "16997965542817237771",
"Time": "1758213344.776349",
"Type": "ClusterRecoveryRetrying"
}
- 00:35:42.735 - MasterRecoveryDuration - shows the recovery finishes in 148ms
- 00:35:44.620 - RestartingTxnSubsystem with stage AwaitCommit. Maybe I am reading this line wrong, but we just did a recovery, so why is the transaction subsystem restarting?
- 00:35:44.812 - Another recovery starts
- 00:35:45.212 - Not sure if this is an error, but it was in the Splunk dashboard:
{
"DateTime": "2025-09-18T16:35:45Z",
"Error": "operation_failed",
"ErrorCode": "1000",
"ErrorDescription": "Operation failed",
"GoodRecruitmentTimeReady": "0",
"ID": "41168f4fb8efbfb1",
"LogGroup": "foundationdb-cluster-main",
"Machine": "10.0.120.245:4500",
"Roles": "CC,DD",
"Severity": "10",
"ThreadID": "17276620476305664854",
"Time": "1758213345.212184",
"Type": "RecruitFromConfigurationRetry"
}
- 00:35:46 - Second period of downtime starts - this one is 14 seconds
- 00:35:46.915 - The second recovery finishes. This one takes 2 seconds
- 00:35:49.705 - Sometime around here, the recreated pod starts up and rejoins the cluster. I see Net2Starting and other log lines from the node. I see the WorkerRegister log at 35:53.369, after the next recovery starts
- 00:35:51.258 - There seems to be another recovery here – several different nodes report a MasterRecoveryState, but I never get a MasterRecoveryDuration log from this (one way to pull these events out of the trace logs and pair them up is sketched just after this timeline)
- 00:36:00 - Second period of downtime ends
- 00:36:00.191 - The third recovery finishes. This one takes 357ms
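One way to answer “how many recoveries actually ran, and which ones finished” is to pull the recovery-related events out of the JSON trace logs and group them. Here’s a minimal sketch of that, assuming the logs are one JSON object per line with the fields shown in the events I pasted above (Type, Time, Machine, ID, Error), and assuming each recovery attempt logs its MasterRecoveryState events under its own ID; the event names in the filter are just the ones from this timeline, and the file paths are whatever your trace logs happen to be:

```python
import json
import sys
from collections import defaultdict

# Trace event types seen in the timeline above; adjust to taste.
RECOVERY_EVENTS = {
    "MasterRecoveryState",
    "MasterRecoveryDuration",
    "ClusterRecoveryRetrying",
    "RecruitStorageNotAvailable",
    "RecruitFromConfigurationRetry",
}

def load_events(paths):
    """Yield one dict per JSON trace line, keeping only recovery-related events."""
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip anything that isn't a JSON trace line
                if event.get("Type") in RECOVERY_EVENTS:
                    yield event

def main(paths):
    by_type = defaultdict(list)
    for event in load_events(paths):
        by_type[event["Type"]].append(event)

    # Recovery starts vs. finishes, in time order. Assumption: each recovery
    # attempt logs MasterRecoveryState/MasterRecoveryDuration under its own ID,
    # so IDs with no matching duration are candidates for abandoned recoveries.
    for event_type in ("MasterRecoveryState", "MasterRecoveryDuration"):
        events = sorted(by_type[event_type], key=lambda e: float(e["Time"]))
        print(f"== {event_type} ({len(events)} events) ==")
        for e in events:
            print(f'{float(e["Time"]):.3f}  {e["Machine"]:>20}  ID={e["ID"]}')

    # How often "Not enough physical servers available" was hitting, per second.
    per_second = defaultdict(int)
    for e in by_type["ClusterRecoveryRetrying"]:
        if e.get("Error") == "no_more_servers":
            per_second[int(float(e["Time"]))] += 1
    print("== no_more_servers retries per second ==")
    for sec in sorted(per_second):
        print(f"{sec}  {per_second[sec]}")

if __name__ == "__main__":
    main(sys.argv[1:])
```

Whichever MasterRecoveryState IDs never show a matching MasterRecoveryDuration would be the candidates for the recovery at 00:35:51 that apparently never completed.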
In total, there was 17 seconds of downtime:
- 3 seconds initially while the log process was dead; it seems the cluster hadn’t yet detected that the pod was dead during this window.
- 14 seconds around the second recovery. I’m not sure why this one was so long
Some questions that I have from my investigations:
- I notice that there is a knob for TLOG_TIMEOUT and the default value is 0.4. So why did it take almost 5 seconds to get the WaitFailureClient for the killed log process? (The sketch after these questions measures that gap directly from the trace Time fields.)
- Why are there so many recoveries, and were a couple of them failed recoveries?
- Does the “Not enough physical servers available” error indicate a configuration issue in our cluster topology?
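On the TLOG_TIMEOUT question, one thing worth pinning down precisely is the gap between the peers first logging N2_ConnectError / N2_ReadError and the cluster controller’s first WaitFailureClient. Here’s a rough sketch that measures that straight from the trace Time fields; treating WaitFailureClient as the “failure detected” marker is my assumption, and the event names are just the ones from the timeline above:

```python
import json
import sys

FIRST_SIGNAL = {"N2_ConnectError", "N2_ReadError"}  # peers noticing the dead connection
DETECTED = "WaitFailureClient"                      # assumed marker for the CC reacting

first_error = None
first_detect = None

for path in sys.argv[1:]:
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            t = float(event.get("Time", 0))
            if event.get("Type") in FIRST_SIGNAL:
                first_error = t if first_error is None else min(first_error, t)
            elif event.get("Type") == DETECTED:
                first_detect = t if first_detect is None else min(first_detect, t)

if first_error is not None and first_detect is not None:
    print(f"first connection error : {first_error:.3f}")
    print(f"first WaitFailureClient: {first_detect:.3f}")
    print(f"detection gap          : {first_detect - first_error:.3f}s "
          f"(vs the 0.4 TLOG_TIMEOUT default)")
else:
    print("didn't find both event types in the given trace files")
```

From the timestamps in the timeline, that gap comes out to roughly 5 seconds.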
I will continue to dig at this.