Are short outages when you lose a coordinator normal?

Hey all, we have a cluster in Kubernetes that is in a three_data_hall configuration with 9 cluster coordinators. If one of the coordinator processes is unexpectedly killed (e.g. its pod is deleted), clients of the cluster seem to be unable to start or commit transactions for ~5-10 seconds while the FDB Kubernetes operator selects new coordinators.

This does not happen when we ask the operator to exclude a process and reselect the coordinators.

Is this normal? How do we avoid this?

Part of this is normal. When the pod is deleted and the operator notices that the process is no longer reporting, it will select a new set of coordinators, which causes a recovery. If the process is down long enough and automatic replacements are enabled (they are by default), the operator will "replace" the bad pod by creating a new one and excluding the processes that were running on the old pod. The exclusion can cause another recovery (depending on the process type). Seeing recoveries of 5-10s is not normal, though; usually those should be faster. The recovery time also depends a bit on your setup and the size of the cluster.
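If you want to double-check whether automatic replacements are enabled, the setting lives in the FoundationDBCluster spec. A minimal sketch (v1beta2 field names as far as I remember them, so please verify against the operator version you run; the metadata name is just a placeholder taken from your LogGroup):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: foundationdb-cluster-main   # placeholder
spec:
  automationOptions:
    replacements:
      enabled: true   # default: failed process groups get replaced automatically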

I think the interesting bit is that you’re not seeing this when you do a replacement of the pod. I wonder if you have enough “standby” processes (or rather pods) that can be used. By default, coordinators will preferably run on log processes. Assuming the pod that gets deleted is a log process and you don’t have enough standby pods, it’s possible that you’re seeing not one recovery but two (or more), e.g. one recovery for the coordinator change and another when the process comes back again (not sure if that would correlate with the 5-10s of scheduling time).

Debugging recoveries and finding the root cause of a long recovery can be a bit tricky. This document describes the internals of the recovery process: foundationdb/design/recovery-internals.md at main · apple/foundationdb · GitHub, and this dashboard might be helpful: foundationdb/contrib/observability_splunk_dashboard/recovery.xml at main · apple/foundationdb · GitHub (at least to get an idea of which trace events are important to look at).

Thanks for the tips. I went back and tried a few other things, and it’s not actually the loss of the coordinator that causes this; it’s the loss of a log role that then triggers the recovery.

We had 9 logs, 3 in each data hall, so I thought there might not be enough slack in the log processes to handle a failure. I increased the log roles (using the operator’s RoleCounts configuration) to 15, but there were still a few seconds of downtime when I lost a log role.

Maybe what I need to do is have a standby log process that doesn’t have a log role assigned to it. I think I can achieve this by setting ProcessCounts.log to something higher than RoleCounts.logs.
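Something like this sketch is what I have in mind, just the relevant fields with placeholder numbers (field names are from the operator’s v1beta2 API as far as I can tell):

spec:
  databaseConfiguration:
    logs: 9      # RoleCounts: how many tlog roles the cluster recruits
  processCounts:
    log: 12      # ProcessCounts: how many pods of process class "log" exist;
                 # the 3 extra processes carry no tlog role and act as standbys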

OK, the dashboard was helpful and I’ve got a more detailed timeline of what was happening. It turns out that we had 3 recoveries in one instance, but there are still some things that are puzzling about it. Here’s a timeline, and some logs that I found:

  1. 00:35:37.422 - I manually deleted one of the log process pods
  2. 00:35:37.571 - other FDB machines start getting N2_ConnectError and N2_ReadError with the machine I killed as the peer
  3. 00:35:39 - First downtime period starts - our load testing service stops being able to read / commit transactions
  4. 00:35:39.250 - The operator starts to recreate the pod
  5. 00:35:42 - First downtime period stops - our load testing service is again able to make transactions.
  6. 00:35:42.550 - We get our first WaitFailureClient from the cluster controller. This is accompanied by a ClusterRecoveryRetrying. What’s a bit strange about this log is that there is no log indicating a recovery has started. The first recovery I see only happens after this log line.

{
    "DateTime": "2025-09-18T16:35:42Z",
    "Error": "tlog_failed",
    "ErrorCode": "1205",
    "ErrorDescription": "Cluster recovery terminating because a TLog failed",
    "ID": "3459f5d1f72f910d",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.208.42:4500",
    "Roles": "CC",
    "Severity": "20",
    "ThreadID": "16997965542817237771",
    "Time": "1758213342.550678",
    "Type": "ClusterRecoveryRetrying"
}
  7. 00:35:42.587 - MasterRecoveryState with reading_coordinated_state - I think this is when the first recovery starts
  8. 00:35:42.598 - We get a “Not enough physical servers available”. This occurs with different types of trace logs (RecruitStorageNotAvailable, CCWDB, and ClusterRecoveryRetrying), and occurs quite a lot until the cluster fully recovers ~20 seconds later

{
    "DateTime": "2025-09-18T16:35:44Z",
    "Error": "no_more_servers",
    "ErrorCode": "1008",
    "ErrorDescription": "Not enough physical servers available",
    "ID": "35163ad87d48095f",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.208.42:4500",
    "Roles": "CC",
    "Severity": "20",
    "ThreadID": "16997965542817237771",
    "Time": "1758213344.776349",
    "Type": "ClusterRecoveryRetrying"
}
  9. 00:35:42.735 - MasterRecoveryDuration - shows the recovery finished in 148ms
  10. 00:35:44.620 - RestartingTxnSubsystem - with stage AwaitCommit. Maybe I am reading this line wrong, but we just did a recovery, so why is the transaction subsystem restarting?
  11. 00:35:44.812 - Another recovery starts
  12. 00:35:45.212 - Not sure if this is an error, but it was in the Splunk dashboard:
{
    "DateTime": "2025-09-18T16:35:45Z",
    "Error": "operation_failed",
    "ErrorCode": "1000",
    "ErrorDescription": "Operation failed",
    "GoodRecruitmentTimeReady": "0",
    "ID": "41168f4fb8efbfb1",
    "LogGroup": "foundationdb-cluster-main",
    "Machine": "10.0.120.245:4500",
    "Roles": "CC,DD",
    "Severity": "10",
    "ThreadID": "17276620476305664854",
    "Time": "1758213345.212184",
    "Type": "RecruitFromConfigurationRetry"
}
  13. 00:35:46 - Second period of downtime starts - this one is 14 seconds
  14. 00:35:46.915 - The second recovery finishes. This one takes 2 seconds
  15. 00:35:49.705 - Sometime around here, the recreated pod starts up and rejoins the cluster. I see the Net2Starting and other log lines from the node. I see the WorkerRegister log at 35:53.369, after the next recovery starts
  16. 00:35:51.258 - There seems to be another recovery here: several different nodes report a MasterRecoveryState, but I never get a MasterRecoveryDuration log from it
  17. 00:36:00 - Second period of downtime ends
  18. 00:36:00.191 - The third recovery finishes. This one takes 357ms

In total, there were 17 seconds of downtime:

  1. 3 seconds initially, when the log process was dead but the cluster apparently had not yet detected the failure.
  2. 14 seconds around the second recovery. I’m not sure why this one was so long.

Some questions that I have from my investigations:

  1. I notice that there is a knob for TLOG_TIMEOUT and the default value is 0.4. So why did it take almost 5 seconds to get the WaitFailureClient for the killed log process?
  2. Why are there so many recoveries, and maybe a couple of failed recoveries?
  3. Does the “Not enough physical servers available” error indicate a configuration issue in our cluster topology?

I will continue to dig at this.

I had another hypothesis that I tested regarding the distribution of log servers. We are in three data hall, which requires at least four log roles (two each in two of the data halls). I thought the recoveries could be minimized if we ensured that we had enough standby log processes (previously, with 9 log roles, we were using every log process).

So I configured our cluster to have:

  • The default four log roles
  • 12 log processes total (with a topology spread constraint resulting in 4 in each zone, so at least 2 standbys in each zone).

When I manually killed one of the log roles, I ended up with only one recovery, which is better. But our client that was hitting the database still experienced ~5 seconds of downtime. I am wondering if this may be due to in-flight transactions timing out after 5 seconds rather than an actual 5-second outage on the server side.
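For reference, the configuration I tested looks roughly like this (a simplified sketch; the pod labels in the selector are the ones I believe the operator sets, so verify them for your operator version):

spec:
  databaseConfiguration:
    redundancy_mode: three_data_hall
    logs: 4                          # the default four log roles
  processCounts:
    log: 12                          # 12 log processes in total
  processes:
    log:
      podTemplate:
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  # assumed operator-managed labels, double-check the exact keys
                  foundationdb.org/fdb-cluster-name: foundationdb-cluster-main
                  foundationdb.org/fdb-process-class: log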

After some more testing, I found another reason why we were getting two recoveries. Just to recap:

  • We were deployed in three data hall with 9 log roles requested. Not processes: we were actually requesting 9 log roles via the databaseConfiguration field in the operator, which gave us exactly 9 log processes with no standbys.
    • If a log died, the cluster would recover once, rebuilding with the 8 remaining logs, then recover again when the 9th log process pod was recreated.
    • We fixed this by requesting 9 log processes and only 4 log roles, so we have 5 standbys (3 in the AZ that has no log roles, and 1 in each of the two AZs that do; I think this is the only way to do it with k8s topology constraints).
  • With 9 log processes, the operator will choose all of the log processes as coordinators, since it prefers the log class over the storage class.
    • So if you lose a log role, you also lose a coordinator.
    • The cluster does a recovery to recruit a new log role from a standby.
    • The operator then chooses new coordinators, which triggers a second recovery.
    • I believe the solution to this is to have dedicated coordinator process classes, or to prefer storage processes for coordinators (a sketch of what I mean is below). Preferring storage can be disruptive during an upgrade; I don’t think it will break an upgrade, but it will cause more recoveries than necessary.
      • @johscheuer since log processes are preferred as coordinators by default, I think this will cause most replacements of a log process to trigger more than one recovery

Solving these two issues should get rid of the multiple recoveries in the case of a lost log role.
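For the coordinator part, the direction I am considering looks roughly like this. This is a sketch, not a tested config: it assumes the operator version in use supports a dedicated coordinator process class and the coordinatorSelection field, so treat those field names as assumptions to verify:

spec:
  databaseConfiguration:
    logs: 4
  processCounts:
    log: 9               # more log processes than log roles, so we keep standbys
    coordinator: 9       # dedicated coordinator-only processes, 3 per data hall
  coordinatorSelection:
    # restrict coordinator recruitment to the dedicated class, so losing a
    # log process no longer also means losing a coordinator
    - processClass: coordinator
      priority: 10
    # alternative: prefer storage over log processes instead of a dedicated class
    # - processClass: storage
    #   priority: 10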

So the last issue here is why recoveries seem to take about 5 seconds from a client’s perspective when the metrics show them completing in ~200ms. I think this has to do with the failure detection delay. It seems there is a knob, FAILURE_DETECTION_DELAY, set to 4 seconds, which is how long it takes before a role is detected as failed.
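As a footnote, knobs like this can be passed to fdbserver through the operator’s customParameters if anyone wants to experiment. The value below is purely illustrative (the default appears to be 4 seconds, per the above), and I would be careful changing failure-detection knobs on a production cluster; verify the knob name against your FDB version:

spec:
  processes:
    general:
      customParameters:
        # rendered as a --knob_... flag on the fdbserver command line
        - "knob_failure_detection_delay=2.0"   # illustrative value only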