Cluster stuck in recovery

We recently had a ~40 minute outage in an FDB cluster in which recovery was not able to complete.
We are running on Kubernetes, using the fdb-kubernetes-operator, with the Kubernetes service IP as the public IP of each FDB pod. At the time of the incident, one of our log pods was stuck in a terminating state. While that log pod was terminating, a stateless pod holding the DD, MP, MS, and RK roles was terminated, which triggered a recovery.
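
For context, "using the service IP as the public IP" means that each fdbserver process advertises its Kubernetes Service's clusterIP rather than the pod IP. Below is a minimal sketch of how this can be verified from outside the cluster, assuming the Python kubernetes client, fdbcli on the PATH, and a namespace of default (the namespace is an assumption):

```python
# Quick check from outside the cluster that the fdbserver processes advertise
# Kubernetes Service cluster IPs rather than pod IPs.
import json
import subprocess

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Collect the clusterIPs of all Services in the namespace the FDB pods run in.
service_ips = {
    svc.spec.cluster_ip
    for svc in core.list_namespaced_service(namespace="default").items
    if svc.spec.cluster_ip not in (None, "None")
}

status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))
for proc in status["cluster"]["processes"].values():
    ip = proc["address"].split(":")[0]
    kind = "service IP" if ip in service_ips else "pod IP"
    print(f"{proc['address']:<22} -> {kind}")
```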

We are running FDB version 6.2.28.

This is the sequence of recovery steps from the logs (a sketch of how these events can be pulled out of the trace files follows the table):
"time [UTC]",statusCode,status
"2/23/2021, 8:33:48.007 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.354 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.357 AM",1,"locking_coordinated_state"
"2/23/2021, 8:37:19.363 AM",3,"reading_transaction_system_state"
"2/23/2021, 8:37:19.372 AM",7,"recruiting_transaction_servers"
"2/23/2021, 8:37:19.373 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.082 AM",0,"reading_coordinated_state"
"2/23/2021, 9:16:14.089 AM",1,"locking_coordinated_state"
"2/23/2021, 9:16:14.098 AM",3,"reading_transaction_system_state"
"2/23/2021, 9:16:14.160 AM",7,"recruiting_transaction_servers"
"2/23/2021, 9:16:14.162 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.210 AM",9,"recovery_transaction"
"2/23/2021, 9:16:14.311 AM",10,"writing_coordinated_state"
"2/23/2021, 9:16:14.318 AM",12,"all_logs_recruited"
"2/23/2021, 9:16:14.318 AM",11,"accepting_commits"
"2/23/2021, 9:16:21.688 AM",14,"fully_recovered"
"2/23/2021, 9:18:24.598 AM",14,"fully_recovered"
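
In case it is useful to others, here is a minimal sketch of how such a timeline can be extracted from the trace files, assuming the default XML trace format and that the trace directory below matches your deployment:

```python
# Minimal sketch: pull the recovery timeline out of FDB XML trace files.
# MasterRecoveryState events carry Time (epoch seconds), StatusCode and Status.
import glob
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

TRACE_GLOB = "/var/log/foundationdb/trace.*.xml"  # adjust to your trace directory

events = []
for path in glob.glob(TRACE_GLOB):
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        continue  # a trace file that is still being written may not parse
    for event in root.iter("Event"):
        if event.get("Type") == "MasterRecoveryState":
            events.append((float(event.get("Time")),
                           event.get("StatusCode"),
                           event.get("Status")))

for ts, code, status in sorted(events):
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(f"{when.isoformat(timespec='milliseconds')},{code},{status}")
```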

The terminating log pod finally went away at around 9:13 and came back again at 9:15, and it seems like recovery was able to proceed normally after that.

We have filed an issue against the operator to keep routing new traffic to the pod through the service even while the pod is in a terminating state:
The ServiceIP approach can lead to grey failures · Issue #520 · FoundationDB/fdb-kubernetes-operator · GitHub.

Is it likely that the grey failure (partial connectivity) of the log pod caused recovery to get stuck, because the pod was considered alive enough to be part of the recovery but was not able to initialize the new log role?
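
For reference, the asymmetry can be confirmed from outside the cluster with something like the following rough sketch, which tries to open a TCP connection to every process that status json reports (assuming fdbcli is on the PATH and can reach the cluster):

```python
# Rough sketch: confirm the asymmetry by trying to reach, from the outside,
# every process that status json reports. A process that shows up in status
# (because it can dial out) but refuses inbound connections is the grey failure.
import json
import socket
import subprocess

status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))

for proc in status["cluster"]["processes"].values():
    host, port, *_ = proc["address"].split(":")  # tolerate a trailing ":tls" marker
    try:
        with socket.create_connection((host, int(port)), timeout=2):
            reachable = True
    except OSError:
        reachable = False
    roles = ",".join(r["role"] for r in proc.get("roles", []))
    print(f"{proc['address']:<22} reachable={reachable} roles={roles}")
```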

I am happy to paste more trace data, just not sure what events are most useful for investigation.

Yes, the situation described in that issue, where the pod can make outbound connections but not receive inbound connections, could very well lead to an extended recovery.

Thanks for the confirmation. We will focus our efforts on getting the fix for that issue merged and verified, then.

This can be reproduced with the fdb-kubernetes-operator by taking the following steps:

  • Create a new FoundationDB cluster with 9 storage, 6 log, and 3 stateless pods, using service IPs.
  • Wait for the cluster to come up healthy, then delete the fdb-kubernetes-operator so that it cannot reconcile away the manual change in the next step.
  • Edit the service in front of the log-6 pod so that it points to a non-existent log-666 pod (a sketch of this step follows the list). This sets up an asymmetric network partition: log-6 can talk to the rest of the cluster, but the rest of the cluster cannot talk to log-6.
  • Wait 5 minutes; the cluster still appears healthy. Note that this waiting step seems to be required to reproduce the problem.
  • Kill the stateless pod with the master role.
  • The cluster starts recovery, but is stuck at initializing_transaction_servers.
  • The cluster is unavailable indefinitely.
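
For the service-editing step above, here is a rough sketch of how it can be scripted, assuming the Python kubernetes client and that the operator-created Service selects the pod via an fdb-instance-id label (the Service name sample-cluster-log-6 and the label key/value are assumptions; kubectl edit on the real Service works just as well):

```python
# Rough sketch of the "break the selector" step: repoint the Service in front of
# log-6 at a pod name that does not exist, so nothing backs the service IP.
# The Service name and the label key/value are assumptions; check the objects the
# operator actually created in your cluster.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

core.patch_namespaced_service(
    name="sample-cluster-log-6",   # the Service that fronts the log-6 pod
    namespace="default",
    body={"spec": {"selector": {"fdb-instance-id": "log-666"}}},
)
```

Healing the partition in the steps below is just patching the original selector value back.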

We can now proceed to either heal the asymmetric network partition or delete the log-6 pod:

  • If we heal the asymmetric network partition, the cluster becomes available again immediately.
  • If we remove the log-6 pod, the cluster stays stuck at the same recovery step for around 30 minutes, and then becomes available again.

In the last scenario, where we go from an asymmetric network partition to a complete partition, the FDB cluster can be unavailable for a long time. Is it possible to configure a shorter timeout so that the recovery process restarts if it takes too long?
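
We do not know of a built-in knob for this, hence the question. As a stopgap, here is a rough sketch of an external watchdog that flags a recovery sitting in one state for too long, assuming fdbcli access and that status json stays reachable during the recovery:

```python
# Rough sketch of an external watchdog: alert if recovery sits in the same state
# for too long. The 5 minute threshold is arbitrary.
import json
import subprocess
import time

STUCK_AFTER_SECONDS = 300
last_state, since = None, time.time()

while True:
    try:
        out = subprocess.check_output(["fdbcli", "--exec", "status json"], timeout=30)
        state = json.loads(out).get("cluster", {}).get("recovery_state", {}).get("name", "unknown")
    except (subprocess.SubprocessError, ValueError):
        state = "status_unavailable"
    if state != last_state:
        last_state, since = state, time.time()
    elif state != "fully_recovered" and time.time() - since > STUCK_AFTER_SECONDS:
        print(f"recovery stuck in {state} for {int(time.time() - since)}s")  # hook an alert in here
    time.sleep(15)
```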