We recently had a ~40 minute outage in an FDB cluster, where recovery was not able to complete.
We are running on Kubernetes with the Kubernetes operator, and we use the Kubernetes service IP as the public IP of the FDB pods. At the time of the incident, a log pod was stuck in a terminating state. While it was in that state, a stateless pod with the roles (DD, MP, MS, RK) was terminated, which triggered the recovery.
We are running version 6.2.28.
This is the sequence of recovery steps from the logs:
"time [UTC]",statusCode,status
"2/23/2021, 8:33:48.007 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.354 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.357 AM",1,"locking_coordinated_state"
"2/23/2021, 8:37:19.363 AM",3,"reading_transaction_system_state"
"2/23/2021, 8:37:19.372 AM",7,"recruiting_transaction_servers"
"2/23/2021, 8:37:19.373 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.082 AM",0,"reading_coordinated_state"
"2/23/2021, 9:16:14.089 AM",1,"locking_coordinated_state"
"2/23/2021, 9:16:14.098 AM",3,"reading_transaction_system_state"
"2/23/2021, 9:16:14.160 AM",7,"recruiting_transaction_servers"
"2/23/2021, 9:16:14.162 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.210 AM",9,"recovery_transaction"
"2/23/2021, 9:16:14.311 AM",10,"writing_coordinated_state"
"2/23/2021, 9:16:14.318 AM",12,"all_logs_recruited"
"2/23/2021, 9:16:14.318 AM",11,"accepting_commits"
"2/23/2021, 9:16:21.688 AM",14,"fully_recovered"
"2/23/2021, 9:18:24.598 AM",14,"fully_recovered"
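To make the stall easier to see, here is a minimal Python sketch that parses the rows above and reports the largest gap between consecutive recovery states. It is only an illustration: the names `LOG`, `parse_rows`, and `longest_stall` are made up for this example, and the timestamp format is assumed from how the rows are printed here, not from any FDB tooling.

```python
import csv
import io
from datetime import datetime

# Rows copied from the recovery log above.
LOG = '''"time [UTC]",statusCode,status
"2/23/2021, 8:33:48.007 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.354 AM",0,"reading_coordinated_state"
"2/23/2021, 8:37:19.357 AM",1,"locking_coordinated_state"
"2/23/2021, 8:37:19.363 AM",3,"reading_transaction_system_state"
"2/23/2021, 8:37:19.372 AM",7,"recruiting_transaction_servers"
"2/23/2021, 8:37:19.373 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.082 AM",0,"reading_coordinated_state"
"2/23/2021, 9:16:14.089 AM",1,"locking_coordinated_state"
"2/23/2021, 9:16:14.098 AM",3,"reading_transaction_system_state"
"2/23/2021, 9:16:14.160 AM",7,"recruiting_transaction_servers"
"2/23/2021, 9:16:14.162 AM",8,"initializing_transaction_servers"
"2/23/2021, 9:16:14.210 AM",9,"recovery_transaction"
"2/23/2021, 9:16:14.311 AM",10,"writing_coordinated_state"
"2/23/2021, 9:16:14.318 AM",12,"all_logs_recruited"
"2/23/2021, 9:16:14.318 AM",11,"accepting_commits"
"2/23/2021, 9:16:21.688 AM",14,"fully_recovered"
"2/23/2021, 9:18:24.598 AM",14,"fully_recovered"
'''

# Assumed format of the exported timestamps (month/day/year, 12-hour clock).
TIME_FORMAT = "%m/%d/%Y, %I:%M:%S.%f %p"


def parse_rows(text):
    """Parse the CSV export into (timestamp, status_code, status) tuples."""
    # Normalize typographic quotes in case the log was pasted through a forum.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    reader = csv.DictReader(io.StringIO(text))
    return [
        (
            datetime.strptime(row["time [UTC]"], TIME_FORMAT),
            int(row["statusCode"]),
            row["status"],
        )
        for row in reader
    ]


def longest_stall(rows):
    """Return (gap_seconds, status_before, status_after) for the largest gap."""
    best = None
    for (t0, _, s0), (t1, _, s1) in zip(rows, rows[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, s0, s1)
    return best


rows = parse_rows(LOG)
gap, before, after = longest_stall(rows)
print(f"longest stall: {gap / 60:.1f} min after '{before}' (next: '{after}')")
```

On this data the largest gap is the roughly 39 minutes spent in `initializing_transaction_servers` before recovery restarted from `reading_coordinated_state`.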
The terminating log pod finally went away at around 9:13 and came back again at 9:15, and it seems like recovery was able to proceed normally after that.
We have filed an issue on the operator to allow new traffic to reach a pod through its service even while the pod is in a terminating state:
"The ServiceIP approach can lead to grey failures", Issue #520, FoundationDB/fdb-kubernetes-operator.
Is it likely that the grey failure (partial connectivity) of the log pod caused recovery to get stuck, as the pod was considered alive enough to be part of recovery, but not able to initialize the new log role?
I am happy to paste more trace data; I am just not sure which events are most useful for the investigation.