During a gameday, we scaled down one entire AZ on a cluster running in three_data_hall redundancy mode. After a few hours of availability, the cluster became unavailable.
The output of the command fdbcli --exec "status details":
Could not communicate with all of the coordination servers.
The database will remain operational as long as we
can connect to a quorum of servers, however the fault
tolerance of the system is reduced as long as the
servers remain disconnected.
x.x.x.x:4500 (unreachable)
x.x.x.x:4500 (reachable)
x.x.x.x:4500 (reachable)
x.x.x.x:4500 (unreachable)
x.x.x.x:4500 (reachable)
x.x.x.x:4500 (reachable)
x.x.x.x:4500 (unreachable)
x.x.x.x:4500 (reachable)
x.x.x.x:4500 (reachable)
Locking coordination state. Verify that a majority of coordination server
processes are active.
Fetching consistency scan information timed out.
The cluster runs version 7.2 on 9 nodes, with 9 coordinators.
It seems that we still had a majority of our coordinators (6 of 9 reachable) even after the disruption, yet the cluster appeared to be blocked at the coordination state. Given this doc, the cluster may have been blocked during a recovery in which it was trying to recruit TLogs.
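To make the quorum reasoning explicit, here is a minimal sketch that counts reachable coordinators in the status text and checks for a simple majority. The function name `coordinator_quorum` and the sample string are my own illustration, not part of fdbcli, and the majority rule (more than half) is the assumption I am working from.

```python
import re

def coordinator_quorum(status_text):
    """Return (reachable, total, has_majority) from fdbcli status text.

    Assumes each coordinator line ends with "(reachable)" or "(unreachable)",
    as in the output pasted above.
    """
    states = re.findall(r"\((reachable|unreachable)\)", status_text)
    total = len(states)
    reachable = states.count("reachable")
    majority = total // 2 + 1  # simple majority: more than half
    return reachable, total, reachable >= majority

# Hypothetical sample reproducing the pattern seen during the gameday.
sample = "\n".join(
    f"x.x.x.x:4500 ({state})"
    for state in ["unreachable", "reachable", "reachable"] * 3
)

reachable, total, ok = coordinator_quorum(sample)
print(f"{reachable}/{total} coordinators reachable, majority: {ok}")
```

With 9 coordinators the majority threshold is 5, so losing the 3 coordinators of one zone should still leave a quorum, which is why the blocked state surprised us.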
When we restored traffic to the failing zone, the cluster became available again.
Here are my questions:
- Is it expected for a FoundationDB cluster in three_data_hall mode to become unavailable when only one zone is failing?
- How can I properly monitor and troubleshoot a recovery? For now I have only checked the cluster.generation metric, but it's not clear to me when I should worry about this value (is there a threshold, etc.).
- Is there a way to force a recovery when the cluster is stuck in one?
- Is there a trace that can explain why the TLogs could not be recruited?
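For context on the monitoring question, this is roughly what I have in mind: a minimal sketch that reads a status JSON document and reports the generation together with the recovery state. I am assuming the `cluster.generation`, `cluster.recovery_state.name` and `cluster.recovery_state.description` fields, and the set of state names I treat as healthy is my own guess rather than an official list. The sample document is hypothetical.

```python
import json

# Assumption: recovery states after which the cluster accepts commits.
# This set is illustrative and may be incomplete or wrong.
HEALTHY_STATES = {
    "accepting_commits",
    "all_logs_recruited",
    "storage_recovered",
    "fully_recovered",
}

def summarize_recovery(status_json):
    """Extract generation and recovery state from a status JSON string."""
    cluster = json.loads(status_json).get("cluster", {})
    recovery = cluster.get("recovery_state", {})
    name = recovery.get("name", "unknown")
    return {
        "generation": cluster.get("generation"),
        "recovery_state": name,
        "description": recovery.get("description", ""),
        "healthy": name in HEALTHY_STATES,
    }

# Hypothetical document shaped like the situation described above.
sample_status = json.dumps({
    "cluster": {
        "generation": 42,
        "recovery_state": {
            "name": "locking_coordinated_state",
            "description": "Locking coordination state. Verify that a "
                           "majority of coordination server processes are active.",
        },
    }
})

summary = summarize_recovery(sample_status)
print(summary["generation"], summary["recovery_state"], summary["healthy"])
```

The input would come from fdbcli --exec "status json". What I do not know is how to interpret the result: whether a generation that keeps increasing, or a recovery state that stays unchanged for some duration, is the right signal to alert on.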
Thank you for your help,