"Locking coordination state" after losing an AZ on three_data_hall

COran · November 27, 2023, 8:56am

During a gameday, we scaled down one AZ entirely on a cluster running in three_data_hall redundancy. After a few hours of availability, the cluster became unavailable.
The output of the command fdbcli --exec "status details":

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)

Locking coordination state. Verify that a majority of coordination server
processes are active.

Fetching consistency scan information timed out.

The cluster runs on 9 nodes in version 7.2, with 9 coordinators.
It seems that we still have a majority of our coordinators (6 of 9 reachable), even after the disruption. But the cluster seems blocked at the coordination state. Given this doc, the cluster may be blocked during a recovery where it was trying to recruit TLogs.
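To double-check the majority claim, here is a minimal Python sketch that counts the reachable/unreachable markers in the status text above. The function names are hypothetical, and it only assumes the "(reachable)" / "(unreachable)" markers shown in the output, not any official parsing API.

```python
import re


def count_coordinators(status_text):
    """Return (reachable, total) counted from the per-coordinator markers."""
    states = re.findall(r"\((reachable|unreachable)\)", status_text)
    return states.count("reachable"), len(states)


def has_majority(reachable, total):
    """A quorum needs strictly more than half of the coordinators."""
    return reachable > total // 2


# Same pattern as the status output above (addresses redacted).
sample = """
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
"""

reachable, total = count_coordinators(sample)
print(reachable, total, has_majority(reachable, total))
```

For the output above this gives 6 reachable out of 9, which is above the 5 needed for a majority.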

When we restored traffic to the failing zone, the cluster became available again.

Here are my questions:

Is it expected for a FoundationDB cluster in three_data_hall to become unavailable when only one zone is failing?
How can I properly monitor/troubleshoot a recovery? For now I only checked the cluster.generation metric, but it's not clear to me when I should worry about this value (is there a threshold, etc.).
Is there a way to force a recovery when the cluster is stuck in one?
Is there a trace that explains why the TLogs could not be recruited?
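For context on the monitoring question, this is roughly how the recovery state could be sampled, as a sketch only. It assumes the JSON printed by fdbcli --exec "status json" exposes cluster.generation and cluster.recovery_state.name / description; the helper names and the sample values are made up for illustration.

```python
def recovery_snapshot(status):
    """Extract the recovery-related fields from one parsed status document."""
    cluster = status.get("cluster", {})
    state = cluster.get("recovery_state", {})
    return {
        "generation": cluster.get("generation"),
        "name": state.get("name"),
        "description": state.get("description"),
    }


def count_recoveries(snapshots):
    """Count how many times the generation changed between consecutive samples."""
    gens = [s["generation"] for s in snapshots if s["generation"] is not None]
    return sum(1 for prev, cur in zip(gens, gens[1:]) if cur != prev)


# Hypothetical samples, shaped like the assumed fields of the status JSON.
samples = [
    {"cluster": {"generation": 12, "recovery_state": {"name": "fully_recovered"}}},
    {"cluster": {"generation": 14, "recovery_state": {
        "name": "locking_coordinated_state",
        "description": "Locking coordination state."}}},
    {"cluster": {"generation": 14, "recovery_state": {
        "name": "locking_coordinated_state",
        "description": "Locking coordination state."}}},
]

snapshots = [recovery_snapshot(s) for s in samples]
print(snapshots[-1]["name"], count_recoveries(snapshots))
```

With something like this, I could alert on how often the generation changes in a time window, or on how long the state name stays away from fully_recovered, rather than on the absolute generation value. Whether that is the right signal is part of the question.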

Thank you for your help,

COran · February 28, 2024, 10:03am

BTW I forgot to update the ticket: we resolved the issue by fixing a bug in the way we were choosing our coordinators. We were not filtering the faulty nodes out of the coordinator candidate list.
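To illustrate the kind of fix, here is a simplified Python sketch, not our actual code. The Candidate type, its fields, and pick_coordinators are hypothetical names; the idea is only to drop unhealthy nodes first and then spread the picks across data halls.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Candidate:
    address: str     # "ip:port" of the process
    data_hall: str   # hall / AZ the process runs in
    healthy: bool    # result of whatever health check is in use (assumption)


def pick_coordinators(candidates, count):
    """Pick up to `count` healthy addresses, interleaved across data halls."""
    by_hall = defaultdict(list)
    for cand in candidates:
        if cand.healthy:  # the missing filter: never propose a faulty node
            by_hall[cand.data_hall].append(cand)

    picked = []
    # Take one candidate per hall in turn so no hall is over-represented.
    for group in zip_longest(*(by_hall[hall] for hall in sorted(by_hall))):
        for cand in group:
            if cand is not None and len(picked) < count:
                picked.append(cand.address)
    return picked


candidates = [
    Candidate("10.0.0.1:4500", "hall-a", True),
    Candidate("10.0.0.2:4500", "hall-a", True),
    Candidate("10.0.1.1:4500", "hall-b", True),
    Candidate("10.0.1.2:4500", "hall-b", False),
    Candidate("10.0.2.1:4500", "hall-c", False),
    Candidate("10.0.2.2:4500", "hall-c", True),
]

print(pick_coordinators(candidates, 3))
```

The resulting addresses would then be what gets passed to the coordinators command in fdbcli.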
