"Locking coordination state" after losing a AZ on three_data_hall

During a gameday, we scaled down one AZ entirely on a cluster running with three_data_hall redundancy. After a few hours of remaining available, the cluster became unavailable.
The output of the command fdbcli --exec "status details":

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (unreachable)
  x.x.x.x:4500  (reachable)
  x.x.x.x:4500  (reachable)

Locking coordination state. Verify that a majority of coordination server
processes are active.

Fetching consistency scan information timed out.

The cluster runs on 9 nodes in version 7.2, with 9 coordinators.
It seems that we still have a majority of our coordinators (6 out of 9 reachable) even after the disruption, yet the cluster appears blocked in the coordination-locking phase. Based on this doc, the cluster may be stuck in a recovery while trying to recruit TLogs.
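
For reference, here is roughly how we inspected the recovery phase from the machine-readable status (only a sketch; the field paths are taken from our 7.2 status json output, so double-check them against yours):

# Dump the machine-readable status once, then read the recovery fields from it.
fdbcli --exec "status json" > /tmp/status.json

# Current recovery phase, e.g. locking_coordinated_state or recruiting_transaction_servers.
jq '.cluster.recovery_state.name, .cluster.recovery_state.description' /tmp/status.json

# Recovery generation; it increments on every recovery, so a steadily climbing
# value suggests the cluster is looping on recoveries.
jq '.cluster.generation' /tmp/status.json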

When we restored traffic to the failing zone, the cluster became available again.

Here are my questions:

  • Is it expected for a FoundationDB cluster in three_data_hall to become unavailable when only one zone is failing?
  • How can I properly monitor and troubleshoot a recovery? For now I have only checked the cluster.generation metric, but it is not clear to me when I should worry about this value (is there a threshold, etc.)? See the sketch after this list for what we do today.
  • Is there a way to force a recovery when the cluster is stuck in one?
  • Is there a trace event that explains why TLogs could not be recruited?
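
For what it is worth, this is all we do today (a sketch; the MasterRecoveryState event name and the log path are assumptions based on our own 7.2 deployment, so adapt them to your version and logdir):

# Watch the generation and recovery phase over time.
watch -n 5 'fdbcli --exec "status json" | jq ".cluster.generation, .cluster.recovery_state.name"'

# Grep the fdbserver trace files for recovery progress events.
# "MasterRecoveryState" is the event name we see in our XML traces; the name
# and the log directory may differ on your setup.
grep -h "MasterRecoveryState" /var/log/foundationdb/trace.*.xml | tail -n 20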

Thank you for your help,

BTW I forgot to update the ticket: we resolved the issue by fixing a bug in the way we were choosing our coordinators. We were not filtering the faulty nodes out of the coordinator candidate list.
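
In case it helps someone else, the fix boils down to something like the sketch below (simplified; the jq filter is our own assumption about which status json fields identify usable processes, so verify it against your output before reusing it):

# Build the candidate list only from processes the cluster currently reports
# and that are not excluded; faulty/unreachable nodes should no longer slip in.
fdbcli --exec "status json" \
  | jq -r '.cluster.processes[] | select(.excluded != true) | .address'

# We intersect that list with our desired placement before running:
#   fdbcli --exec "coordinators <addr1:port> <addr2:port> ..."
# Alternatively, "coordinators auto" lets fdbcli pick a suitable set itself:
#   fdbcli --exec "coordinators auto"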