Cluster stuck with status "Locking coordination state" even when all coordination servers are available

We run a cluster-population automation that does the following:

  1. Starts a new, triple-replicated cluster on 5 machines (1 coordinator per machine).
  2. Transacts a custom metadata key into the database.
  3. Sets the replication factor to single.
  4. Populates the cluster using a number of clients transacting objects simultaneously in a loop.
  5. Once populated, we move it back to triple replication.
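For reference, the configuration steps (1, 2, 3 and 5) can be sketched roughly as below. This is only an illustration, not our actual automation: it assumes `fdbcli` is on the PATH, the `ssd` storage engine, and a placeholder key and value; `fdbcli_args`, `configuration_steps` and `run_step` are hypothetical helper names. Step 4 (the client population loop) is left out.

```python
import subprocess

def fdbcli_args(command, cluster_file=None):
    """Build the argv for one non-interactive fdbcli invocation."""
    args = ["fdbcli"]
    if cluster_file is not None:
        args += ["-C", cluster_file]  # -C selects the cluster file
    args += ["--exec", command]
    return args

def configuration_steps(meta_key, meta_value):
    """Return (step label, fdbcli command) pairs for steps 1, 2, 3 and 5."""
    return [
        # "ssd" is an assumed storage engine, not stated in the post.
        ("1: create triple-replicated database", "configure new triple ssd"),
        ("2: write metadata key", f"writemode on; set {meta_key} {meta_value}"),
        ("3: drop to single replication", "configure single"),
        ("5: restore triple replication", "configure triple"),
    ]

def run_step(command, cluster_file=None, timeout_s=60):
    """Run one fdbcli command and return whatever it printed."""
    result = subprocess.run(
        fdbcli_args(command, cluster_file),
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout
```

Something like `run_step(configuration_steps("meta", "v1")[2][1])` would then issue the `configure single` change from step 3.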

We’ve been facing issues where clients time out in a read operation right at the beginning of step 4. The read is on the same key that was transacted in step 2. When this happens we dump the cluster status and get the following:

fdbcli --exec "status details"

Locking coordination state. Verify that a majority of coordination server
processes are active.

10.134.188.8:4271  (reachable)
10.134.188.9:4271  (reachable)
10.134.188.109:4271  (reachable)
10.134.188.116:4271  (reachable)
10.134.188.119:4271  (reachable)

Any insights into why the cluster might be stuck? This happens only sometimes, so it is hard to reproduce the exact scenario.

Can you look in the trace logs and see if there are any “severity 40” events?
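If it helps, here is a rough sketch for pulling those events out. It assumes the default XML trace format (one `<Event ... />` element per line) and the default Linux log directory `/var/log/foundationdb`; adjust the glob pattern if your `logdir` differs. `severe_events` and `scan_trace_files` are just illustrative names.

```python
import glob
import re

# Matches Name="value" attribute pairs inside one <Event ... /> line.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def severe_events(lines, min_severity=40):
    """Yield attribute dicts for trace events at or above min_severity."""
    for line in lines:
        if not line.lstrip().startswith("<Event"):
            continue  # skip the XML header and <Trace> wrapper lines
        attrs = dict(ATTR.findall(line))
        try:
            severity = int(attrs.get("Severity", ""))
        except ValueError:
            continue  # no usable Severity attribute on this line
        if severity >= min_severity:
            yield attrs

def scan_trace_files(pattern="/var/log/foundationdb/trace.*.xml"):
    """Print time, type and machine for every severe event found."""
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            for event in severe_events(handle):
                print(path, event.get("Time"), event.get("Type"),
                      event.get("Machine"))
```

Running `scan_trace_files()` on each machine around the time of the timeout should show whether any process logged an error during the stuck recovery.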

The "Locking coordination state" message means that an FDB recovery is in progress. See

for details.
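To watch which recovery phase the cluster is sitting in, something like the sketch below may help. It assumes `fdbcli --exec "status json"` prints a JSON document, and that the recovery phase is reported under `cluster.recovery_state.name` and coordinator reachability under `client.coordinators.coordinators`, which is my understanding of the status schema; the helper names are made up for illustration. Missing sections simply come back as `None` or empty.

```python
import json
import subprocess

def recovery_state_name(status):
    """Return the recovery phase name from a parsed status document, if any."""
    return status.get("cluster", {}).get("recovery_state", {}).get("name")

def coordinators_reachable(status):
    """Map each coordinator address to its reported reachability."""
    coords = (status.get("client", {})
                    .get("coordinators", {})
                    .get("coordinators", []))
    return {c.get("address"): c.get("reachable") for c in coords}

def fetch_status(cluster_file=None, timeout_s=30):
    """Run `fdbcli --exec "status json"` and parse what it prints."""
    args = ["fdbcli"]
    if cluster_file is not None:
        args += ["-C", cluster_file]
    args += ["--exec", "status json"]
    result = subprocess.run(args, capture_output=True, text=True,
                            timeout=timeout_s)
    return json.loads(result.stdout)
```

Polling `recovery_state_name(fetch_status())` every few seconds and logging the result would show whether the recovery advances past the coordination-state phase or stays there.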

Unfortunately, there is no troubleshooting guide for a stuck recovery.

A similar issue that leads to the same result is "The database gets unavailable after changing usable_regions" (Issue #3925, apple/foundationdb on GitHub).
