Cluster stuck with status "Locked coordination state" even when all coordination servers available

ashishgupta · February 1, 2021, 9:27am

We run cluster-population automation that does the following:

1. Starts a new, triple-replicated cluster on 5 machines (1 coordinator per machine).
2. Transacts a custom metadata key into the database.
3. Sets the replication factor to single.
4. Populates the cluster using a number of clients transacting objects simultaneously in a loop.
5. Once populated, moves it back to triple replication.

We’ve been facing issues where clients time out in a read operation right at the beginning of step 4. The read is on the same key that was transacted in step 2. When this happens, we dump the cluster status and get the following:

fdbcli --exec "status details"

Locking coordination state. Verify that a majority of coordination server
processes are active.

10.134.188.8:4271  (reachable)
10.134.188.9:4271  (reachable)
10.134.188.109:4271  (reachable)
10.134.188.116:4271  (reachable)
10.134.188.119:4271  (reachable)

Any insights into why the cluster might be stuck? This happens only sometimes, so it is hard to reproduce the exact scenario.
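
For automation that hits this, a check along the following lines could help tell "recovery is stuck" apart from "coordinators are actually unreachable". This is only a sketch: it assumes the plain-text layout of the `status details` output shown above, and the function name `summarize_coordination` is made up for illustration.

```python
import re

# Matches coordinator lines such as "10.134.188.8:4271  (reachable)".
COORD_LINE = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3}:\d+)\s+\((\w+)\)\s*$")


def summarize_coordination(status_text):
    """Summarize coordinator reachability from `status details` text output.

    Hypothetical helper: it relies on the text layout shown in this post.
    """
    coords = {}
    for line in status_text.splitlines():
        m = COORD_LINE.match(line)
        if m:
            coords[m.group(1)] = (m.group(2) == "reachable")
    total = len(coords)
    reachable = sum(coords.values())
    return {
        # True when the status text carries the message from this thread.
        "locking": "Locking coordination state" in status_text,
        "reachable": reachable,
        "total": total,
        # A majority means strictly more than half of the coordinators.
        "has_majority": total > 0 and reachable >= total // 2 + 1,
    }


# The status output from this post, used as sample input.
SAMPLE = """\
Locking coordination state. Verify that a majority of coordination server
processes are active.

10.134.188.8:4271  (reachable)
10.134.188.9:4271  (reachable)
10.134.188.109:4271  (reachable)
10.134.188.116:4271  (reachable)
10.134.188.119:4271  (reachable)
"""

summary = summarize_coordination(SAMPLE)
print(summary)
```

If `locking` and `has_majority` are both true, as in the output above, the coordinators themselves are probably not the problem and the trace logs are the next place to look.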

andrew.noyes · February 1, 2021, 5:28pm

Can you look in the trace logs and see if there are any “severity 40” events?
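
A quick way to pull those events out of the logs might look like the sketch below. It assumes the default XML trace format, where each event is one line carrying `Severity="..."` and `Type="..."` attributes. The helper name `severe_events` and the sample event types are invented for illustration.

```python
import re

SEVERITY_ATTR = re.compile(r'\bSeverity="(\d+)"')
TYPE_ATTR = re.compile(r'\bType="([^"]*)"')


def severe_events(lines, min_severity=40):
    """Return (severity, type) pairs for trace events at or above min_severity.

    Hypothetical helper: assumes one XML <Event .../> element per line.
    """
    hits = []
    for line in lines:
        sev = SEVERITY_ATTR.search(line)
        if sev and int(sev.group(1)) >= min_severity:
            typ = TYPE_ATTR.search(line)
            hits.append((int(sev.group(1)), typ.group(1) if typ else "?"))
    return hits


# Made-up sample lines; real event types and attributes will differ.
SAMPLE_LINES = [
    '<Event Severity="10" Time="1612171620.000000" Type="ExampleInfo" Machine="10.134.188.8:4271" />',
    '<Event Severity="40" Time="1612171621.000000" Type="ExampleError" Machine="10.134.188.9:4271" />',
]

found = severe_events(SAMPLE_LINES)
print(found)
```

To run it over real logs, pass an open trace file (one of the `trace.*.xml` files in the log directory configured for the fdbserver processes) as `lines`.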

osamarin · February 4, 2021, 12:29pm

The "Locking coordination state" message means that an FDB recovery is in progress. See

https://github.com/apple/foundationdb/blob/master/design/recovery-internals.md

# FDB Recovery Internals

FDB uses recovery to handle various failures, such as hardware and network failures. When the current transaction system no longer works properly due to failures, recovery is automatically triggered to create a new generation of the transaction system.

This document explains at the high level how the recovery works in a single cluster. The audience of this document includes both FDB developers who want to have a basic understanding of the recovery process and database administrators who need to understand why a cluster fails to recover. This document does not discuss the complexity introduced to the recovery process by the multi-region configuration.

## Background

## `ServerDBInfo` data structure

This data structure contains transient information which is broadcast to all workers for a database, permitting them to communicate with each other. It contains, for example, the interfaces for cluster controller (CC), master, ratekeeper, and resolver, and holds the log system's configuration. Only part of the data structure, such as `ClientDBInfo` that contains the list of GRV proxies and commit proxies, is available to the client.

Whenever a field of `ServerDBInfo` changes, the new value of the field (say, the new master's interface) is sent to the CC, and the CC propagates the new `ServerDBInfo` to all workers in the cluster.

## When will recovery happen?
Failure of certain roles in FDB can cause recovery. Those roles are cluster controller, master, GRV proxy, commit proxy, transaction logs (tLog), resolvers, log router, and backup workers.

Network partitions or failures can leave the CC unable to reach some roles; the CC then treats those roles as dead, which causes recovery. If the CC cannot connect to a majority of coordinators, the coordinators treat it as dead and recovery happens.

A "better master exists" event can also trigger recovery. This event occurs when the cluster changes such that there is a better location for some already recruited processes (say, the master role).

for details.

Unfortunately, there is no troubleshooting guide for a stuck recovery.

A similar issue that produces the same result is "The database gets unavailable after changing usable_regions" (Issue #3925, apple/foundationdb on GitHub).
