We run a cluster population automation that does the following:
1. Starts a new, triple-replicated cluster on 5 machines (1 coordinator per machine).
2. Transacts a custom metadata key into the database.
3. Sets the replication factor to single.
4. Populates the cluster using a number of clients transacting objects simultaneously in a loop.
5. Once populated, we move it back to triple replication.
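For reference, the two replication changes (dropping to single before populating, then returning to triple) can be sketched roughly as below. This is only an illustration, not our actual automation: the helper names `redundancy_command` and `set_redundancy` are made up for this post, and it assumes `fdbcli` is on the PATH and uses the default cluster file.

```python
import subprocess

# Redundancy modes this sketch accepts (illustrative subset).
_MODES = ("single", "double", "triple")


def redundancy_command(mode):
    """Build the fdbcli invocation that switches the redundancy mode."""
    if mode not in _MODES:
        raise ValueError(f"unsupported redundancy mode: {mode}")
    return ["fdbcli", "--exec", f"configure {mode}"]


def set_redundancy(mode, timeout_s=60):
    """Run fdbcli to change the redundancy mode and return its output.

    Assumes fdbcli is on PATH and the default cluster file is correct.
    """
    completed = subprocess.run(
        redundancy_command(mode),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return completed.stdout
```

With these helpers, the automation would call `set_redundancy("single")` before populating and `set_redundancy("triple")` once population is done.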
We’ve been seeing clients time out on a read operation right at the beginning of step 4. The read is on the same key that was transacted in step 2. When this happens, we dump the cluster status and get the following:
fdbcli --exec "status details"
Locking coordination state. Verify that a majority of coordination server
processes are active.
10.134.188.8:4271 (reachable)
10.134.188.9:4271 (reachable)
10.134.188.109:4271 (reachable)
10.134.188.116:4271 (reachable)
10.134.188.119:4271 (reachable)
Any insights into why the cluster might be stuck? This happens only intermittently, so it is hard to reproduce the exact scenario.
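In case it helps anyone reproduce this, here is a minimal sketch of how the status dump could be checked automatically for a reachable coordinator majority, which is what the message asks us to verify. The function names are made up for illustration, and it assumes each coordinator line has the `address:port (state)` shape shown above, treating any state other than `reachable` as unreachable.

```python
import re

# Matches lines such as "10.134.188.8:4271 (reachable)".
_COORDINATOR_LINE = re.compile(r"^\s*(\S+:\d+)\s+\((.+)\)\s*$")


def coordinator_reachability(status_text):
    """Map each coordinator address to True if it is reported reachable."""
    result = {}
    for line in status_text.splitlines():
        match = _COORDINATOR_LINE.match(line)
        if match:
            result[match.group(1)] = match.group(2) == "reachable"
    return result


def majority_reachable(status_text):
    """Return True if more than half of the listed coordinators are reachable."""
    coordinators = coordinator_reachability(status_text)
    reachable = sum(coordinators.values())
    return bool(coordinators) and reachable > len(coordinators) // 2
```

On the output above, all five coordinators are listed as reachable, so `majority_reachable` would return True.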