30 server cluster just died

espaiz · June 5, 2021, 4:22pm

A cluster of 30 servers, all identical: Linode, 4 CPUs, 8 GB RAM, 160 GB SSD

Was humming along for months, suddenly stopped responding:


Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Initializing new transaction servers and recovering transaction logs.

Tried restarting all servers at once, nothing changed. Then tried “kill; kill all”, and the error message changed to:


Using cluster file `/etc/foundationdb/fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  192.168.128.17:4500  (reachable)
  192.168.190.195:4500  (reachable)
  192.168.192.85:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

The latest “status json” output will be included in the next message.

Anything will help at this moment. This is (was) a live cluster. Config is ssd double.

espaiz · June 5, 2021, 4:24pm

The status json output was too big to paste here, so I used the first pastebin I found: fdb status json - Pastebin

andrew.noyes · June 5, 2021, 6:01pm

I noticed that a few processes are marked as “degraded”, which I think might mean they’re having trouble writing to disk. Another thing that might be helpful is to grep through the log directories for events with “Severity” = “40”.
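The grep suggested above could be sketched in Python roughly as follows. This is only an illustration: the log directory `/var/log/foundationdb` and the `trace.*.xml` file pattern are assumptions about a default install, and the function names are made up for this example. Each `<Event ... />` line is parsed on its own so that a trace file still being written does not break the scan.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_events(lines):
    """Yield the attributes of every well-formed <Event ... /> line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            yield dict(ET.fromstring(line).attrib)
        except ET.ParseError:
            continue  # skip truncated or partially written lines


def severe_events(lines, min_severity=40):
    """Return events whose Severity attribute is at least min_severity."""
    return [
        event
        for event in parse_events(lines)
        if int(event.get("Severity", 0)) >= min_severity
    ]


def scan_log_dir(log_dir="/var/log/foundationdb", min_severity=40):
    """Print severe events from every trace file (path and pattern assumed)."""
    for path in sorted(Path(log_dir).glob("trace.*.xml")):
        with path.open(errors="replace") as fh:
            for event in severe_events(fh, min_severity):
                print(path.name, event.get("Time"), event.get("Type"),
                      event.get("Machine"))
```

Calling `scan_log_dir()` on each host would list the matching events. Passing `min_severity=30` would also include warning-level events, which may be useful when nothing shows up at 40.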

mengxu · June 6, 2021, 1:57am

Also grep for MasterRecovery* events.

A side note: having only 3 CDs for a 30 host cluster is not a good idea. You should consider using more CDs, say 7 CDs, to tolerate 3 failures.

espaiz · June 6, 2021, 3:22am

After some downtime, with the status as described in my previous messages, I tried “kill; kill all” again. Within seconds the cluster became operational, with status showing everything OK.

@andrew.noyes thanks for the response, but there’s no Severity=“40” in the logs (on either the normal or the degraded machines). How do I figure out why a server is “degraded”? It seems like a new machine, with a mostly empty SSD.
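One possible way to see which processes are flagged is to read a saved copy of the “status json” output. The sketch below is a guess at the layout rather than a documented recipe: it assumes processes live under `cluster.processes`, that a flagged process carries a truthy `degraded` field, and that any per-process `messages` entries have a `name`. The file name `status.json` and the function name are hypothetical.

```python
import json


def degraded_processes(status):
    """Return (address, message names) for processes flagged as degraded.

    The field names used here are assumptions about the status json layout.
    """
    processes = status.get("cluster", {}).get("processes", {})
    found = []
    for proc in processes.values():
        if proc.get("degraded"):
            names = [m.get("name", "") for m in proc.get("messages", [])]
            found.append((proc.get("address", "?"), names))
    return sorted(found)


# Hypothetical usage with a file saved from fdbcli's `status json`:
# with open("status.json") as fh:
#     for address, messages in degraded_processes(json.load(fh)):
#         print(address, messages)
```

If the message list is empty for a degraded process, the trace logs on that host are probably the next place to look.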

@mengxu could you please explain what “CD” stands for? Thanks.

espaiz · June 6, 2021, 3:26am

@mengxu found some MasterRecovery logs


<Event Severity="10" Time="1617550218.589614" Type="MasterRecovery" ID="4b63fd9a6802dbe3" BeginPair="33d0fdf9a963418c" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" />
<Event Severity="10" Time="1617550218.589614" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="0" Status="reading_coordinated_state" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.593551" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="1" Status="locking_coordinated_state" TLogs="0" ActiveGenerations="1" MyRecoveryCount="2" ForceRecovery="0" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.593551" Type="MasterRecoveryGenerations" ID="4b63fd9a6802dbe3" ActiveGenerations="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.595216" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="3" Status="reading_transaction_system_state" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.595216" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="5" Status="configuration_never_created" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.601350" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="7" Status="recruiting_transaction_servers" RequiredTLogs="1" DesiredTLogs="3" RequiredProxies="1" DesiredProxies="3" RequiredResolvers="1" DesiredResolvers="1" StoreType="memory" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.601974" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="8" Status="initializing_transaction_servers" Proxies="1" TLogs="1" Resolvers="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.621812" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="9" Status="recovery_transaction" PrimaryLocality="-1" DcId="[not set]" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.621812" Type="MasterRecoveryCommit" ID="4b63fd9a6802dbe3" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" />
<Event Severity="10" Time="1617550219.625397" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="10" Status="writing_coordinated_state" TLogList="0: 7a07a327f87d08680c64603903fbebcc " Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecovery" ID="4b63fd9a6802dbe3" EndPair="33d0fdf9a963418c" RecoveryTransactionVersion="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryDuration" ID="4b63fd9a6802dbe3" RecoveryDuration="0.00438118" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="11" Status="accepting_commits" StoreType="memory" RecoveryDuration="0.00438118" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="14" Status="fully_recovered" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryGenerations" ID="4b63fd9a6802dbe3" ActiveGenerations="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="30" Time="1622890112.163243" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622890112.163243" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622890112.163243" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1622895242.471707" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622895242.471707" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622895242.471707" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1622901548.959852" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622901548.959852" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622901548.959852" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />

Does this mean anything to you?

ajbeamon · June 6, 2021, 3:04pm

CD was used as an abbreviation for coordinators. The general advice is that you run a specific number of coordinators based on your configuration (but not necessarily your cluster size).

In a triple replicated cluster, the fault tolerance is intended to be 2 (2 machines can fail without losing data or availability). You need at least 5 coordinators to provide that guarantee for the coordination role, since you need a quorum of coordinators to be available. Similarly, double replicated clusters would need 3 coordinators to achieve the target fault tolerance.

There are also some configurations that benefit from more coordinators. For example, in multi-region configurations you may want to tolerate one region failure plus one additional machine failure, and the way to do that with coordinators is to run 3 in each of 3 different regions, for a total of 9.
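The coordinator counts above follow from simple majority arithmetic: a quorum needs more than half of the coordinators reachable. A small sketch of that calculation (the function names are made up for illustration and are not part of any FoundationDB API):

```python
def coordinator_fault_tolerance(num_coordinators):
    """Coordinators that can fail while a strict majority stays reachable."""
    if num_coordinators < 1:
        raise ValueError("need at least one coordinator")
    return (num_coordinators - 1) // 2


def min_coordinators(failures_to_tolerate):
    """Smallest coordinator count that survives the given number of failures."""
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    return 2 * failures_to_tolerate + 1


def has_quorum(total, lost):
    """True if the coordinators still reachable form a strict majority."""
    return (total - lost) > total // 2


for n in (3, 5, 7, 9):
    print(n, "coordinators tolerate", coordinator_fault_tolerance(n), "failures")
```

With these definitions, 3 coordinators tolerate 1 failure (the double-replication target), 5 tolerate 2 (the triple-replication target), and 9 spread as 3 per region still have a quorum after losing a whole region plus one more machine, since `has_quorum(9, 4)` holds.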

You can run more coordinators than specified above, but it doesn’t necessarily change your guarantees significantly. You’ll be more resilient to losses of coordinators, but you’ll still be susceptible to other losses.

mengxu · June 6, 2021, 6:34pm

CD is coordinator.

Does the log you sent cover the incident period?
Two observations:

The machine is using IP 127.0.0.1, which seems wrong.
The process is used for many roles (check the “Roles” keyword). It does not look like a correct config for a 30-host cluster.
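One way to check whether pasted events cover the incident is to turn each event's `Time` attribute into a readable timestamp. The sketch below assumes `Time` is Unix epoch seconds, which matches the magnitude of the values in this thread but is an assumption here. The function name is hypothetical, and the two sample lines are copied from the events posted earlier.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def recovery_timeline(lines):
    """Return (utc_time, machine, type, status) for MasterRecovery* events."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # skip truncated lines
        if not attrs.get("Type", "").startswith("MasterRecovery"):
            continue
        # Assumption: Time is seconds since the Unix epoch.
        when = datetime.fromtimestamp(float(attrs["Time"]), tz=timezone.utc)
        rows.append((when, attrs.get("Machine"), attrs["Type"],
                     attrs.get("Status", "")))
    return rows


sample = [
    '<Event Severity="10" Time="1617550218.589614" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="0" Status="reading_coordinated_state" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />',
    '<Event Severity="10" Time="1622890112.163243" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />',
]

for when, machine, event_type, status in recovery_timeline(sample):
    print(when.isoformat(), machine, event_type, status)
```

Under that assumption, the 127.0.0.1 events date from 2021-04-04 (UTC), about two months before the outage, while the 192.168.145.92 events are stamped 2021-06-05 with `OriginalTime` values from 2021-06-04.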
