30 server cluster just died

A cluster of 30 servers, all identical: Linode, 4 CPUs, 8 GB RAM, 160 GB SSD.

Was humming along for months, suddenly stopped responding:


Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Initializing new transaction servers and recovering transaction logs.

I tried restarting all servers at once; nothing changed. Then I tried “kill; kill all”, and the error message changed to:


Using cluster file `/etc/foundationdb/fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  192.168.128.17:4500  (reachable)
  192.168.190.195:4500  (reachable)
  192.168.192.85:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

The latest “status json” output will be included in the next message.

Anything will help at this moment. This is (was) a live cluster. The configuration is ssd double.

The status json output was too big to paste here, so I used the first pastebin I found: fdb status json - Pastebin

I noticed that a few processes are marked as “degraded”, which I think might mean they’re having trouble writing to disk. Another thing that might be helpful is to grep through the log directories for events with “Severity” = “40”.

Also grep for MasterRecovery* events.
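If plain grep gets unwieldy across 30 hosts, the two searches suggested above can be sketched as a small Python script. This is only a sketch under assumptions: it assumes the trace files are XML with one `<Event ... />` element per line (as in the log lines pasted later in this thread), and the `/var/log/foundationdb/trace.*.xml` path is a guess at the log location, so adjust it to whatever `logdir` your processes actually use. The function names are made up for illustration.

```python
import glob
import xml.etree.ElementTree as ET


def parse_event(line):
    """Return the attributes of one <Event .../> line as a dict, or None."""
    line = line.strip()
    if not line.startswith("<Event"):
        return None
    try:
        return dict(ET.fromstring(line).attrib)
    except ET.ParseError:
        # Truncated or partially written lines are skipped, not fatal.
        return None


def is_interesting(event):
    """Severity 40 (or higher) events, plus anything named MasterRecovery*."""
    try:
        severity = int(event.get("Severity", "0"))
    except ValueError:
        severity = 0
    return severity >= 40 or event.get("Type", "").startswith("MasterRecovery")


def scan(lines):
    """Collect the interesting events from an iterable of log lines."""
    events = []
    for line in lines:
        event = parse_event(line)
        if event is not None and is_interesting(event):
            events.append(event)
    return events


if __name__ == "__main__":
    # Assumed log location; change the pattern to match your setup.
    for path in sorted(glob.glob("/var/log/foundationdb/trace.*.xml")):
        with open(path, errors="replace") as handle:
            for event in scan(handle):
                print(path, event.get("Time"), event.get("Severity"),
                      event.get("Type"), event.get("Status", ""))
```

Run it on each host (or against a directory of collected logs) and compare the timestamps of the hits with the time the cluster stopped responding.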

A side note: having only 3 CDs for a 30 host cluster is not a good idea. You should consider using more CDs, say 7 CDs, to tolerate 3 failures.

After some downtime, with the status as described in the previous messages, I tried “kill; kill all” again. After a few seconds the cluster became operational, with status showing everything OK.

@andrew.noyes thanks for the response, but there’s no Severity=“40” in the logs (on either normal or degraded machines). How do I figure out why a server is “degraded”? It seems like a new machine, with a mostly empty SSD.

@mengxu could you please explain what “CD” stands for? Thanks.

@mengxu I found some MasterRecovery logs:


<Event Severity="10" Time="1617550218.589614" Type="MasterRecovery" ID="4b63fd9a6802dbe3" BeginPair="33d0fdf9a963418c" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" />
<Event Severity="10" Time="1617550218.589614" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="0" Status="reading_coordinated_state" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.593551" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="1" Status="locking_coordinated_state" TLogs="0" ActiveGenerations="1" MyRecoveryCount="2" ForceRecovery="0" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.593551" Type="MasterRecoveryGenerations" ID="4b63fd9a6802dbe3" ActiveGenerations="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.595216" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="3" Status="reading_transaction_system_state" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550218.595216" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="5" Status="configuration_never_created" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.601350" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="7" Status="recruiting_transaction_servers" RequiredTLogs="1" DesiredTLogs="3" RequiredProxies="1" DesiredProxies="3" RequiredResolvers="1" DesiredResolvers="1" StoreType="memory" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.601974" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="8" Status="initializing_transaction_servers" Proxies="1" TLogs="1" Resolvers="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MS" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.621812" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="9" Status="recovery_transaction" PrimaryLocality="-1" DcId="[not set]" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.621812" Type="MasterRecoveryCommit" ID="4b63fd9a6802dbe3" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" />
<Event Severity="10" Time="1617550219.625397" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="10" Status="writing_coordinated_state" TLogList="0: 7a07a327f87d08680c64603903fbebcc " Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecovery" ID="4b63fd9a6802dbe3" EndPair="33d0fdf9a963418c" RecoveryTransactionVersion="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryDuration" ID="4b63fd9a6802dbe3" RecoveryDuration="0.00438118" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="11" Status="accepting_commits" StoreType="memory" RecoveryDuration="0.00438118" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryState" ID="4b63fd9a6802dbe3" StatusCode="14" Status="fully_recovered" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1617550219.626193" Type="MasterRecoveryGenerations" ID="4b63fd9a6802dbe3" ActiveGenerations="1" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP,MS,RV,SS,TL" TrackLatestType="Original" />
<Event Severity="30" Time="1622890112.163243" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622890112.163243" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622890112.163243" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1622895242.471707" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622895242.471707" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622895242.471707" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1622901548.959852" OriginalTime="1622822671.687044" Type="MasterRecoveryDuration" ID="98b994142cd2a3cf" RecoveryDuration="7.9465" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,RV,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622901548.959852" OriginalTime="1622822679.520790" Type="MasterRecoveryGenerations" ID="98b994142cd2a3cf" ActiveGenerations="1" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1622901548.959852" OriginalTime="1622822679.520790" Type="MasterRecoveryState" ID="98b994142cd2a3cf" StatusCode="14" Status="fully_recovered" Machine="192.168.145.92:4500" LogGroup="default" Roles="MS,SS" TrackLatestType="Rolled" />

Does this mean anything to you?

CD was used as an abbreviation for coordinators. The general advice is that you run a specific number of coordinators based on your configuration (but not necessarily your cluster size).

In a triple replicated cluster, the fault tolerance is intended to be 2 (2 machines can fail without losing data or availability). You need at least 5 coordinators to provide that guarantee for the coordination role, since you need a quorum of coordinators to be available. Similarly, double replicated clusters would need 3 coordinators to achieve the target fault tolerance.

There are also some configurations that benefit from more coordinators. For example, in multi-region configurations you may want to tolerate one region failure plus one additional machine failure, and the way to do that with coordinators is to run 3 in each of 3 different regions, for a total of 9.

You can run more coordinators than specified above, but it doesn’t necessarily change your guarantees significantly. You’ll be more resilient to losses of coordinators, but you’ll still be susceptible to other losses.
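The coordinator counts above follow from simple majority-quorum arithmetic, which can be sketched in a few lines of Python. The function names are made up for illustration and nothing here calls FoundationDB; it only restates the numbers from this post.

```python
def quorum(total_coordinators):
    """Smallest majority of the coordinators."""
    return total_coordinators // 2 + 1


def coordinators_needed(failures_to_tolerate):
    """Fewest coordinators that still leave a majority after that many losses."""
    return 2 * failures_to_tolerate + 1


def survives(total_coordinators, lost):
    """True if a majority is still reachable after losing `lost` coordinators."""
    return total_coordinators - lost >= quorum(total_coordinators)


# Double replication: tolerate 1 failure, so 3 coordinators.
print(coordinators_needed(1))
# Triple replication: tolerate 2 failures, so 5 coordinators.
print(coordinators_needed(2))
# 3 regions x 3 coordinators: lose a whole region (3) plus one more machine.
print(survives(9, 3 + 1))
# With only 3 coordinators, losing 2 of them loses the quorum.
print(survives(3, 2))
```

The same arithmetic gives the earlier suggestion of 7 coordinators to tolerate 3 failures.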

CD is coordinator.

Does the log you sent cover the incident period?
Two observations:

  1. The machine is using IP 127.0.0.1, which seems wrong.
  2. The process is used for many roles (check the “Roles” keyword). It does not look like a correct config for a 30 host cluster.
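Both observations, and the question about the incident period, can be checked mechanically. Here is a rough Python sketch that groups trace events by their `Machine` attribute and reports the time span and the set of roles seen for each. It assumes one `<Event ... />` element per line, that `Time` is a Unix timestamp in seconds, and that `OriginalTime`, when present, is when the event was first logged; that last point is a guess based on the attribute name and the “Rolled” events in the pasted log. The function names are made up for illustration.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET


def to_utc(timestamp):
    """Format a Unix timestamp (seconds) as a UTC date and time."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def summarize(lines):
    """Map each Machine to its first/last event time and the roles it reported."""
    summary = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # skip truncated lines
        if "Time" not in attrs:
            continue
        # Assumption: OriginalTime is the time the event was first logged.
        when = float(attrs.get("OriginalTime", attrs["Time"]))
        machine = attrs.get("Machine", "unknown")
        entry = summary.setdefault(
            machine, {"first": when, "last": when, "roles": set()})
        entry["first"] = min(entry["first"], when)
        entry["last"] = max(entry["last"], when)
        entry["roles"].update(r for r in attrs.get("Roles", "").split(",") if r)
    return summary


def report(lines):
    """Print one line per machine: covered time span (UTC) and roles."""
    for machine, entry in sorted(summarize(lines).items()):
        print(machine, to_utc(entry["first"]), "to", to_utc(entry["last"]),
              "roles:", ",".join(sorted(entry["roles"])))
```

Feed `report` the lines of a trace file (for example `report(open(path))`). If the time span for a machine ends before the outage began, or the only machine listed is 127.0.0.1:4500, the file is probably not from the incident.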