Cluster stuck in recovery after crash of one node

I have a FoundationDB v6.3.18 cluster deployed on 3 nodes (12 disks in total). Recently our cloud provider had an outage and one node in the cluster was down for several hours. The cloud provider recovered the node, but FoundationDB has been stuck in the recovery state for 40+ hours. The cluster stores roughly 10 TB in total.

foundationdb.conf

```
user = foundationdb
group = foundationdb

[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID:tls
listen_address = public
datadir = /data/$ID
logdir = /var/log/foundationdb
tls_ca_file=/etc/foundationdb/tls/fdbprodCA.pem
tls_certificate_file=/etc/foundationdb/tls/fdbprod.pem
tls_key_file=/etc/foundationdb/tls/fdbprod.key
tls_verify_peers=Check.Valid=0

[fdbserver.4500]
class=stateless
[fdbserver.4501]
class=storage
datadir=/data/nvme1n1/foundationdb/data/$ID
logdir=/data/nvme1n1/foundationdb
[fdbserver.4502]
class=storage
datadir=/data/nvme2n1/foundationdb/data/$ID
logdir=/data/nvme2n1/foundationdb
[fdbserver.4503]
class=storage
datadir=/data/nvme3n1/foundationdb/data/$ID
logdir=/data/nvme3n1/foundationdb
[fdbserver.4504]
class=transaction
datadir=/data/nvme4n1/foundationdb/data/$ID
logdir=/data/nvme4n1/foundationdb

[fdbserver.4505]
class=stateless
[fdbserver.4506]
class=stateless
[fdbserver.4507]
class=stateless
[fdbserver.4508]
class=stateless
[fdbserver.4509]
class=stateless
[fdbserver.4510]
class=stateless
[fdbserver.4511]
class=stateless
```

fdbcli status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
Redundancy mode - unknown
Storage engine - unknown
Coordinators - unknown
Usable Regions - unknown

Cluster:
FoundationDB processes - 36
Zones - 3
Machines - 3
Memory availability - 8.0 GB per process on machine with least available
Server time - 03/17/22 12:34:35

Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - unknown

Operating space:
Unable to retrieve operating space status

Workload:
Read rate - unknown
Write rate - unknown
Transactions started - unknown
Transactions committed - unknown
Conflict rate - unknown

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
65.108.100.113:4500:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4501:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4502:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4503:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4504:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.113:4505:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4506:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4507:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4508:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4509:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4510:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4511:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4500:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4504:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.115:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4506:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4508:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4511:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4500:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4504:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4506:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4508:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4511:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )

Coordination servers:
65.108.100.113:4500:tls (reachable)
65.108.100.115:4500:tls (reachable)
65.108.100.116:4500:tls (reachable)

Client time: 03/17/22 12:34:30

fdbcli status json

I guess the cluster is stuck in the recovery state because I see the following in the status output:

"generation" : 45,

"recovery_state" : {
    "active_generations" : 7,
    "description" : "Accepting commits. All logs recruited.",
    "name" : "all_logs_recruited"
}
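
For reference, the fields above can be pulled out of the full status output with something like this (assuming jq is installed on the host; field paths follow the 6.3 status JSON layout):

```
# Extract just the generation and recovery state from the full status JSON
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' | jq '.cluster.generation, .cluster.recovery_state'
```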

I also see events related to recovery in the trace logs; one of them is MasterRecoveryDuration.

List of unique events
Severity="30"
MasterRecoveryDuration
RkSSListFetchTimeout

Severity="20"
N2_ReadError
Net2RunLoopTrace

Severity="10"
ClusterControllerMetrics
ClusterGetStatus
ConnectionClosed
ConnectionEstablished
ConnectionFrom
CoordinationPing
CoordinationPingSent
GenerationRegReadReply
GenerationRegReadRequest
GetLeaderReply
GetValueAttachID
GetValueDebug
Histogram
IncomingConnection
MachineLoadDetail
MachineMetrics
MasterMetrics
MasterRecoveredConfig
MasterRecoveryGenerations
MasterRecoveryState
MemoryMetrics
MonitorLeaderForProxiesGotClientInfo
NetworkMetrics
NotifyAddressFailed
NotifyAddressHealthy
PeerDestroy
PingLatency
ProcessMetrics
ProcessTimeOffset
ProgramStart
RecruitedMasterWorker
ResolverMetrics
RkUpdate
RkUpdateBatch
Role
SlowTask
SomewhatSlowRunLoopTop
TransactionDebug
TransactionMetrics

Examples of interesting events
<Event Severity="30" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryDuration" ID="0469ded4fbeabedf" RecoveryDuration="16.8381" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432663.487543" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:03Z" Type="MasterRecoveryGenerations" ID="0469ded4fbeabedf" ActiveGenerations="7" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryState" ID="0469ded4fbeabedf" StatusCode="12" Status="all_logs_recruited" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1647491031.065689" DateTime="2022-03-17T04:23:51Z" Type="RkSSListFetchTimeout" ID="9fea49a04a5d1838" SuppressedEventCount="19" Machine="65.108.100.113:4500" LogGroup="default" Roles="CD,RK" />
<Event Severity="20" Time="1647437668.876271" DateTime="2022-03-16T13:34:28Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1647437668.792293" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fecbf5fe730 0x7fecbf5fb5a1 0x2378b7f 0x237a428 0x79c212 0x7fecbf2c709b 0x7dd675" Machine="65.108.100.113:4506" LogGroup="default" />
<Event Severity="20" Time="1647516990.675524" DateTime="2022-03-17T11:36:30Z" Type="N2_ReadError" ID="86a0e5e07afd550c" SuppressedEventCount="0" ErrorCode="1" Message="stream truncated" Machine="65.108.100.113:4511" LogGroup="default" Roles="CC" />

Can I do something about this? How long can the recovery process take?

Since recovery_state is “Accepting commits. All logs recruited.”, your cluster has actually finished recovery and is ready to accept transaction commits.

The problem is more likely that Ratekeeper is throttling traffic. I see this event: Type="RkSSListFetchTimeout" ID="9fea49a04a5d1838" SuppressedEventCount="19" Machine="65.108.100.113:4500" LogGroup="default" Roles="CD,RK". It indicates that Ratekeeper is setting the TPS limit to 0 for reason 11 (storage_server_list_fetch_failed): essentially, the storage server(s) holding the storage server list are busy or have failed. So you might want to check whether there are unrecovered storage servers and try to bring them back.
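
As a rough sketch of how to check both things from the status JSON (not an exact recipe; field paths assume the 6.3 layout and jq on the host):

```
# Show why Ratekeeper is limiting performance, and list the processes that are
# currently running a storage role, to spot missing/unrecovered storage servers
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' > /tmp/status.json
jq '.cluster.qos.performance_limited_by' /tmp/status.json
jq -r '.cluster.processes[] | select(.roles[]?.role == "storage") | .address' /tmp/status.json
```

If some of the storage processes from the node that was down are missing from that list (or status details keeps reporting them as unreachable), those are the ones to bring back first.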