Cluster stuck in recovery after crash of one node

I have a FoundationDB v6.3.18 cluster deployed on 3 nodes (12 disks in total). Recently our cloud provider had an outage and one node in the cluster was down for several hours. The cloud provider recovered the node, but FoundationDB has been stuck in the recovery state for 40+ hours. The cluster stores roughly 10 TB in total.

foundationdb.conf

```
user = foundationdb
group = foundationdb

[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID:tls
listen_address = public
datadir = /data/$ID
logdir = /var/log/foundationdb
tls_ca_file=/etc/foundationdb/tls/fdbprodCA.pem
tls_certificate_file=/etc/foundationdb/tls/fdbprod.pem
tls_key_file=/etc/foundationdb/tls/fdbprod.key
tls_verify_peers=Check.Valid=0

[fdbserver.4500]
class=stateless
[fdbserver.4501]
class=storage
datadir=/data/nvme1n1/foundationdb/data/$ID
logdir=/data/nvme1n1/foundationdb
[fdbserver.4502]
class=storage
datadir=/data/nvme2n1/foundationdb/data/$ID
logdir=/data/nvme2n1/foundationdb
[fdbserver.4503]
class=storage
datadir=/data/nvme3n1/foundationdb/data/$ID
logdir=/data/nvme3n1/foundationdb
[fdbserver.4504]
class=transaction
datadir=/data/nvme4n1/foundationdb/data/$ID
logdir=/data/nvme4n1/foundationdb

[fdbserver.4505]
class=stateless
[fdbserver.4506]
class=stateless
[fdbserver.4507]
class=stateless
[fdbserver.4508]
class=stateless
[fdbserver.4509]
class=stateless
[fdbserver.4510]
class=stateless
[fdbserver.4511]
class=stateless
```

fdbcli status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
Redundancy mode - unknown
Storage engine - unknown
Coordinators - unknown
Usable Regions - unknown

Cluster:
FoundationDB processes - 36
Zones - 3
Machines - 3
Memory availability - 8.0 GB per process on machine with least available
Server time - 03/17/22 12:34:35

Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - unknown

Operating space:
Unable to retrieve operating space status

Workload:
Read rate - unknown
Write rate - unknown
Transactions started - unknown
Transactions committed - unknown
Conflict rate - unknown

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
65.108.100.113:4500:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4501:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4502:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4503:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4504:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.113:4505:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4506:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4507:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4508:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4509:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4510:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4511:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4500:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4504:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.115:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4506:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4508:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4511:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4500:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4504:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4506:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4508:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4511:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )

Coordination servers:
65.108.100.113:4500:tls (reachable)
65.108.100.115:4500:tls (reachable)
65.108.100.116:4500:tls (reachable)

Client time: 03/17/22 12:34:30

fdbcli status json

I guess the cluster is stuck in the recovery state because I see the following in the status output:

"generation" : 45,

"recovery_state" : {
    "active_generations" : 7,
    "description" : "Accepting commits. All logs recruited.",
    "name" : "all_logs_recruited"
}
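
For reference, the fields above can be pulled out of the full status output with something like this (assuming jq is installed on the host; field paths follow the 6.3 status JSON layout):

```
# Extract just the generation and recovery state from the full status JSON
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' | jq '.cluster.generation, .cluster.recovery_state'
```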

I also see events related to recovery in the trace logs; one of them is MasterRecoveryDuration.

List of unique events
Severity="30"
MasterRecoveryDuration
RkSSListFetchTimeout

Severity="20"
N2_ReadError
Net2RunLoopTrace

Severity="10"
ClusterControllerMetrics
ClusterGetStatus
ConnectionClosed
ConnectionEstablished
ConnectionFrom
CoordinationPing
CoordinationPingSent
GenerationRegReadReply
GenerationRegReadRequest
GetLeaderReply
GetValueAttachID
GetValueDebug
Histogram
IncomingConnection
MachineLoadDetail
MachineMetrics
MasterMetrics
MasterRecoveredConfig
MasterRecoveryGenerations
MasterRecoveryState
MemoryMetrics
MonitorLeaderForProxiesGotClientInfo
NetworkMetrics
NotifyAddressFailed
NotifyAddressHealthy
PeerDestroy
PingLatency
ProcessMetrics
ProcessTimeOffset
ProgramStart
RecruitedMasterWorker
ResolverMetrics
RkUpdate
RkUpdateBatch
Role
SlowTask
SomewhatSlowRunLoopTop
TransactionDebug
TransactionMetrics

Examples of interesting events
<Event Severity="30" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryDuration" ID="0469ded4fbeabedf" RecoveryDuration="16.8381" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432663.487543" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:03Z" Type="MasterRecoveryGenerations" ID="0469ded4fbeabedf" ActiveGenerations="7" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryState" ID="0469ded4fbeabedf" StatusCode="12" Status="all_logs_recruited" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1647491031.065689" DateTime="2022-03-17T04:23:51Z" Type="RkSSListFetchTimeout" ID="9fea49a04a5d1838" SuppressedEventCount="19" Machine="65.108.100.113:4500" LogGroup="default" Roles="CD,RK" />
<Event Severity="20" Time="1647437668.876271" DateTime="2022-03-16T13:34:28Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1647437668.792293" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fecbf5fe730 0x7fecbf5fb5a1 0x2378b7f 0x237a428 0x79c212 0x7fecbf2c709b 0x7dd675" Machine="65.108.100.113:4506" LogGroup="default" />
<Event Severity="20" Time="1647516990.675524" DateTime="2022-03-17T11:36:30Z" Type="N2_ReadError" ID="86a0e5e07afd550c" SuppressedEventCount="0" ErrorCode="1" Message="stream truncated" Machine="65.108.100.113:4511" LogGroup="default" Roles="CC" />

Can I do something about this? How long can the recovery process take?

Since recovery_state is “Accepting commits. All logs recruited.”, your cluster has actually finished recovery and is ready to accept transaction commits.

The problem is more likely that Ratekeeper is throttling traffic. I see this event: Type="RkSSListFetchTimeout" ID="9fea49a04a5d1838" SuppressedEventCount="19" Machine="65.108.100.113:4500" LogGroup="default" Roles="CD,RK". It indicates that Ratekeeper is setting the TPS limit to 0 for reason 11 (storage_server_list_fetch_failed): essentially, the storage server(s) holding the storage server list are busy or have failed. So you might want to check whether there are unrecovered storage servers and try to bring them back.
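
As a rough sketch of how to check both things from the status JSON (not an exact recipe; field paths assume the 6.3 layout and jq on the host):

```
# Show why Ratekeeper is limiting performance, and list the processes that are
# currently running a storage role, to spot missing/unrecovered storage servers
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' > /tmp/status.json
jq '.cluster.qos.performance_limited_by' /tmp/status.json
jq -r '.cluster.processes[] | select(.roles[]?.role == "storage") | .address' /tmp/status.json
```

If some of the storage processes from the node that was down are missing from that list (or status details keeps reporting them as unreachable), those are the ones to bring back first.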