I have a v6.3.18 foundation db cluster deployed on 3 nodes (and 12 disks). Recently cloud provider had an outage. One node in my cluster was down for several hours. Cloud provider recovered but foundationd DB stuck in the recovery state (for 40+ hours). Totaly ~10TB stored in the cluster.
foundationdb.conf
user = foundationdb
group = foundationdb
[general]
restart_delay = 60
restart_delay
cluster_file = /etc/foundationdb/fdb.cluster
[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID:tls
listen_address = public
datadir = /data/$ID
logdir = /var/log/foundationdb
tls_ca_file=/etc/foundationdb/tls/fdbprodCA.pem
tls_certificate_file=/etc/foundationdb/tls/fdbprod.pem
tls_key_file=/etc/foundationdb/tls/fdbprod.key
tls_verify_peers=Check.Valid=0
[fdbserver.4500]
class=stateless
[fdbserver.4501]
class=storage
datadir=/data/nvme1n1/foundationdb/data/$ID
logdir=/data/nvme1n1/foundationdb
[fdbserver.4502]
class=storage
datadir=/data/nvme2n1/foundationdb/data/$ID
logdir=/data/nvme2n1/foundationdb
[fdbserver.4503]
class=storage
datadir=/data/nvme3n1/foundationdb/data/$ID
logdir=/data/nvme3n1/foundationdb
[fdbserver.4504]
class=transaction
datadir=/data/nvme4n1/foundationdb/data/$ID
logdir=/data/nvme4n1/foundationdb
[fdbserver.4505]
class=stateless
[fdbserver.4506]
class=stateless
[fdbserver.4507]
class=stateless
[fdbserver.4508]
class=stateless
[fdbserver.4509]
class=stateless
[fdbserver.4510]
class=stateless
[fdbserver.4511]
class=stateless```
fdbcli status details
Using cluster file `/etc/foundationdb/fdb.cluster’.
Unable to read database configuration.
Configuration:
Redundancy mode - unknown
Storage engine - unknown
Coordinators - unknown
Usable Regions - unknown
Cluster:
FoundationDB processes - 36
Zones - 3
Machines - 3
Memory availability - 8.0 GB per process on machine with least available
Server time - 03/17/22 12:34:35
Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - unknown
Operating space:
Unable to retrieve operating space status
Workload:
Read rate - unknown
Write rate - unknown
Transactions started - unknown
Transactions committed - unknown
Conflict rate - unknown
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
65.108.100.113:4500:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4501:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4502:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4503:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4504:tls ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.113:4505:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4506:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4507:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4508:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4509:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4510:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.113:4511:tls ( 0% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4500:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.115:4504:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.5 GB / 8.0 GB RAM )
65.108.100.115:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4506:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4508:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.115:4511:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4500:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4501:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4502:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4503:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4504:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 3.8 GB / 8.0 GB RAM )
65.108.100.116:4505:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4506:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4507:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4508:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4509:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4510:tls ( 0% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
65.108.100.116:4511:tls ( 1% cpu; 0% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 8.0 GB RAM )
Coordination servers:
65.108.100.113:4500:tls (reachable)
65.108.100.115:4500:tls (reachable)
65.108.100.116:4500:tls (reachable)
Client time: 03/17/22 12:34:30
I guess that cluster stuck in recovery state because I see the following in status:
"generation" : 45,
"recovery_state" : {
"active_generations" : 7,
"description" : "Accepting commits. All logs recruited.",
"name" : "all_logs_recruited"
}
Also I see in the log events related to recovery, one of them is MasterRecoveryDuration
List of unique events
Severity="30"
MasterRecoveryDuration
RkSSListFetchTimeout
Severity="20"
N2_ReadError
Net2RunLoopTrace
Severity="10"
ClusterControllerMetrics
ClusterGetStatus
ConnectionClosed
ConnectionEstablished
ConnectionFrom
CoordinationPing
CoordinationPingSent
GenerationRegReadReply
GenerationRegReadRequest
GetLeaderReply
GetValueAttachID
GetValueDebug
Histogram
IncomingConnection
MachineLoadDetail
MachineMetrics
MasterMetrics
MasterRecoveredConfig
MasterRecoveryGenerations
MasterRecoveryState
MemoryMetrics
MonitorLeaderForProxiesGotClientInfo
NetworkMetrics
NotifyAddressFailed
NotifyAddressHealthy
PeerDestroy
PingLatency
ProcessMetrics
ProcessTimeOffset
ProgramStart
RecruitedMasterWorker
ResolverMetrics
RkUpdate
RkUpdateBatch
Role
SlowTask
SomewhatSlowRunLoopTop
TransactionDebug
TransactionMetrics
Examples of interesting events
<Event Severity="30" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryDuration" ID="0469ded4fbeabedf" RecoveryDuration="16.8381" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432663.487543" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:03Z" Type="MasterRecoveryGenerations" ID="0469ded4fbeabedf" ActiveGenerations="7" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="10" Time="1647515695.739040" OriginalTime="1647432679.971453" DateTime="2022-03-17T11:14:55Z" OriginalDateTime="2022-03-16T12:11:19Z" Type="MasterRecoveryState" ID="0469ded4fbeabedf" StatusCode="12" Status="all_logs_recruited" Machine="65.108.100.113:4505" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1647491031.065689" DateTime="2022-03-17T04:23:51Z" Type="RkSSListFetchTimeout" ID="9fea49a04a5d1838" SuppressedEventCount="19" Machine="65.108.100.113:4500" LogGroup="default" Roles="CD,RK" />
<Event Severity="20" Time="1647437668.876271" DateTime="2022-03-16T13:34:28Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1647437668.792293" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fecbf5fe730 0x7fecbf5fb5a1 0x2378b7f 0x237a428 0x79c212 0x7fecbf2c709b 0x7dd675" Machine="65.108.100.113:4506" LogGroup="default" />
<Event Severity="20" Time="1647516990.675524" DateTime="2022-03-17T11:36:30Z" Type="N2_ReadError" ID="86a0e5e07afd550c" SuppressedEventCount="0" ErrorCode="1" Message="stream truncated" Machine="65.108.100.113:4511" LogGroup="default" Roles="CC" />
Can I do somethings with that? How much time can recovery process get?