Hi, while deploying FDB to a large cluster (60 nodes, 480 disks, 1320 processes), we found that the cluster's Replication health has been stuck at "(Re)initializing automatic data distribution". A cluster of half that size deploys without any problem.
Checking the logs, we found that DD reports some errors and is still unable to recruit a healthy server team.
# fdbserver --version
FoundationDB 7.1 (v7.1.27)
source version dc4bd7ef6fcda345276073599ed6a2b28b089e90
protocol fdb00b071010000
# fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.
The database is available.
Welcome to the fdbcli. For help, type `help'.
fdb>
fdb> status
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
  Redundancy mode        - three_datacenter
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Logs           - 12
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1320
  Zones                  - 60
  Machines               - 60
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 109 Hz
  Fault Tolerance        - 3 machines
  Server time            - 02/15/23 14:26:30

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 114.535 GB

Operating space:
  Storage server         - 6855.3 GB free on most full server
  Log server             - 6855.4 GB free on most full server

Workload:
  Read rate              - 257 Hz
  Write rate             - 0 Hz
  Transactions started   - 14 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 02/15/23 14:26:29
fdb>
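In case it helps anyone who wants to watch for this state from a script, here is a rough Python sketch that reads the text `status` output shown above and checks whether Replication health is still initializing. `parse_status` and `dd_initializing` are hypothetical helper names, not part of FDB, and the parsing only assumes the "Section:" / "field - value" layout seen in this post. For real monitoring, `status json` is probably a better source.

```python
def parse_status(text):
    """Map each 'Section:' heading to a dict of its 'field - value' lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A heading such as "Data:" starts a new section.
            current = sections.setdefault(line[:-1], {})
        elif " - " in line and current is not None:
            # Split on the first " - " only; values may contain more dashes.
            key, value = line.split(" - ", 1)
            current[key.strip()] = value.strip()
    return sections


def dd_initializing(sections):
    """True while the Data section still reports DD as (re)initializing."""
    health = sections.get("Data", {}).get("Replication health", "")
    return health.startswith("(Re)initializing")


# Excerpt copied from the status output in this post.
sample_status = """\
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 114.535 GB
"""

status = parse_status(sample_status)
print(dd_initializing(status))
```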
The relevant log snippets are as follows:
<Event Severity="10" Time="1676469574.195800" DateTime="2023-02-15T13:59:34Z" Type="DumpToken" ID="9647cfdd6dc42c37" Name="recruited.waitFailure" Token="050cab77aaa22633" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.195800" DateTime="2023-02-15T13:59:34Z" Type="DatabaseContextCreated" ID="027acbf78b76f6a4" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x3753c3c 0x2d64121 0x2d676e6 0x1fe9e37 0x12e8184 0x2009768 0x2009a1e 0xc944a4 0x355b1f3 0x355b5db 0x11e8760 0x36ee806 0xa776e9 0x7f8620a9e555" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.195800" DateTime="2023-02-15T13:59:34Z" Type="DataDistributorRunning" ID="9647cfdd6dc42c37" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.195800" DateTime="2023-02-15T13:59:34Z" Type="DatabaseContextCreated" ID="a4975c5989d98d07" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x3753c3c 0x2d64121 0x2d676e6 0x1fe9e37 0x12ced0e 0x12cf463 0x12e7e00 0x12e8210 0x2009768 0x2009a1e 0xc944a4 0x355b1f3 0x355b5db 0x11e8760 0x36ee806 0xa776e9 0x7f8620a9e555" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
...
<Event Severity="10" Time="1676469574.210912" DateTime="2023-02-15T13:59:34Z" Type="DDInitTookMoveKeysLock" ID="9647cfdd6dc42c37" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.212116" DateTime="2023-02-15T13:59:34Z" Type="DDInitGotConfiguration" ID="9647cfdd6dc42c37" Conf="{"backup_worker_enabled":0,"blob_granules_enabled":0,"log_spill":2,"logs":12,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"0","redundancy_mode":"three_datacenter","storage_engine":"ssd-2","storage_migration_type":"disabled","tenant_mode":"disabled","usable_regions":1}" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.213482" DateTime="2023-02-15T13:59:34Z" Type="DDInitUpdatedReplicaKeys" ID="9647cfdd6dc42c37" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.234713" DateTime="2023-02-15T13:59:34Z" Type="HugeArenaSample" ID="0000000000000000" Count="1" Size="78084" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x36ac71e 0x368dc48 0x368e0d6 0x355eb80 0x11e8760 0x36ee806 0xa776e9 0x7f8620a9e555" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.256349" DateTime="2023-02-15T13:59:34Z" Type="DDInitGotInitialDD" ID="9647cfdd6dc42c37" B="" E="\xff\xff" Src="4323f8147a78392a,6326c89b8eb4ad4f,7064281b2b53e9b7,7e3c88d189df961f,8f694905f7c2e94d,f7c84302691f7954" Dest="[no items]" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="TrackInitialShards" ID="9647cfdd6dc42c37" InitialShardCount="2" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="DDTrackerStarting" ID="9647cfdd6dc42c37" State="Inactive" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="AddedStorageServer" ID="9647cfdd6dc42c37" ServerID="540c6d4faf750f00" ProcessID="157f53f4c3dde10629100fe6025c22fb" ProcessClass="storage" WaitFailureToken="7419393a57b74a05" Address="10.214.177.38:5513" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
...
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="AddedStorageServer" ID="9647cfdd6dc42c37" ServerID="fb8efa60867cf8ff" ProcessID="48ca5e1775e1139282b9467758701e78" ProcessClass="storage" WaitFailureToken="7df731c680f8410d" Address="10.249.122.29:5506" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="ServerTeamTrackerStarting" ID="9647cfdd6dc42c37" Reason="Initial wait complete (sc)" ServerTeam="TeamID f09682a5f93ab824; Size 6; 10.227.154.32:5500 4323f8147a78392a, 10.214.177.40:5502 6326c89b8eb4ad4f, 10.249.122.28:5513 7064281b2b53e9b7, 10.227.154.38:5513 7e3c88d189df961f, 10.214.177.26:5513 8f694905f7c2e94d, 10.249.122.39:5505 f7c84302691f7954" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="ServerTeamHealthChangeDetected" ID="9647cfdd6dc42c37" ServerTeam="TeamID f09682a5f93ab824; Size 6; 10.227.154.32:5500 4323f8147a78392a, 10.214.177.40:5502 6326c89b8eb4ad4f, 10.249.122.28:5513 7064281b2b53e9b7, 10.227.154.38:5513 7e3c88d189df961f, 10.214.177.26:5513 8f694905f7c2e94d, 10.249.122.39:5505 f7c84302691f7954" Primary="1" IsReady="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.256975" DateTime="2023-02-15T13:59:34Z" Type="TeamCollectionInfo" ID="9647cfdd6dc42c37" Primary="1" AddedTeams="0" TeamsToBuild="0" CurrentServerTeams="0" DesiredTeams="5400" MaxTeams="27000" StorageTeamSize="6" CurrentMachineTeams="0" CurrentHealthyMachineTeams="0" DesiredMachineTeams="300" MaxMachineTeams="1500" TotalHealthyMachines="60" MinTeamsOnServer="0" MaxTeamsOnServer="0" MinMachineTeamsOnMachine="0" MaxMachineTeamsOnMachine="0" DoBuildTeams="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="20" Time="1676469574.389380" DateTime="2023-02-15T13:59:34Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1676469574.389671" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f8620e59630 0x7f8620e58aeb 0x36f5b53 0x370aa84 0x355e811 0x11e8760 0x36ee806 0xa776e9 0x7f8620a9e555 0xae1a42" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.417293" DateTime="2023-02-15T13:59:34Z" Type="ServerTeamHealthChangeDetected" ID="9647cfdd6dc42c37" ServerTeam="TeamID f09682a5f93ab824; Size 6; 10.227.154.32:5500 4323f8147a78392a, 10.214.177.40:5502 6326c89b8eb4ad4f, 10.249.122.28:5513 7064281b2b53e9b7, 10.227.154.38:5513 7e3c88d189df961f, 10.214.177.26:5513 8f694905f7c2e94d, 10.249.122.39:5505 f7c84302691f7954" Primary="1" IsReady="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.417293" DateTime="2023-02-15T13:59:34Z" Type="TeamCollectionInfo" ID="9647cfdd6dc42c37" Primary="1" AddedTeams="0" TeamsToBuild="0" CurrentServerTeams="1" DesiredTeams="5400" MaxTeams="27000" StorageTeamSize="6" CurrentMachineTeams="1" CurrentHealthyMachineTeams="1" DesiredMachineTeams="300" MaxMachineTeams="1500" TotalHealthyMachines="60" MinTeamsOnServer="0" MaxTeamsOnServer="1" MinMachineTeamsOnMachine="0" MaxMachineTeamsOnMachine="1" DoBuildTeams="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="DDTeamCollectionBegin" ID="9647cfdd6dc42c37" Primary="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="DDTeamCollectionReadyToStart" ID="9647cfdd6dc42c37" Primary="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="TeamCollectionInfo" ID="9647cfdd6dc42c37" Primary="1" AddedTeams="0" TeamsToBuild="0" CurrentServerTeams="1" DesiredTeams="5400" MaxTeams="27000" StorageTeamSize="6" CurrentMachineTeams="1" CurrentHealthyMachineTeams="1" DesiredMachineTeams="300" MaxMachineTeams="1500" TotalHealthyMachines="60" MinTeamsOnServer="0" MaxTeamsOnServer="1" MinMachineTeamsOnMachine="0" MaxMachineTeamsOnMachine="1" DoBuildTeams="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="DDRecruiting" ID="0000000000000000" Primary="1" State="Sending request to CC" Exclusions="1080" Critical="0" IncludedDCsSize="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="StorageServerRecruitment" ID="9647cfdd6dc42c37" State="Idle" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="DDMonitorHealthyTeamsStart" ID="0000000000000000" ZeroHealthyTeams="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.418146" DateTime="2023-02-15T13:59:34Z" Type="TotalDataInFlight" ID="9647cfdd6dc42c37" Primary="1" TotalBytes="0" UnhealthyServers="0" ServerCount="1080" StorageTeamSize="6" HighestPriority="140" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.423461" DateTime="2023-02-15T13:59:34Z" Type="DDTrackerStats" ID="9647cfdd6dc42c37" Shards="1" TotalSizeBytes="2005500" SystemSizeBytes="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1676469574.424501" DateTime="2023-02-15T13:59:34Z" Type="PerpetualStorageWiggleClose" ID="9647cfdd6dc42c37" Primary="1" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1676469574.425148" DateTime="2023-02-15T13:59:34Z" Type="DDExcludedServersChanged" ID="9647cfdd6dc42c37" AddressesExcluded="0" AddressesFailed="0" LocalitiesExcluded="0" LocalitiesFailed="0" ThreadID="9846941102884159635" Machine="10.214.177.33:7500" LogGroup="default" Roles="DD" />
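To make the team-building progress in the snippets above easier to follow, here is a minimal Python sketch that pulls the `TeamCollectionInfo` counters out of trace lines. It uses a regular expression instead of an XML parser because some events (for example the `Conf` attribute of `DDInitGotConfiguration`) contain unescaped quotes. `parse_attrs` and `team_progress` are hypothetical helper names, and the sample lines are abbreviated from the events above (only the attributes being read are kept).

```python
import re

# Matches Name="value" pairs. Good enough for the attributes read here;
# it is not a general XML parser.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_attrs(line):
    """Return the attributes of one trace <Event ... /> line as a dict."""
    return dict(ATTR_RE.findall(line))


def team_progress(lines):
    """Yield (Time, CurrentServerTeams, DesiredTeams, CurrentHealthyMachineTeams)
    for every TeamCollectionInfo event, in input order."""
    for line in lines:
        attrs = parse_attrs(line)
        if attrs.get("Type") != "TeamCollectionInfo":
            continue
        yield (
            attrs["Time"],
            int(attrs["CurrentServerTeams"]),
            int(attrs["DesiredTeams"]),
            int(attrs["CurrentHealthyMachineTeams"]),
        )


# Abbreviated copies of events from the log above.
sample = [
    '<Event Severity="10" Time="1676469574.256975" Type="TeamCollectionInfo" '
    'CurrentServerTeams="0" DesiredTeams="5400" CurrentHealthyMachineTeams="0" />',
    '<Event Severity="10" Time="1676469574.417293" Type="TeamCollectionInfo" '
    'CurrentServerTeams="1" DesiredTeams="5400" CurrentHealthyMachineTeams="1" />',
    '<Event Severity="10" Time="1676469574.418146" Type="DDTeamCollectionBegin" '
    'Primary="1" />',
]

rows = list(team_progress(sample))
for row in rows:
    print(row)
```

In the events above, the only team that exists is the initial 6-server team; CurrentServerTeams goes from 0 to 1 while DesiredTeams stays at 5400. For reference, DesiredTeams (5400) is 5 times the ServerCount of 1080 reported by TotalDataInFlight, and MaxTeams (27000) is 25 times that count.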
Is this because we have reached the upper limit of FDB cluster size?