Data distribution / Disk usage uneven: bifurcated at 2 tiers

Data distribution at our FDB cluster’s DC1 has been relatively even, but DC3 has become uneven recently.

Five days ago, the distribution at DC1 was very even: disk usages of all nodes (Kubernetes pods in our case) were at 67%.

However, at DC3 the disk usage is uneven: some pods are at 61%, some at 71%, and one pod reached 79%. We got an alert (the threshold is 75%). The 79% spike came down one day later, but disk usage has stayed split into 2 tiers: about one-third of all 60 storage pods are at the lower tier, and the other two-thirds are at the higher tier.

We loaded data recently. The values of the 2 tiers have increased, but the difference between the 2 tiers (about 10 percentage points) remains. Here are sample values of the 2 tiers:
– 61% and 71% (five days ago)
– 63% and 72% (four days ago)
– 64% and 74% (today)

Here are the disk usages for the first 10 storage pods today:

POD_NAME                                 Size  Used Avail Use% Mounted on
---------------------------------------- ----- ---- ----- ---- ----------------------
xid2-storage-01-5b5f599588-wp59h         340G  251G  90G  74%  /var/lib/foundationdb
xid2-storage-02-76c6b46c4-dj6z6          340G  217G 124G  64%  /var/lib/foundationdb
xid2-storage-03-69b66fc568-qbvg2         340G  251G  90G  74%  /var/lib/foundationdb
xid2-storage-04-c95ddcb98-6ljff          340G  252G  89G  74%  /var/lib/foundationdb
xid2-storage-05-6cd5f5b77b-lbb5v         340G  252G  89G  75%  /var/lib/foundationdb
xid2-storage-06-78bc5fb4db-mqzc5         340G  216G 125G  64%  /var/lib/foundationdb
xid2-storage-07-6966fb97fb-j6jsx         340G  251G  90G  74%  /var/lib/foundationdb
xid2-storage-08-7555965d4b-67h8r         340G  217G 124G  64%  /var/lib/foundationdb
xid2-storage-09-55946fdcb4-kdh7f         340G  250G  91G  74%  /var/lib/foundationdb
xid2-storage-10-6c6cd68cd4-mvcw8         340G  250G  91G  74%  /var/lib/foundationdb

Is there a way to fix this 2-tiered data distribution issue?

Thanks.

Is there any hot spot among these storage servers? If the storage servers with low space utilization are heavily read or written, they may keep less data for load-balancing purposes. DD uses both space and write throughput to balance data among SSes.

How about the data movement activity on the cluster? Event RelocateShard can be useful.

The events BgDDMountainChopper and BgDDValleyFiller are also useful. They show what decisions DD made to balance data among SSes.

If none of the above helps, does the problem still exist after you kill the DD role and let it restart?

How does reducing data size for given SS help in balancing of load?

If FDB notices that most updates are against a small key range, it will split the range onto two storage servers. That way, two storage servers handle the updates instead of one.

FDB also tries to balance space utilization across nodes, but sometimes the two goals conflict.

If the emptier machines tend to be busier than the full machines, or if you expect lots of shifting write workloads in your setup, then it suggests the problem might have something to do with load balancing instead of space balancing.

Meng, I pulled sample records for the events you mentioned. Don’t know how to interpret them. Can you tell me what you see? Thanks.

/var/log/foundationdb# g -w RelocateShard *  | head -3
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437497.095264" Type="RelocateShard" ID="62a7ab6bb8577a81" BeginPair="bd53650c6553f07e" KeyBegin="\x159\x01P\x00\xff\x00\xff\x03\xa92\xd1" KeyEnd="\x159\x01P\x00\xff\x00\xff\x03\xae\xf7?" Priority="121" RelocationID="bd53650c6553f07e" SuppressedEventCount="0" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437518.569795" Type="RelocateShard" ID="62a7ab6bb8577a81" EndPair="6f6235e5c34360f2" Duration="83.9866" Result="Success" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437519.172733" Type="RelocateShard" ID="62a7ab6bb8577a81" BeginPair="22e7cc42dba0b24b" KeyBegin="\x159\x01X\x00\xff\x00\xff\x02\xd6\x19\xa3" KeyEnd="\x159\x01X\x00\xff\x00\xff\x02\xd9w\xc7" Priority="121" RelocationID="22e7cc42dba0b24b" SuppressedEventCount="0" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />

/var/log/foundationdb# g -w BgDDMountainChopper *  | head -3
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437497.080237" Type="BgDDMountainChopper" ID="62a7ab6bb8577a81" SourceBytes="97009257126" DestBytes="96134743075" ShardBytes="193411567" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.216.165:4500:tls 1dcaa3a49eefa427, 10.104.222.138:4500:tls 9781fc81b9c02df2, 10.104.218.181:4500:tls c2842996384c3e8c" DestTeam="Size 3; 10.104.222.182:4501:tls 64106db0bb815006, 10.104.220.188:4501:tls b471df16ae348822, 10.104.219.166:4500:tls ca906961ccf88e40" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437519.157230" Type="BgDDMountainChopper" ID="62a7ab6bb8577a81" SourceBytes="97077445149" DestBytes="95776258460" ShardBytes="192051255" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.216.165:4500:tls 1dcaa3a49eefa427, 10.104.222.138:4500:tls 9781fc81b9c02df2, 10.104.218.181:4500:tls c2842996384c3e8c" DestTeam="Size 3; 10.104.221.170:4500:tls 17f8dce48198edff, 10.104.222.172:4501:tls 3f30e62f68d5ea66, 10.104.219.166:4501:tls cdfec94d3910a996" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437529.548800" Type="BgDDMountainChopper" ID="62a7ab6bb8577a81" SourceBytes="97104355369" DestBytes="95696703530" ShardBytes="203295749" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.216.165:4500:tls 1dcaa3a49eefa427, 10.104.222.138:4500:tls 9781fc81b9c02df2, 10.104.218.181:4500:tls c2842996384c3e8c" DestTeam="Size 3; 10.104.222.172:4501:tls 3f30e62f68d5ea66, 10.104.218.166:4501:tls 6e3c03a54d7957da, 10.104.219.195:4500:tls 6ea914748c494668" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />

/var/log/foundationdb# g -w BgDDValleyFiller *  | head -3
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437530.509963" Type="BgDDValleyFiller" ID="62a7ab6bb8577a81" SourceBytes="97150749865" DestBytes="68060856830" ShardBytes="186806761" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.220.203:4501:tls 44d0c699bd372f5c, 10.104.218.129:4501:tls 74bc3fd9a5088153, 10.104.222.162:4501:tls f778cd1298f9fa8b" DestTeam="Size 3; 10.104.218.238:4501:tls 40864c06951a1db7, 10.104.218.166:4501:tls 6e3c03a54d7957da, 10.104.222.148:4500:tls 7e85d82e329e99cc" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437554.751788" Type="BgDDValleyFiller" ID="62a7ab6bb8577a81" SourceBytes="96866242568" DestBytes="68154018253" ShardBytes="186967912" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.220.138:4501:tls 8d9112406bd383d2, 10.104.222.138:4500:tls 9781fc81b9c02df2, 10.104.218.181:4500:tls c2842996384c3e8c" DestTeam="Size 3; 10.104.218.238:4500:tls 0bd6c45fe34b8518, 10.104.222.184:4501:tls a9adc56207b6e65f, 10.104.219.197:4500:tls f7c606591793bf91" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />
trace.10.104.219.200.4300.1591742762.lLtRwM.2.571.xml:<Event Severity="10" Time="1596437581.004487" Type="BgDDValleyFiller" ID="62a7ab6bb8577a81" SourceBytes="96962906782" DestBytes="68412667180" ShardBytes="223467332" AverageShardBytes="175131940" SourceTeam="Size 3; 10.104.220.138:4501:tls 8d9112406bd383d2, 10.104.222.138:4500:tls 9781fc81b9c02df2, 10.104.218.181:4500:tls c2842996384c3e8c" DestTeam="Size 3; 10.104.218.238:4500:tls 0bd6c45fe34b8518, 10.104.218.180:4500:tls d7ad8c9c8238a86c, 10.104.220.173:4500:tls eb885625d03dd08b" Machine="10.104.219.200:4300" LogGroup="default" Roles="DD" />

Sorry, my earlier statement that DD uses both space and write throughput to balance data among SSes was incorrect.

When DD balances data between SSes, it only considers space. But DD splits shards based on write throughput. The split shards can then, hopefully, be spread out when DD balances space.

The Priority field of RelocateShard tells you what type of data movement is happening. The priority definitions are at https://github.com/xumengpanda/foundationdb/blob/859c3145e2830a7f409ee1de3ec7baae11274759/fdbserver/Knobs.cpp#L123-L136

You can aggregate those events to see the major reasons DD is moving data.
Priority 121 indicates rebalancing for an overutilized team, which means DD is doing work to rebalance data. But that is only one event. You need aggregated data to draw a meaningful conclusion.
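
A minimal sketch of that aggregation in Python, assuming the trace lines look like the grep output earlier in this thread (one `<Event ... />` per line, possibly prefixed with a file name). The function name and the `KNOWN_PRIORITIES` table are illustrative. Only the meaning of 121 comes from this thread, and the sample lines are shortened copies of the ones posted above.

```python
import re
from collections import Counter

# Matches attr="value" pairs inside a trace <Event ... /> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Only 121 is taken from this thread; look up the others in Knobs.cpp.
KNOWN_PRIORITIES = {"121": "rebalance overutilized team"}

def count_relocation_priorities(lines):
    """Count RelocateShard events that carry a Priority attribute."""
    counts = Counter()
    for line in lines:
        attrs = dict(ATTR_RE.findall(line))
        # In the sample above, only the begin events have Priority;
        # the end events have Duration and Result instead.
        if attrs.get("Type") == "RelocateShard" and "Priority" in attrs:
            counts[attrs["Priority"]] += 1
    return counts

# Shortened sample lines (IDs truncated for readability).
sample = [
    'trace.xml:<Event Severity="10" Type="RelocateShard" BeginPair="bd53" Priority="121" Roles="DD" />',
    'trace.xml:<Event Severity="10" Type="RelocateShard" EndPair="6f62" Duration="83.9866" Result="Success" Roles="DD" />',
    'trace.xml:<Event Severity="10" Type="RelocateShard" BeginPair="22e7" Priority="121" Roles="DD" />',
]

for priority, n in count_relocation_priorities(sample).most_common():
    print(priority, n, KNOWN_PRIORITIES.get(priority, "see Knobs.cpp"))
```

To run it over real logs, pass the lines of the trace files instead of `sample`, for example by reading the files returned by `glob.glob("trace.*.xml")`.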

BgDDMountainChopper and BgDDValleyFiller tell you what data movement decisions DD is making: moving a shard from the SourceTeam to the DestTeam. The logic is here: https://github.com/xumengpanda/foundationdb/blob/859c3145e2830a7f409ee1de3ec7baae11274759/fdbserver/DataDistributionQueue.actor.cpp#L1208
These events give a rough idea of how DD is moving data: from which SSes to which SSes.
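
As a rough illustration, the SourceBytes and DestBytes fields of these events can be compared directly. The sketch below assumes the same one-event-per-line format as the grep output in this thread. The function name is hypothetical, and the two sample lines keep only the byte counts from the first BgDDMountainChopper and BgDDValleyFiller events posted above.

```python
import re

# Matches attr="value" pairs inside a trace <Event ... /> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
REBALANCE_TYPES = {"BgDDMountainChopper", "BgDDValleyFiller"}

def summarize_rebalance(lines):
    """Return (type, source_bytes, dest_bytes, gap_in_GB) per rebalance event."""
    rows = []
    for line in lines:
        attrs = dict(ATTR_RE.findall(line))
        if attrs.get("Type") in REBALANCE_TYPES:
            src = int(attrs["SourceBytes"])
            dst = int(attrs["DestBytes"])
            # Gap between source and destination team, in decimal gigabytes.
            rows.append((attrs["Type"], src, dst, round((src - dst) / 1e9, 1)))
    return rows

# Shortened sample lines: only the fields used here are kept.
sample = [
    '<Event Type="BgDDMountainChopper" SourceBytes="97009257126" DestBytes="96134743075" />',
    '<Event Type="BgDDValleyFiller" SourceBytes="97150749865" DestBytes="68060856830" />',
]

for event_type, src, dst, gap_gb in summarize_rebalance(sample):
    print(event_type, src, dst, gap_gb)
```

On these two samples the gap is about 0.9 GB for the BgDDMountainChopper event and about 29.1 GB for the BgDDValleyFiller event, so the destination teams of the latter are much emptier than the source teams.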

Note: A team is a set of 3 storage servers that store the same data. A storage server can be grouped with other SSes into many teams. A shard is assigned to an SS team. The three SSes in the team hold the three replicas of the shard. For example, say we have two teams, Team1 = {SS1, SS2, SS3} and Team2 = {SS1, SS2, SS4}, and we assign shard1 to Team1 and shard2 to Team2. Now SS1 has both shard1 and shard2, while SS4 has only shard2. So if the localities of the SSes are heavily skewed, say one fault-tolerance zone (by default it is zoneId) has a lot of SSes while another has very few, then the zone with very few SSes may be forced to host more data. One event related to teams is TeamCollectionInfo. Unfortunately, that event does not provide detailed team information, but the overview it provides can sometimes be useful.
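
The team example can be worked through in a few lines of Python. This is only a toy model of the Team1/Team2 example above, not how DD represents teams internally, and the function name is made up for illustration.

```python
from collections import defaultdict

# The two teams and shard assignments from the example above.
teams = {
    "Team1": {"SS1", "SS2", "SS3"},
    "Team2": {"SS1", "SS2", "SS4"},
}
shard_to_team = {"shard1": "Team1", "shard2": "Team2"}

def shards_per_server(teams, shard_to_team):
    """Map each storage server to the shards it holds a replica of."""
    held = defaultdict(list)
    # Iterate shards in sorted order so each server's list is sorted too.
    for shard, team in sorted(shard_to_team.items()):
        for ss in teams[team]:
            held[ss].append(shard)
    return dict(held)

placement = shards_per_server(teams, shard_to_team)
for ss in sorted(placement):
    print(ss, placement[ss])
```

SS1 and SS2 belong to both teams, so they each end up with shard1 and shard2, while SS3 holds only shard1 and SS4 holds only shard2.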