Data distribution at our FDB cluster’s DC1 has been relatively even, but DC3 has become uneven recently.
Five days ago, the distribution at DC1 was very even: disk usages of all nodes (Kubernetes pods in our case) were at 67%.
However, at DC3 the disk usages are uneven: some at 61%, some at 71%, and one pod at 79%, which triggered an alert (threshold: greater than 75%). The 79% spike came down a day later, but disk usage has stayed split into two tiers: about one-third of the 60 storage pods in the lower tier, and the other two-thirds in the higher tier.
We loaded data recently. The values of both tiers have increased, but the difference between them (about 10 percentage points) remains. Here are sample values of the 2 tiers:
– 61% and 71% (five days ago)
– 63% and 72% (four days ago)
– 64% and 74% (today)
Here are the disk usages for the first 10 storage pods today:
Is there any hot spot among these storage servers? If the storage servers with low space utilization are heavily read from or written to, they may intentionally hold less data for load-balancing purposes. DD uses both space and write throughput to balance data among SSes.
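One way to check for hot spots is to pull per-storage-server throughput out of `fdbcli --exec "status json"` and flag outliers. The sketch below assumes the field names of recent FDB versions (`stored_bytes`, `input_bytes.hz`, `total_queries.hz` under each storage role); verify them against your version's status output before relying on it.

```python
import json  # status json can be loaded with json.loads(fdbcli output)

def storage_metrics(status):
    """Extract per-storage-server space and throughput from parsed `status json`.

    Field names match recent FDB versions but may vary; treat as a sketch.
    """
    rows = []
    for proc in status["cluster"]["processes"].values():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            rows.append({
                "address": proc.get("address"),
                "stored_bytes": role.get("stored_bytes", 0),
                "write_hz": role.get("input_bytes", {}).get("hz", 0.0),
                "read_hz": role.get("total_queries", {}).get("hz", 0.0),
            })
    return rows

def hot_spots(rows, factor=2.0):
    """Flag SSes whose write or read rate exceeds `factor` times the mean."""
    if not rows:
        return []
    mean_w = sum(r["write_hz"] for r in rows) / len(rows)
    mean_r = sum(r["read_hz"] for r in rows) / len(rows)
    return [r["address"] for r in rows
            if (mean_w and r["write_hz"] > factor * mean_w)
            or (mean_r and r["read_hz"] > factor * mean_r)]
```

If the flagged addresses line up with the lower-tier pods, that supports the theory that DD is deliberately keeping less data on the busy servers.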
How about the data movement activity on the cluster? The event RelocateShard can be useful.
The events BgDDMountainChopper and BgDDValleyFiller are also useful. They show what decisions DD made to balance data among SSes.
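These events appear in the XML trace files each fdbserver process writes. A quick way to get an overview is to count them across the log directory; the sketch below assumes the default one-event-per-line XML trace format and the default `trace.*.xml` file naming.

```python
import glob
import re
from collections import Counter

# Each trace event is a single <Event ... Type="..." ... /> line in the XML logs.
EVENT_RE = re.compile(r'Type="(RelocateShard|BgDDMountainChopper|BgDDValleyFiller)"')

def count_dd_events(lines):
    """Count data-distribution events in an iterable of trace-log lines."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def scan_logs(log_dir):
    """Aggregate DD event counts across all trace files in a directory."""
    counts = Counter()
    for path in glob.glob(f"{log_dir}/trace.*.xml"):
        with open(path) as f:
            counts += count_dd_events(f)
    return counts
```

A steady stream of BgDDMountainChopper/BgDDValleyFiller events means DD is actively trying to rebalance; near-zero counts mean it considers the cluster balanced, which would itself be informative here.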
If none of the above explains it, does the problem persist if you kill the DD role and let it restart?
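To kill the DD role you first need to find which process is running it. A sketch of locating it from parsed `status json` output is below; the role name `data_distributor` assumes FDB 6.2+, where DD runs as its own role (in older versions it lives inside the master process).

```python
def data_distributor_address(status):
    """Return the address of the process running the data_distributor role.

    Assumes FDB 6.2+ role naming; returns None if no such role is found.
    """
    for proc in status["cluster"]["processes"].values():
        roles = proc.get("roles", [])
        if any(r.get("role") == "data_distributor" for r in roles):
            return proc.get("address")
    return None
```

You can then kill that process from fdbcli (running `kill` with no arguments first populates the target list, then `kill <address>` performs the kill). DD restarts automatically on another process.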
If FDB notices that most updates are against a small key range, it will split the range onto two storage servers. That way, two storage servers handle the updates instead of one.
FDB also tries to balance space utilization across nodes, but sometimes the two goals conflict.
If the emptier machines tend to be busier than the fuller ones, or if your workload has lots of shifting write hot spots, that suggests the problem is related to load balancing rather than space balancing.
When DD balances data between SSes, it considers only space. But DD splits shards based on write throughput, and the split shards can then, hopefully, be spread out when DD balances space.
You can aggregate those events to see what the major reasons are for DD to move data.
Priority 121 indicates rebalancing for an overutilized team, which means DD is doing work to rebalance data. But that is only one event; you have to aggregate the data to draw a meaningful conclusion.
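Building on the idea of aggregating, a histogram of the Priority field on RelocateShard events shows which reason dominates. The sketch below assumes the Priority attribute appears after Type on each event line, as in the default XML trace format; the mapping from priority number to reason (e.g. 121 = rebalance overutilized team) is defined in the DD source for your FDB version.

```python
import re
from collections import Counter

# Assumes attribute order Type=... then Priority=... on each trace line.
PRIORITY_RE = re.compile(r'Type="RelocateShard".*?Priority="(\d+)"')

def priority_histogram(lines):
    """Histogram of RelocateShard priorities across trace-log lines.

    A single dominant priority suggests the main reason DD is moving data.
    """
    counts = Counter()
    for line in lines:
        m = PRIORITY_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

If the histogram is dominated by the rebalance priorities, DD is actively fighting the imbalance; if it is dominated by split- or failure-related priorities, the data movement has a different cause.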
Note: a team is a set of 3 storage servers that store the same data. A storage server can be grouped with other SSes into many teams. A shard is assigned to an SS team, and the three SSes in the team hold the three replicas of the shard. For example, suppose we have two teams, Team1 = {SS1, SS2, SS3} and Team2 = {SS1, SS2, SS4}, and we assign shard1 to Team1 and shard2 to Team2. Now SS1 has both shard1 and shard2, while SS4 has only shard2. So if the localities of the SSes are heavily skewed, say one fault-tolerance zone (by default the zoneId) has a lot of SSes while another has very few, then the zone with very few SSes may be forced to host more data. One event related to teams is TeamCollectionInfo. The event does not provide detailed team information, but the overview info it provides can sometimes be useful.
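The team example above can be sketched directly; the team and shard names are the hypothetical ones from the text, just showing how per-SS shard counts fall out of team assignments.

```python
# Hypothetical layout mirroring the example: two teams sharing SS1 and SS2.
teams = {
    "Team1": {"SS1", "SS2", "SS3"},
    "Team2": {"SS1", "SS2", "SS4"},
}
shard_assignment = {"shard1": "Team1", "shard2": "Team2"}

def shards_per_ss(teams, shard_assignment):
    """Count how many shards each storage server ends up replicating."""
    counts = {}
    for shard, team in shard_assignment.items():
        for ss in teams[team]:
            counts[ss] = counts.get(ss, 0) + 1
    return counts
```

Running this gives SS1 and SS2 two shards each but SS3 and SS4 only one, which is exactly the kind of skew that locality imbalance can force on a zone with few SSes.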