Debugging Data Distribution

panghy · October 25, 2018, 3:02am

I’d like to know how I can go about debugging why FDB is or is not moving data. A lot of times a particular process would show up with a ton of CPU and storage queues would go up (throughput suffers), we can pull back on the traffic but DD doesn’t seem to be moving data away from it (I know for a fact that if i just kill -9 the process, the cluster immediately behaves and since this is a double redundancy cluster, it’s not a write hotspot since exactly one SS is hot).

panghy · October 25, 2018, 3:03am

To be fair, these are clusters that are artificially sized to take 1M+ writes (with proxies/resolvers/logs to match)

Evan · November 13, 2018, 5:09am

FoundationDB will only move data for a few reasons:

to restore replication after a failure
to split or merge a shard to a size between ~125MB to 500MB (this range is smaller grows and shrinks with database size, these numbers are the max for the largest databases)
to split or merge a shard because of a lot of recent writes (write hotspot)
to balance the bytes stored across the storage servers

It does not balance based on high read traffic. When it moves a shard it only considers bytes stored, not write traffic.

This means it is possible to randomly assign the same storage server to multiple high read or write traffic ranges.

Your best bet is to look at the StorageMetrics on the hot storage server to see if you observe higher read or write traffic compared to other servers.

You can also look at MovingData and TotalDataInFlight, to see the type of work currently being done by data distribution.

RelocateShardStartSplitx100 will give you a sampled view of why shards are being split, and RelocateShardMergeMetrics will be logged every time two shards are merged.

TeamHealthChanged will tell you when a team of servers has become unhealthy.

panghy · November 14, 2018, 12:30am

Thanks @Evan, yeah, it’s mostly the case that a single process gets too hot and the only recourse is to exclude it and it mostly works. Just wanted to see if there’s a more gentle way to nudging the cluster to move the range.

Topic		Replies	Views
Data distribution / Disk usage uneven: bifurcated at 2 tiers Using FoundationDB	6	528	August 6, 2020
Data distribution and rebalancing Using FoundationDB	3	1308	October 28, 2023
Constantly repartitioning (under no load) and moving large volumes of data Using FoundationDB	15	1621	January 15, 2020
Unexpected repartitioning of the database Using FoundationDB performance	5	1028	July 9, 2020
How to Correctly Detect Data Synchronization Finished with Moving Data In-Flight and Moving Data In-Queue? Running FoundationDB	1	549	February 23, 2022

Debugging Data Distribution

Related topics