Repartitioning after cluster expansion results in uneven data distribution

Hi, I’m trying to debug uneven data distribution in my cluster. The cluster initially had 4 storage servers, all on one data drive, and last week I added 4 new storage servers on a new drive. Ever since the repartitioning finished, stored_bytes has been higher on the 4 new servers, which makes the IO load much heavier on the new drive.

I’ve also been looking at the slow reclamation of space after expanding a cluster, so yesterday I made spring cleaning vacuuming more aggressive on the old storage servers. Since then, stored_bytes has started to even out between the servers (that is, data from the newer servers started moving back to the older ones). I started reading DataDistribution to see how the space taken by sqlite free pages contributes to each server’s utilization, and I don’t think I’m interpreting it correctly.

I’m looking at the getLoadBytes logic: https://github.com/apple/foundationdb/blob/a9366f39b59453ee0bf0d4e08c7a556b62ec898f/fdbserver/DataDistribution.actor.cpp#L221-L236

where load is defined by

`(physicalBytes + inflightPenalty * inFlightBytes) * availableSpaceMultiplier`
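
To check my reading, here is a minimal sketch of that calculation as I understand it; the knob name, cutoff value, and epsilon are my assumptions, not copied from the source:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the load calculation as I read it; not the verbatim source.
//   physicalBytes          : stored_bytes reported for the team
//   inFlightBytes          : data currently being relocated to the team
//   minAvailableSpaceRatio : lowest available/total space ratio across the
//                            team's servers
int64_t getLoadBytesSketch(int64_t physicalBytes,
                           int64_t inFlightBytes,
                           double minAvailableSpaceRatio,
                           double inflightPenalty = 1.0) {
    // Assumed cutoff knob: once a team's available-space ratio drops below
    // this, the multiplier grows above 1 and the team looks "heavier",
    // steering new data away from it.
    const double AVAILABLE_SPACE_RATIO_CUTOFF = 0.05;

    double availableSpaceMultiplier =
        AVAILABLE_SPACE_RATIO_CUTOFF /
        std::max(std::min(AVAILABLE_SPACE_RATIO_CUTOFF, minAvailableSpaceRatio),
                 1e-6);

    return (physicalBytes + inflightPenalty * inFlightBytes) *
           availableSpaceMultiplier;
}
```

The cutoff value above is a placeholder; the point is just the shape: the multiplier stays at 1 while available space is healthy, and inflates rapidly once it gets scarce.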

My understanding is this: of the two metrics it uses, available_bytes and stored_bytes, neither should (?) be affected by vacuuming, since sqlite free pages and free OS space both count toward available_bytes. If that’s right, why did the distribution start evening out after I began vacuuming? And once the servers had reached equal data distribution, why did data keep streaming to the newer servers?
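
To make the “shouldn’t be affected” claim concrete, here is the invariant I’m assuming (the field names are made up for illustration; I’m going by the premise above that both free pages and OS free space count toward available_bytes):

```cpp
#include <cassert>
#include <cstdint>

int main() {
    // Hypothetical breakdown of one storage server's disk, in bytes.
    int64_t osFreeBytes         = 100LL << 30; // free space the OS reports
    int64_t sqliteFreePageBytes =  50LL << 30; // reusable pages inside the .sqlite file

    int64_t availableBefore = osFreeBytes + sqliteFreePageBytes;

    // Vacuuming should only move bytes from one bucket to the other:
    // free sqlite pages are truncated and handed back to the OS.
    int64_t vacuumed = 50LL << 30;
    sqliteFreePageBytes -= vacuumed;
    osFreeBytes         += vacuumed;

    // ...so available_bytes, as I understand it, should be unchanged.
    assert(availableBefore == osFreeBytes + sqliteFreePageBytes);
}
```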

On a side note, I’ve noticed that after a cluster expansion, essentially no disk space was being reclaimed with the default spring cleaning settings, even weeks or months later. Once I set VACUUMS_PER_LAZY_DELETE_PAGE to 1, space started being reclaimed at a steady pace. Why is it 0 by default?
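
For reference, this is how I changed it, via a knob in foundationdb.conf (fdbmonitor passes this to each fdbserver process as a --knob_ argument):

```ini
[fdbserver]
knob_vacuums_per_lazy_delete_page = 1
```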
