Hi, I’m trying to debug uneven data distribution in my cluster. The cluster initially had 4 storage servers, all on one data drive, and last week I added 4 new storage servers on a new drive. Ever since repartitioning finished, stored_bytes has been higher on the 4 new servers. This makes the IO load much heavier on the new drive.
One other thing I’ve been looking at is the slow reclamation of space after expanding a cluster, so yesterday I made spring cleaning vacuuming more aggressive on the old storage servers. Since doing that, I’ve noticed stored_bytes start to even out between the servers (as in, data from the newer servers started moving back to the older servers). I started looking at DataDistribution to see how space taken by sqlite free pages contributes to the utilization of each server, and I don’t think I’m reading this correctly.
I’m looking at the getLoadBytes logic: https://github.com/apple/foundationdb/blob/a9366f39b59453ee0bf0d4e08c7a556b62ec898f/fdbserver/DataDistribution.actor.cpp#L221-L236
There, load is defined as:
(physicalBytes + (inflightPenalty*inFlightBytes)) * availableSpaceMultiplier
My understanding is:
- Physical bytes: the sum of the sizes of all key/values owned by this team (https://github.com/apple/foundationdb/blob/a9366f39b59453ee0bf0d4e08c7a556b62ec898f/fdbserver/StorageMetrics.actor.h#L388). I am using single replication, so I assume every storage server is its own team.
- Available space multiplier: assuming the ratio of available space is above the available space ratio cutoff, this is (available space ratio cutoff) / (ratio of available space), so it reduces the load if the server has a massive amount of available space left. Available space seems to include space taken by free pages (https://github.com/apple/foundationdb/blob/master/fdbserver/KeyValueStoreSQLite.actor.cpp#L1973).
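To make my reading concrete, here is a minimal Python sketch of it. This is only my interpretation of the formula, not the actual FoundationDB code: the function and parameter names are made up, the cutoff is a placeholder rather than the real knob value, and I am not modeling any clamping the real getLoadBytes may apply.

```python
def available_ratio(os_free_bytes, sqlite_free_page_bytes, total_bytes):
    # Assumption from my reading: free pages inside the sqlite file count as
    # available space, exactly like free space reported by the OS.
    return (os_free_bytes + sqlite_free_page_bytes) / total_bytes


def load_bytes(physical_bytes, in_flight_bytes, avail_ratio, cutoff,
               inflight_penalty=1.0):
    # Multiplier as I described it: cutoff / ratio of available space.
    # Any clamping done by the real code is deliberately not modeled.
    multiplier = cutoff / avail_ratio
    return (physical_bytes + inflight_penalty * in_flight_bytes) * multiplier


CUTOFF = 0.25   # placeholder, not the real knob value
TOTAL = 1000    # disk size in arbitrary units
STORED = 600    # stored_bytes for one server

# Before vacuuming: 150 units sit in sqlite free pages, 250 are free at the OS level.
before = load_bytes(STORED, 0, available_ratio(250, 150, TOTAL), CUTOFF)
# After vacuuming: the same 150 units have been returned to the OS.
after = load_bytes(STORED, 0, available_ratio(400, 0, TOTAL), CUTOFF)

print(before, after, before == after)
```

If this reading were right, vacuuming would only move bytes from one "available" bucket to another, so the computed load would be identical before and after.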
So of the two metrics it uses, available_bytes and stored_bytes, neither should (?) be affected by vacuuming, since sqlite free pages and free OS space both contribute to available_bytes. Then why did the distribution start evening out after I started vacuuming? And since at some point the servers had reached equal data distribution, why did it keep streaming data to the newer servers past that point?
On a side note, I’ve noticed that after the cluster expansion, basically no disk space was being reclaimed with the default spring cleaning settings (even weeks or months after the expansion). I set VACUUMS_PER_LAZY_DELETE_PAGE to 1 and it started reclaiming space at a steady pace. Why is it set to 0 by default?