Used disk space dramatically increases while sum of key-value sizes is constant

I don’t know offhand exactly what rates are reasonable to expect per storage server, but it’s probably more general to measure them in bytes/s (rather than ops/s), because the cleanup happens in units of 4K pages. Also note that the size of the data on disk is subject to some overhead and possibly replication, depending on your configuration, so there are other variables that can affect the speed relative to the size of the kv-pairs being inserted.
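
To make the bytes/s framing concrete, here’s a back-of-the-envelope sketch. The 1.3x overhead factor and triple replication are illustrative assumptions, not FoundationDB constants:

```python
# Back-of-the-envelope sketch (overhead factor and replication are assumptions):
# translate logical key-value bytes into an approximate on-disk footprint and a
# cleanup backlog measured in 4K pages.
PAGE_SIZE = 4096  # the storage engine reclaims space in units of 4K pages

def estimated_disk_bytes(kv_bytes, replication_factor=3, overhead_factor=1.3):
    """Logical key-value bytes -> approximate bytes stored across the cluster."""
    return kv_bytes * replication_factor * overhead_factor

def pages_to_reclaim(kv_bytes_cleared, overhead_factor=1.3):
    """Approximate number of 4K pages one storage server must clean up
    after that much logical data is cleared from it."""
    return int(kv_bytes_cleared * overhead_factor) // PAGE_SIZE

print(estimated_disk_bytes(10 * 2**30))  # ~42 GB on disk for 10 GiB of KV data
print(pages_to_reclaim(10 * 2**30))      # ~3.4 million pages to clean on one server
```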

In general I would expect a much better improvement than this. Assuming that your activity is well distributed throughout the cluster (i.e. the cleanup is happening everywhere rather than on just a few storage servers), I’d expect the improvement to be roughly linear when you add storage servers on separate disks.

For storage servers that share a disk, I would still expect better scaling than that unless your disks are fairly busy, since the cleanup is only allotted something like 10ms out of every second to run. If the disks are indeed too busy to support the extra work, it’s possible that changing the knobs will have a similarly reduced effect, but maybe not.
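
As a rough illustration of why that 10ms budget matters, here’s some duty-cycle arithmetic. The per-page cleanup rate below is a made-up number, not a measurement:

```python
# Rough duty-cycle arithmetic (the "while cleaning" rate is a made-up number).
CLEANING_INTERVAL = 1.0    # seconds between spring cleaning rounds (default)
TIME_ESTIMATE = 0.010      # seconds of cleanup work budgeted per round (default)

duty_cycle = TIME_ESTIMATE / CLEANING_INTERVAL   # ~1% of wall-clock time
assumed_pages_per_second_while_cleaning = 5000   # hypothetical rate while actively cleaning

pages_per_second = duty_cycle * assumed_pages_per_second_while_cleaning
print(pages_per_second * 4096 / 1e6, "MB/s reclaimed per storage process")  # ~0.2 MB/s
```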

One other possibility: if you added the extra storage processes to a disk that was already mostly filled by the existing storage processes’ large files, then the bulk of the data is still held in those original files, making them responsible for most of the cleanup work. Data movement should eventually balance things out (assuming there’s enough free space to actually move data around), but that activity generates its own cleanup work, which could skew your measurements.

All of the knobs you listed relate to this cleanup (“spring cleaning”) process; probably the easiest one to change is CLEANING_INTERVAL. A rough sketch of how the page-count knobs combine follows the list.

CLEANING_INTERVAL - how often the cleaning gets run (default 1s)
SPRING_CLEANING_TIME_ESTIMATE - how long to run spring cleaning each time (default 10ms)
SPRING_CLEANING_MIN_LAZY_DELETE_PAGES - minimum pages to lazily delete each round (default 0)
SPRING_CLEANING_MAX_LAZY_DELETE_PAGES - maximum pages to lazily delete each round (default 1e9)
SPRING_CLEANING_LAZY_DELETE_BATCH_SIZE - number of pages to lazily delete in a batch (default 100)
SPRING_CLEANING_MIN_VACUUM_PAGES - minimum number of pages to vacuum each round (default 1)
SPRING_CLEANING_MAX_VACUUM_PAGES - maximum number of pages to vacuum each round (default 1e9)
SPRING_CLEANING_VACUUMS_PER_LAZY_DELETE_PAGE - how many pages to vacuum per page that is lazily deleted (default 0). This controls which process gets priority; by default, lazy deletion takes priority.
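
Here’s the rough sketch promised above of how the page-count knobs might combine in a single round. This is a simplified mental model based on the knob descriptions, not a transcription of the fdbserver code:

```python
# Simplified model (an assumption, not the actual fdbserver logic): the time
# budget determines how many pages a round would process, and the min/max
# knobs clamp that number.
def pages_this_round(pages_doable_in_budget, min_pages, max_pages):
    return max(min_pages, min(max_pages, pages_doable_in_budget))

# Defaults: lazy delete is clamped to [0, 1e9], vacuuming to [1, 1e9]
lazy_delete = pages_this_round(100, min_pages=0, max_pages=int(1e9))  # 100
vacuum      = pages_this_round(100, min_pages=1, max_pages=int(1e9))  # 100

# Raising SPRING_CLEANING_MIN_LAZY_DELETE_PAGES forces progress even when the
# 10ms budget would otherwise allow very little work:
print(pages_this_round(100, min_pages=1000, max_pages=int(1e9)))      # 1000
```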

Note that if a single lazy delete batch takes longer than the time estimate, changing the time estimate may not have much effect. For example, if a batch takes 20ms and the time estimate is 10ms, then increasing the estimate to 20ms will still only support a single batch.
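
To illustrate that batch granularity, here’s a toy calculation under the simplifying assumption that a new batch starts only while the elapsed time in a round is still under the estimate:

```python
def batches_per_round(time_estimate_ms, batch_duration_ms):
    """Simplified model: at least one batch always runs, and a new batch only
    starts while the elapsed time is still under the time estimate."""
    elapsed, batches = 0, 0
    while batches == 0 or elapsed < time_estimate_ms:
        batches += 1
        elapsed += batch_duration_ms
    return batches

for estimate in (10, 20, 40):
    print(estimate, "ms estimate ->", batches_per_round(estimate, 20), "batch(es)")
# 10 ms estimate -> 1, 20 ms estimate -> 1, 40 ms estimate -> 2
```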

Lazy deletion is the process of making cleared pages available for reuse within the file. Vacuuming is the process of returning pages that lazy deletion has already freed back to the OS.
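
For intuition, a toy state model of that two-stage reclamation (illustrative only; the real bookkeeping inside the storage engine is more involved):

```python
# Toy model of the two stages (illustrative only, not the engine's actual code).
class StorageFile:
    def __init__(self, live_pages):
        self.live = live_pages        # pages holding current data
        self.cleared = 0              # cleared pages not yet processed by cleanup
        self.reusable = 0             # pages lazy deletion made reusable in-file
        self.file_pages = live_pages  # pages the file occupies on disk

    def clear_range(self, pages):
        self.live -= pages
        self.cleared += pages         # file size on disk is unchanged

    def lazy_delete(self, pages):
        pages = min(pages, self.cleared)
        self.cleared -= pages
        self.reusable += pages        # still unchanged on disk, but reusable for new writes

    def vacuum(self, pages):
        pages = min(pages, self.reusable)
        self.reusable -= pages
        self.file_pages -= pages      # only now is space returned to the OS

f = StorageFile(live_pages=1_000_000)   # ~4 GB of 4K pages
f.clear_range(400_000)                  # disk usage unchanged
f.lazy_delete(400_000)                  # still unchanged, but pages can be reused
f.vacuum(400_000)                       # file shrinks; space goes back to the OS
print(f.file_pages)                     # 600000
```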

To set a knob on the server process, use the corresponding --knob_ argument to fdbserver. For example, you could pass --knob_cleaning_interval 0.5 to run spring cleaning twice as often.
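
As a sketch of assembling those arguments (the knob names come from the list above; the particular values are only examples, and the rest of the fdbserver command line is omitted):

```python
# Sketch: build the --knob_* arguments to append to an fdbserver command line.
knob_overrides = {
    "cleaning_interval": 0.5,                       # run spring cleaning twice as often
    "spring_cleaning_lazy_delete_batch_size": 200,  # example: larger lazy delete batches
}

knob_args = [f"--knob_{name} {value}" for name, value in knob_overrides.items()]
print(" ".join(knob_args))
# --knob_cleaning_interval 0.5 --knob_spring_cleaning_lazy_delete_batch_size 200
```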

Be warned that changing a knob’s value is often not well tested. I think we’ve toyed with some of these particular knobs before and noticed an impact on latencies. There may be other issues lurking, so use caution when changing them.