Used disk space dramatically increases while the sum of key-value sizes stays constant

We are testing FDB in the double ssd configuration on three machines, with the following scenario (a rough sketch in code follows the list):

  1. The first process continuously puts new key-values with fixed parallelism (64 threads).
  2. The second process deletes each key-value 5 hours after it was inserted.
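
A minimal sketch of what the two processes do, using the FDB Python binding (the key layout, value sizes, API version, and the scheduling of the 5-hour deletes are simplified placeholders, not our actual test code):

import fdb

fdb.api_version(600)   # use whatever API version matches your cluster
db = fdb.open()

@fdb.transactional
def put(tr, key, value):
    # Writer process: 64 threads call this in a loop with new keys.
    tr[key] = value

@fdb.transactional
def clear_one(tr, key):
    # Deleter process: called for each key ~5 hours after it was inserted.
    del tr[key]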

After two days of running this test there was no free space left on disk, and performance had significantly decreased and become unstable.

In more detail: during the first two days the put rate was ~20000 per second. The sum of key-value sizes remained roughly constant (~70 GB), but the used disk space grew from 200 GB to 1500 GB until no free space was left on the disks. After that the test continued to run, but the put rate dropped to 4000-10000 per second and was very unstable.

Our expectations for this scenario are:

  1. Used disk space will not keep growing until there is no free space left on disk.
  2. The performance will be stable.

What should we do to achieve this?

Are you using range clears to delete your keys? Range clears are designed to be applied quickly but defer the time-consuming cleanup work. It’s possible that your workload is writing data faster than it’s actually being cleared off the disk.
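
In case it helps, here is roughly what the two look like in e.g. the Python binding (key names are just placeholders):

import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def clear_one(tr, key):
    tr.clear(key)               # point clear: removes a single key

@fdb.transactional
def clear_many(tr, begin, end):
    tr.clear_range(begin, end)  # range clear: cheap to issue, but the on-disk cleanup is deferred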

There are some knobs that we could play with to increase the rate that deferred cleanup work is done, but I think it may come at the cost of increased latencies. Another option is to use more disks (or maybe even more storage processes per storage disk) to increase the amount of total effort applied to cleaning up cleared data.

No, we do not use range clears. All keys are cleared individually.

Thank you for the advice. We will try to use more storage processes per storage disk.

I’m not sure if the same thing applies to point clears. It’s possible doing a sufficient number of them can lead to similar behavior.

We have found that in our case FDB disk garbage collection keeps up at rates below 8000 puts+clears per second. At higher rates the used disk space begins to grow. When we doubled the number of storage processes, this rate limit increased by only 15%, so a significant increase in throughput cannot be achieved simply by adding more storage processes.

Could you please tell us which knobs to use and how?
We found the following in knobs.h but have no idea how to use them:
double CLEANING_INTERVAL;
double SPRING_CLEANING_TIME_ESTIMATE;
double SPRING_CLEANING_VACUUMS_PER_LAZY_DELETE_PAGE;
int SPRING_CLEANING_MIN_LAZY_DELETE_PAGES;
int SPRING_CLEANING_MAX_LAZY_DELETE_PAGES;
int SPRING_CLEANING_LAZY_DELETE_BATCH_SIZE;
int SPRING_CLEANING_MIN_VACUUM_PAGES;
int SPRING_CLEANING_MAX_VACUUM_PAGES;

I don’t know offhand exactly what kind of rates are reasonable to expect per storage server, but it’s probably more useful to measure these in bytes/s (rather than ops/s) because the cleanup happens in units of 4K pages. It’s also worth noting that the size of the data on disk is subject to some overhead and possibly replication, depending on your configuration, so there are other variables which can affect the cleanup speed relative to the size of the kv-pairs being inserted.
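
As a very rough illustration using the numbers from your earlier posts (treat these as estimates only; the real multiplier depends on storage engine overhead):

puts_per_sec = 20000                        # reported insert rate
ttl_sec = 5 * 3600                          # keys are cleared ~5 hours after insertion
live_keys = puts_per_sec * ttl_sec          # ~360 million keys live at steady state
avg_kv_bytes = 70e9 / live_keys             # ~70 GB of live data -> roughly 195 bytes per pair
logical_rate = puts_per_sec * avg_kv_bytes  # ~3.9 MB/s of logical writes (and, at steady state, clears)
# With double replication (as in the double ssd configuration), roughly twice that
# has to be written to disk and later reclaimed, in units of 4K pages.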

I would expect that you’d see a much better improvement than this in general. Assuming that your activity is well distributed throughout the cluster (i.e. the cleanup is happening everywhere rather than just a few storage servers), when adding storage servers using separate disks I’d expect the improvement to be roughly linear.

For storage servers that share a disk, I would still expect you to see better scaling than that unless your disks are fairly busy, as the cleanup is only allotted something like 10ms every second to run. If they are indeed too busy to support the extra work, it’s possible that changing the knobs will have a similarly reduced effect, but maybe not.
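
To put a rough number on that budget (using the default knob values described below; the real duty cycle depends on how long each batch of work actually takes):

cleaning_interval = 1.0    # seconds between spring cleaning rounds (default)
time_estimate = 0.010      # seconds of cleanup work per round (default)
duty_cycle = time_estimate / cleaning_interval   # ~1% of wall time per storage server
# Halving CLEANING_INTERVAL to 0.5 roughly doubles this budget, likely at the
# cost of more frequent interference with foreground reads and writes.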

One other possibility is that if you added extra storage processes to the same disk which already had large files taking up most of the disk from the existing storage processes, then the bulk of the data is being held in those original files, making them responsible for most of the cleanup work. Data movement should eventually be able to balance things out (assuming there’s enough free space to actually move data around), but that activity generates its own cleanup work which could skew your measurements.

All of the knobs you listed relate to this cleanup process. Probably the easiest one to change is CLEANING_INTERVAL.

CLEANING_INTERVAL - how often the cleaning gets run (default 1s)
SPRING_CLEANING_TIME_ESTIMATE - how long to run spring cleaning each time (default 10ms)
SPRING_CLEANING_MIN_LAZY_DELETE_PAGES - minimum pages to lazily delete each round (default 0)
SPRING_CLEANING_MAX_LAZY_DELETE_PAGES - maximum pages to lazily delete each round (default 1e9)
SPRING_CLEANING_LAZY_DELETE_BATCH_SIZE - number of pages to lazily delete in a batch (default 100)
SPRING_CLEANING_MIN_VACUUM_PAGES - minimum number of pages to vacuum each round (default 1)
SPRING_CLEANING_MAX_VACUUM_PAGES - maximum number of pages to vacuum each round (default 1e9)
SPRING_CLEANING_VACUUMS_PER_LAZY_DELETE_PAGE - how many pages to vacuum per page that is lazily deleted (default 0). Used to control which process gets priority, which by default will be lazy deletion.

Note that if a single lazy delete batch takes longer than the time estimate, changing the time estimate may not have much effect. For example, if a batch takes 20ms and the time estimate is 10ms, then increasing the estimate to 20ms will still only support a single batch.

Lazy deletion is the process of making cleared pages available for reuse within the file. Vacuuming is the process of returning unused pages that have already been deleted by the lazy deletion process to the OS.

To set a knob on the server process, use the --knob argument to fdbserver. For example, you could use --knob_cleaning_interval 0.5 to run spring cleaning twice as often.
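
If your fdbserver processes are started by fdbmonitor, I believe the equivalent is to add the knob under the [fdbserver] section of foundationdb.conf, which passes it through as that command line argument:

[fdbserver]
knob_cleaning_interval = 0.5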

Be warned that changing a knob’s value is often not well tested. I think we’ve toyed with some of these particular knobs before and noticed an impact on latencies. There may be other issues lurking, so use caution when changing them.