Is it possible to enable `perpetual_storage_wiggle` at the 'shard' level instead of the 'process' level?

Our engineers have just deployed a data pattern that we’ve not seen before, where they create a very large amount of new data in FDB on a periodic basis, and then clean up old versions of that data. It takes a comparatively long time to build a stable view of the data, and handling mutations to the ‘live’ dataset is complex and error-prone, so they build a completely new dataset under a new prefix each period (day, week, whatever), adjust a single key which tells the rest of the system which copy to use once the new dataset is stable, and then delete the entire prefix of the old version.
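Roughly, the pattern looks like this (an illustrative sketch with the Python bindings; the pointer key and prefixes are made-up names, not our real schema):

```python
# Illustrative sketch of the build-then-swap pattern described above.
import fdb

fdb.api_version(730)
db = fdb.open()

CURRENT_POINTER = b"current_dataset_prefix"  # the single key readers follow


@fdb.transactional
def publish_version(tr, new_prefix):
    # Once the dataset under new_prefix is fully built and verified,
    # atomically repoint readers at it.
    tr[CURRENT_POINTER] = new_prefix


@fdb.transactional
def drop_version(tr, old_prefix):
    # Deleting the old copy is a single range clear; the question in this
    # thread is what happens to those cleared bytes on disk afterwards.
    tr.clear_range_startswith(old_prefix)


# e.g. publish_version(db, b"dataset/2024-06-02/")
#      drop_version(db, b"dataset/2024-06-01/")
```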

What we are seeing, when this feature is turned on, is our disk usage just constantly increasing even though they are supposedly deleting old prefixes. My investigations of this led me to Data retention after deleting a key range using SSD engine - #6 by gaurav, where I found out that neither ssd-2 (what we were using until a few months ago) nor ssd-redwood-1 (what we are using now, on FDB 7.3) cleans up old data on disk by default. The data is retained in the on-disk B+Tree, and only actually compacted/deleted when the tree is completely rebuilt as part of the process being excluded.

Based on that, enabling `perpetual_storage_wiggle=1 storage_migration_type=gradual` should help with our erroneous data retention issues. My understanding (including from when we enabled it for our ssd-2 to ssd-redwood-1 migration) is that it effectively excludes one storage process at a time, allows it to empty and completely wipe its disk, and then re-includes it so that data is migrated back via normal rebalancing. As the data is migrated away from the process (and the same or other data is later migrated back to it), it is compacted.
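For reference, we applied it with the documented configure command (a sketch; the cluster file path is just an example):

```python
# Enable the gradual, wiggle-driven migration via fdbcli.
import subprocess

subprocess.run(
    ["fdbcli", "-C", "/etc/foundationdb/fdb.cluster", "--exec",
     "configure perpetual_storage_wiggle=1 storage_migration_type=gradual"],
    check=True,
)
```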

The problem is that this means we have to maintain a very high overhead of ‘available’ disk. In our dev clusters in particular, we don’t need high performance but we do have quite a lot of data. So we only have 2 storage processes, each on its own EC2 instance, per replication boundary (we’re running in three_data_hall mode, and our replication boundary is the AWS AZ). That means that each process needs to maintain >50% free disk space to be able to hold the entirety of the data from the other process when it is wiggled.

I’m also not clear on what would happen to the temporarily-moved data after the wiggle is complete. If process A is wiggled and all its data is migrated to process B such that process B is now, say, 80% full on disk, and process A then wipes its on-disk copies and is re-included so that the K/V data is redistributed between both processes again, will process B remain at 80% disk usage because it hasn’t been completely excluded? With only 2 storage processes per replication boundary, are we effectively just bouncing that very high usage between them as they are wiggled? Do we need to move to a pattern of more horizontal scaling, where we have more processes in a replication boundary, so that when one is wiggled the data from it is distributed between the others?
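A rough back-of-the-envelope for that last question (made-up numbers, ignoring BTree slack, shard granularity and rebalancing lag): when one of N storage processes in a replication boundary is wiggled, its data has to fit on the other N − 1, so the extra free space each survivor needs shrinks quickly as you add processes.

```python
# Rough headroom model for the wiggle scenario described above.
def extra_free_fraction_needed(n_processes: int, used_fraction: float) -> float:
    """Fraction of each remaining disk that must be free to absorb one
    wiggled process whose disk is used_fraction full."""
    return used_fraction / (n_processes - 1)


for n in (2, 3, 4, 6):
    print(n, "processes ->", round(extra_free_fraction_needed(n, 0.45), 3))
# 2 processes -> 0.45   (hence the >50% free headroom with only two)
# 3 processes -> 0.225
# 4 processes -> 0.15
# 6 processes -> 0.09
```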

In Redwood: perpetuum moving data between storage servers - #8 by SteavedHams @SteavedHams does say that:

These decisions make sense in the context of a production FDB cluster with dedicated data volumes, as giving space back to the filesystem is unnecessary (it will be reused within Redwood) and FDB Data Distribution will move/balance key ranges around between Storage Servers which will effectively remove BTree slack and keep logical space usage roughly balanced.

But surely that’s only true when the data being moved/rebalanced sits in the same place in the BTree? If FDB detects a ‘hot range’ or similar and decides to split it into 2 shards, one of which it moves to another storage process, then it’s quite unlikely that anything else will move onto this process and take up the same ‘space’ in the tree?

Aah, from Maximum file size on ext4 - #3 by SteavedHams

One reason for this is that FDB shuffles data around a good bit in response to writes, and any time the cluster relocates a data range from one Redwood instance to another it is effectively “compacted” as the destination will have a condensed low-slack (or even negative slack) subtree of that data and the source will delete the subtree completely which frees all of its blocks.

So if a shard is moved, the entire subtree moves with it, and disk space can be (and is) reclaimed on the origin process?

I can’t answer the wiggle part, but we were facing a similar issue (a workload that continuously inserts new data but removes old data at the tail) that turned out to be related to how Redwood handles page slack on page splits.

This is true, but when you wipe the data, the deleted pages should be immediately available to store new data, even though they are not returned to the OS.

You can validate this by opening up the trace logs from a storage server and looking for StorageMetrics events, where KvstoreBytesFree will be smaller than KvstoreBytesAvailable. I think this is also exposed in status json for the processes.
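Something like this can pull those two numbers out of status json for each storage role (an untested sketch; the per-role field names come from the machine-readable status and may differ between versions):

```python
# Compare OS-level free space with Redwood's reusable space per storage role.
import json
import subprocess

# Shell out to fdbcli for machine-readable status; skip any leading chatter.
out = subprocess.run(
    ["fdbcli", "--exec", "status json"],
    check=True, capture_output=True, text=True,
).stdout
status = json.loads(out[out.index("{"):])

for pid, proc in status["cluster"]["processes"].items():
    for role in proc.get("roles", []):
        if role.get("role") != "storage":
            continue
        # kv_store_free_bytes tracks free space on the volume, while
        # kv_store_available_bytes also counts deleted pages that can be reused.
        print(pid,
              "free:", role.get("kv_store_free_bytes"),
              "available:", role.get("kv_store_available_bytes"))
```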

It is easiest to validate when there is a log and a storage role on the same disk: the size available for the log will reflect the OS free space, but the size available for the storage role will include the deleted (now free) pages.

My understanding is that if you have 100GB of old data and add 100GB of new data, your OS will report 200GB used; when you delete the old set, this won’t change. But when you insert the next batch, it should be able to use the free pages, and disk usage should not grow much beyond 200GB.

Wow. That is a very useful thread. I have a lot of stats to go and fetch and several knobs to try tweaking based on the output. Thanks very much!

No, this is not the case. I’m not sure what your mental model of a BTree is but there is no design I am familiar with where freed nodes are restricted to only containing data within the range of values which they previously stored.

A BTree is a collection of nodes which link together in a tree topology such that there is a Root node which links to child nodes which in turn link to more child nodes until a Leaf node is reached after H links where H is the height (or number of levels) of the tree. A particular node’s location in the tree is based on which node links to it, and the node links are arranged such that there is a total ordering of records and only 1 valid location for any particular record.

In Redwood, removal of a large data range means unlinking its subtree equivalent and placing the blocks which held the nodes of that subtree onto a free block queue which is popped from when space for a node is needed such as when the BTree grows.

As a concrete example, if you delete 1GB of records between keys A and B, your data file will not shrink but the space those records used will exist on an internal free list. If you then insert 1GB of records between keys X and Y, it will reuse the previously freed space so your data file will not need to grow.

The reason the data file does not shrink when it contains a lot of free space is that the free space is generally located throughout the file (aka “fragmented”) so there is no cheap way to truncate the file as first the free space must be consolidated to the end of the file.
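If it helps, here is a toy model of that free-list behavior (not Redwood’s actual pager, just the shape of what is described above: deletes push blocks onto a free queue, and later growth pops from it before the file is extended):

```python
# Toy model: freed blocks are reused before the data file grows.
from collections import deque


class ToyPager:
    def __init__(self):
        self.file_blocks = 0        # size of the data file, in blocks
        self.free_queue = deque()   # blocks available for reuse

    def allocate_block(self):
        if self.free_queue:
            return self.free_queue.popleft()   # reuse: the file does not grow
        self.file_blocks += 1                  # otherwise extend the file
        return self.file_blocks - 1

    def free_block(self, block):
        self.free_queue.append(block)          # the file never shrinks here


pager = ToyPager()
a_to_b = [pager.allocate_block() for _ in range(1000)]   # records between A and B
for blk in a_to_b:
    pager.free_block(blk)                                 # delete the A..B range
x_to_y = [pager.allocate_block() for _ in range(1000)]    # insert between X and Y
print(pager.file_blocks)   # still 1000: freed blocks were reused, no growth
```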
