Is it possible to enable `perpetual_storage_wiggle` at the 'shard' level instead of the 'process' level?

Our engineers have just deployed a data pattern that we’ve not seen before, where they create a very large amount of new data in FDB on a periodic basis, and then clean up old versions of that data. It takes a comparatively long time to build a stable view of the data, and handling mutations to the ‘live’ dataset is complex and error prone, so they build a completely new dataset under a new prefix each (day, week, whatever), adjust a single key which tells the rest of the system which copy to use once it’s stable, and then delete the entire prefix of the old version.
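For concreteness, here is roughly that rotation pattern in Python-binding terms; the key names (`dataset/current`, `dataset/v00042/`) are made up for illustration and are not our real schema:

```python
# Sketch of the rotation pattern described above, using the FDB Python bindings.
# Key names are illustrative only.
import fdb

fdb.api_version(730)  # assuming the 7.3 client
db = fdb.open()

CURRENT_POINTER = b"dataset/current"  # single key the rest of the system reads

@fdb.transactional
def publish(tr, new_prefix):
    # Once the new copy is stable, point readers at it by flipping one key.
    tr[CURRENT_POINTER] = new_prefix

@fdb.transactional
def drop_old(tr, old_prefix):
    # Delete the entire prefix of the previous version.
    tr.clear_range_startswith(old_prefix)

# ... bulk-load the new dataset under b"dataset/v00042/" in batched transactions ...
publish(db, b"dataset/v00042/")
drop_old(db, b"dataset/v00041/")
```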

What we are seeing, now that this pattern is live, is our disk usage constantly increasing even though they are supposedly deleting old prefixes. My investigation led me to Data retention after deleting a key range using SSD engine - #6 by gaurav, where I found out that neither ssd-2 (what we were using until a few months ago) nor ssd-redwood-1 (what we are using now, on FDB 7.3) cleans up old data on disk by default. The data is retained in the on-disk B+Tree, and only actually compacted/deleted when the tree is completely rebuilt as part of the process being excluded.

Based on that, enabling perpetual_storage_wiggle=1 storage_migration_type=gradual should help with our erroneous data retention issues, since my understanding (from when we enabled it for our ssd-2 → ssd-redwood-1 migration) is that it effectively excludes one storage process at a time, allows it to empty and completely wipe its disk, and then re-includes it so that data migrates back via normal rebalancing. As the data is migrated away from the process (and the same or other data is later migrated back to it), it is compacted.
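For reference, those options are set with an ordinary `configure` command; a minimal sketch, assuming `fdbcli` is on the PATH and picks up the default cluster file:

```python
# Sketch: apply the wiggle + gradual migration options via fdbcli.
import subprocess

subprocess.run(
    ["fdbcli", "--exec",
     "configure perpetual_storage_wiggle=1 storage_migration_type=gradual"],
    check=True,
)
```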

The problem is that this means we have to maintain a very high overhead of ‘available’ disk. In our dev clusters in particular, we don’t need high performance but we do have quite a lot of data. So we only have 2 storage processes, each on their own EC2 instance, per replication boundary (we’re running in three_data_hall mode, and our replication boundary is the AWS AZ). That means that each process needs to maintain >50% free disk space to be able to hold the entirety of the data from the other process when it is wiggled.
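A back-of-envelope illustration of that constraint (assuming data is perfectly balanced across processes and ignoring BTree slack and other overheads):

```python
# With N storage processes per replication boundary and one wiggled (emptied)
# at a time, the survivors must absorb its share, so pre-wiggle disk
# utilization has to stay at or below (N - 1) / N.
def max_safe_utilization(processes_per_boundary: int) -> float:
    n = processes_per_boundary
    return (n - 1) / n

print(max_safe_utilization(2))  # 0.5  -> each process needs >50% free space
print(max_safe_utilization(4))  # 0.75 -> only ~25% free space needed
```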

I’m also not clear on what happens to the temporarily-moved data after the wiggle is complete. Suppose process A is wiggled and all its data is migrated to process B, such that process B is now, say, 80% full on disk; process A then wipes its on-disk copies and is re-included, so the K/V data is redistributed between both processes again. Will process B remain at 80% disk usage because it hasn’t itself been completely excluded? With only 2 storage processes per replication boundary, are we effectively just bouncing that very high usage between them as they are wiggled? Do we need to move to a pattern of more horizontal scaling, with more processes per replication boundary, so that when one is wiggled its data is distributed between the others?

In Redwood: perpetuum moving data between storage servers - #8 by SteavedHams, @SteavedHams does say that

These decisions make sense in the context of a production FDB cluster with dedicated data volumes, as giving space back to the filesystem is unnecessary (it will be reused within Redwood) and FDB Data Distribution will move/balance key ranges around between Storage Servers which will effectively remove BTree slack and keep logical space usage roughly balanced.

But surely that’s only true when the data being moved/rebalanced sits in the same place in the BTree? If FDB detects a ‘hot range’ or similar and decides to split it into 2 shards, one of which it moves to another storage process, then it’s quite unlikely that anything else will move onto this process and take up the same ‘space’ in the tree?

Aah, from Maximum file size on ext4 - #3 by SteavedHams

One reason for this is that FDB shuffles data around a good bit in response to writes, and any time the cluster relocates a data range from one Redwood instance to another it is effectively “compacted” as the destination will have a condensed low-slack (or even negative slack) subtree of that data and the source will delete the subtree completely which frees all of its blocks.

So if a shard is moved, the entire subtree moves, and disk space can be (and is) reclaimed on the origin process?

I can’t answer the wiggle part, but we were facing a similar issue (a workload that continuously inserts new data but removes old data at the tail), which turned out to be related to how Redwood handles page slack on page splits.

This is true, but when you wipe the data, the deleted pages should be immediately available to store new data, even though they are not returned to the OS.

You can validate this by opening up the trace logs from a storage server and looking for StorageMetrics events, where KvstoreBytesFree will be smaller than KvstoreBytesAvailable. I think this is also exposed in status json for the processes.
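Something like this can pull those numbers out of the XML trace files (the path and attribute names are assumptions based on what I remember, so check them against your own logs):

```python
# Scan storage server trace files for StorageMetrics events and print the
# KvstoreBytesFree / KvstoreBytesAvailable attributes. Path and attribute
# names are assumptions; verify against your own trace output.
import glob
import re

FREE = re.compile(r'KvstoreBytesFree="(\d+)"')
AVAIL = re.compile(r'KvstoreBytesAvailable="(\d+)"')

for path in glob.glob("/var/log/foundationdb/trace.*.xml"):
    with open(path) as f:
        for line in f:
            if 'Type="StorageMetrics"' not in line:
                continue
            free, avail = FREE.search(line), AVAIL.search(line)
            if free and avail:
                # 'available' counts pages freed inside the data file, so it
                # can be larger than the OS-level free space.
                print(path, "free:", free.group(1), "available:", avail.group(1))
```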

It is easiest to validate when there is a log and a storage role on the same disk: the size available for the log will reflect the OS free space, while the size available for the storage will include the deleted (now free) pages.

My understanding is that if you have 100GB of old data and add 100GB of new data, your OS will report 200GB used. When you delete the old set, this won’t change; but when you insert the next batch, it should be able to reuse the free pages, so disk usage should not grow much beyond 200GB.


Wow. That is a very useful thread. I have a lot of stats to go and fetch and several knobs to try tweaking based on the output. Thanks very much!

No, this is not the case. I’m not sure what your mental model of a BTree is but there is no design I am familiar with where freed nodes are restricted to only containing data within the range of values which they previously stored.

A BTree is a collection of nodes which link together in a tree topology such that there is a Root node which links to child nodes which in turn link to more child nodes until a Leaf node is reached after H links where H is the height (or number of levels) of the tree. A particular node’s location in the tree is based on which node links to it, and the node links are arranged such that there is a total ordering of records and only 1 valid location for any particular record.

In Redwood, removal of a large data range means unlinking its subtree equivalent and placing the blocks which held the nodes of that subtree onto a free block queue which is popped from when space for a node is needed such as when the BTree grows.

As a concrete example, if you delete 1GB of records between keys A and B, your data file will not shrink but the space those records used will exist on an internal free list. If you then insert 1GB of records between keys X and Y, it will reuse the previously freed space so your data file will not need to grow.
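To illustrate with a toy model (this is just the free-list mechanics, not Redwood's actual allocator):

```python
# Toy pager: blocks freed by deleting a subtree go onto a free queue and are
# reused by later inserts, so the data file only grows when the queue is empty.
from collections import deque

class ToyPager:
    def __init__(self):
        self.file_blocks = 0        # size of the data file, in blocks
        self.free_queue = deque()   # blocks freed by deleted subtrees

    def allocate(self) -> int:
        if self.free_queue:
            return self.free_queue.popleft()   # reuse previously freed space
        self.file_blocks += 1                  # otherwise grow the file
        return self.file_blocks - 1

    def free_subtree(self, blocks):
        self.free_queue.extend(blocks)         # the file does not shrink

pager = ToyPager()
a_to_b = [pager.allocate() for _ in range(1000)]  # insert records between A and B
pager.free_subtree(a_to_b)                        # clear A..B: file size unchanged
x_to_y = [pager.allocate() for _ in range(1000)]  # insert records between X and Y
print(pager.file_blocks)                          # still 1000: freed blocks were reused
```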

The reason the data file does not shrink when it contains a lot of free space is that the free space is generally located throughout the file (aka “fragmented”) so there is no cheap way to truncate the file as first the free space must be consolidated to the end of the file.


I think the answer to that is simply “an incorrect one”, or at least “one with some serious understanding gaps about the underlying implementation” :wink:

I was thinking that for speed of reads across ranges the data was actually being structured on disk so that information in ‘close’ nodes of the tree was also close on disk. I realise now that’s complete rubbish. It might have given some read performance improvement on spinning rust, but would have necessitated moving a lot more data whenever the tree needed to expand, so would have been horrendous for writes. And it wouldn’t give any improvement on solid-state. So I was just… very, very wrong.

Anyway, thank you for your clarification. We might have an issue with too much slack, I still have to confirm that, but it’s likely that the new service is just writing a lot more data than we expected from prior calculations.

I was pretty sure of that part :slight_smile: I couldn’t tell where the gap was so I figured I’d just cover everything from scratch.

You are correct that it’s impractically expensive to maintain block level record adjacency, and even if it were done the physical data locations inside an SSD would be quite different due to how flash storage works so it wouldn’t necessarily help.

FWIW, adjacent Leaf nodes in Redwood can tend to be adjacent in the data file under the right conditions. When a Redwood instance is initially loaded with data via many batched chunks of sequential records (such as with FDB’s normal data movement), the data file will grow as new ranges of records are added to the end so many of the Leaf nodes in those ranges will be adjacent. Random edits will fragment this over time. However, since block re-usage in Redwood is often in FIFO order it is possible that a write pattern of mostly inserts and clears of non-trivial batches of adjacent records would result in maintaining a significant amount of Leaf node adjacency within the data file.

But why would this matter, given the second paragraph? Redwood often reads or writes sibling Leaf nodes in parallel, so if those result in adjacent block requests on a block device then the linux kernel can merge them into a single IO request to the disk. This yields a small overhead reduction, but more importantly if the block device is on a block service such as EBS then this merging reduces the number of IOPS which can save cost since IOPS is usually a billing dimension.

You can measure this reduction by looking at the Type=ProcessMetrics trace log events, where File[Writes|Reads] is the number of logical IO operations done on the file for the Elapsed period and Disk[Writes|Reads] is the number of operations the OS has completed to the block device during that period.
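A rough way to compute that ratio from the trace files (attribute names such as FileReads/DiskReads are assumed from the description above, so confirm them against your own logs):

```python
# Sum logical file reads vs. reads the OS actually issued to the block device
# across ProcessMetrics events. Attribute names are assumptions; check your
# own trace files before relying on the numbers.
import glob
import re

def attr(line, name):
    m = re.search(name + r'="([\d.eE+-]+)"', line)
    return float(m.group(1)) if m else None

for path in glob.glob("/var/log/foundationdb/trace.*.xml"):
    file_reads = disk_reads = 0.0
    with open(path) as f:
        for line in f:
            if 'Type="ProcessMetrics"' not in line:
                continue
            fr, dr = attr(line, "FileReads"), attr(line, "DiskReads")
            if fr is not None and dr is not None:
                file_reads += fr
                disk_reads += dr
    if disk_reads:
        print(path, "logical reads per device read:", round(file_reads / disk_reads, 2))
```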


Ooh, very interesting. Thank you!

Yes, we had noticed as we went from 7.1 ssd-2 → 7.3 ssd-2 that we could achieve noticeably higher performance (push more iops) from a single fdbserver process running the storage role, which we thought was related to the changes made to parallelise disk calls (? I’m trying to remember what I read in the upgrade notes) rather than waiting for an individual fetch call to return before issuing the next one. My recollection is that the older system from 7.1 was very disk latency sensitive, which made it not-great with remote attached storage like EBS.

When we then went from 7.3 ssd-2 → 7.3 ssd-redwood-1 we saw a pretty massive drop in reported iops usage for the same level of performance, which was also really nice for potential future cost savings :tada:.

The 7.1 to 7.3 change is likely due to the new streaming range read feature, which data movement in the cluster can use. It eliminates the round trip latency between consecutive range read result chunks for large range reads by fetching the next several (up to 20 I think) chunks of the key range (1 or 2 MB each) on the server side in parallel, and then returns them without waiting for the client to request them. This is a Storage Server feature so it works with any storage engine. Beyond this, Redwood will further boost range read performance because it can parallelize the read ops of sibling BTree nodes within the different key range chunks read by the Storage Server whereas the ssd-2 engine will do a serial traversal of the range.
