Our engineers have just deployed a data pattern that we’ve not seen before: they create a very large amount of new data in FDB on a periodic basis, and then clean up old versions of that data. It takes a comparatively long time to build a stable view of the data, and handling mutations to the ‘live’ dataset is complex and error prone, so instead they build a completely new dataset under a new prefix each period (day, week, whatever), flip a single key that tells the rest of the system which copy to use once the new copy is stable, and then delete the entire prefix of the old version.
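For concreteness, this is roughly the shape of that cut-over in the Python bindings (the key names and version labels here are made up, not our real schema):

```python
# Sketch of the versioned-prefix pattern, with hypothetical key names.
import fdb

fdb.api_version(730)
db = fdb.open()

@fdb.transactional
def switch_current(tr, new_version):
    # Point readers at the freshly built copy once it has been validated.
    tr[b"dataset/current"] = new_version

@fdb.transactional
def drop_old_version(tr, old_version):
    # Clear the old copy's entire prefix. The keys disappear logically, but
    # (as described below) the space inside the storage servers' on-disk
    # B-trees is not immediately freed.
    tr.clear_range_startswith(b"dataset/" + old_version + b"/")

# After the bulk load of the new version has finished:
switch_current(db, b"v0042")
drop_old_version(db, b"v0041")
```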
What we are seeing, when this feature is turned on, is that our disk usage just keeps increasing even though the old prefixes are supposedly being deleted. My investigation of this led me to Data retention after deleting a key range using SSD engine - #6 by gaurav, where I found out that neither `ssd-2` (what we were using until a few months ago) nor `ssd-redwood-1` (what we are using now, on FDB 7.3) reclaims disk space for cleared data by default. The data is retained in the on-disk B+Tree, and the space is only actually compacted/freed when the tree is rebuilt from scratch, for example as part of the process being excluded.
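The gap is easy to see if you compare the logical bytes each storage role reports with what its kvstore actually uses on disk. Something like the below works as a rough diagnostic; I’m assuming the `stored_bytes` / `kvstore_used_bytes` fields that `status json` emits on 7.x, so treat it as a sketch rather than a supported API:

```python
# Compare logical KV bytes vs on-disk kvstore bytes per storage role.
# Assumes `fdbcli` is on PATH and pointed at the right cluster file.
import json
import subprocess

raw = subprocess.run(
    ["fdbcli", "--exec", "status json"],
    check=True, capture_output=True, text=True,
).stdout
status = json.loads(raw)

for pid, proc in status["cluster"]["processes"].items():
    for role in proc.get("roles", []):
        if role.get("role") == "storage":
            logical = role.get("stored_bytes", 0)
            on_disk = role.get("kvstore_used_bytes", 0)
            print(f"{proc.get('address', pid)}: "
                  f"logical={logical / 1e9:.1f}GB on_disk={on_disk / 1e9:.1f}GB")
```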
Based on that, enabling `perpetual_storage_wiggle=1 storage_migration_type=gradual` should help with our unwanted data retention, since my understanding (including from when we enabled it for our `ssd-2` → `ssd-redwood-1` migration) is that it effectively excludes one storage process at a time, allows it to empty and completely wipe its disk, and then re-includes it to migrate data back via normal rebalancing. As the data is migrated away from the process (and the same or other data is later migrated back to it), it is effectively compacted.
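(For reference, the switch itself is just fdbcli’s `configure` command, equivalent to running `configure perpetual_storage_wiggle=1 storage_migration_type=gradual` interactively; it’s wrapped in a tiny script below only to keep these snippets in one language.)

```python
# Enable the perpetual wiggle with gradual migration via fdbcli.
import subprocess

subprocess.run(
    ["fdbcli", "--exec",
     "configure perpetual_storage_wiggle=1 storage_migration_type=gradual"],
    check=True,
)
```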
The problem is that this means we have to maintain a very high overhead of ‘available’ disk. In our `dev` clusters in particular, we don’t need high performance but we do have quite a lot of data. So we only have 2 storage processes, each on their own EC2 instance, per replication boundary (we’re running in `three_data_hall` mode, and our replication boundary is the AWS AZ). That means that each process needs to maintain >50% free disk space to be able to hold the entirety of the data from the other process when it is wiggled.
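Back-of-envelope (my own arithmetic, assuming a wiggled process’s data spreads evenly across the surviving processes in its boundary, and ignoring shard granularity and any temporary overlap during moves):

```python
def peak_fill_during_wiggle(steady_state_fill: float, n_processes: int) -> float:
    """Worst-case disk fill of a surviving storage process while one peer in
    the same replication boundary is wiggled, assuming its data redistributes
    evenly across the remaining n_processes - 1."""
    return steady_state_fill * n_processes / (n_processes - 1)

# Our dev setup: 2 storage processes per AZ, so anything over 50% steady-state
# fill can't survive a wiggle:
print(peak_fill_during_wiggle(0.45, 2))  # 0.90
# With, say, 6 processes per AZ the same 45% fill only peaks around 54%:
print(peak_fill_during_wiggle(0.45, 6))  # 0.54
```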
I’m also not clear on what happens to the temporarily-moved data after the wiggle is complete. If process A is wiggled, and all its data is migrated to process B such that process B is now, say, 80% full on disk, then process A wipes its on-disk files and is re-included so that the K/V data is redistributed between both processes again, will process B remain at 80% disk usage because it hasn’t itself been excluded? So with only 2 storage processes per replication boundary are we effectively just bouncing that very high usage between them as they are wiggled? Do we need to move to a pattern of more horizontal scaling, where we have more processes in a replication boundary, so that when one is wiggled its data is distributed between the others?