Data retention after deleting a key range using SSD engine

A slightly random question here. After deleting a key range within FDB, I’d like to understand how long the data remains on disk for. By my understanding, the SSD engine is a pretty vanilla SQLite database.

By default, from what I can see from the SQLite documentation, SQLite will leave data on disk until it reuses the pages. Related to this, SQLite has the secure_delete pragma to overwrite deleted data with zeros, https://www.sqlite.org/pragma.html#pragma_secure_delete, and it also states that one could VACUUM to gain similar benefits in terms of clearing old data.

Looking through the FoundationDB code at https://github.com/apple/foundationdb/blob/master/fdbserver/KeyValueStoreSQLite.actor.cpp, what I see is:

  • I can’t see that secure_delete PRAGMA is set.
  • I don’t think there is a VACUUM after each write.

This seems like a pretty sensible default way to work, frankly.

I can see a lot of code related to “spring cleaning” in KeyValueStoreSQLite.actor.cpp, however, which I can’t really follow sadly. Perhaps this is key here as it appears to be concerned with periodic VACUUMing of the database. The auto_vacuum=2 set on creating database files seems to mostly be about enabling a more efficient incremental vacuum.

So my basic question is whether I’m reading the code right about the behaviour that FoundationDB is setting up on its files. Also, what does the spring clean code do, and how do the various knobs mentioned in the code affect it? Broadly, there are two main scenarios I am considering:

  • Where we delete a key range but never touch the key ranges stored in that SQLite file again (modulo rebalance operations I guess).
  • Where we are regularly doing writes to key ranges that end up in that SQLite file.

Or am I totally barking up the wrong tree here :slight_smile: Anyway, thanks for making your way through the post and any light you can shed here.

A few of your questions are answered in the discussions below.

Auto vacuuming enables incrementally relocating free pages to the end of the page file so the space can be truncated to exclude those pages. In contrast, a full vacuum cycle rebuilds the btree to use a minimal space footprint, which effectively rewrites all pages so no previously unreferenced page content will be included in the result.
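To make the incremental case concrete, here is a minimal sketch against stock SQLite through Python's built-in sqlite3 module. It is not FoundationDB's modified SQLite, and the table name, row sizes, and function name are made up for illustration. It sets auto_vacuum=2 on a fresh file, frees a batch of pages with a delete, and then compares page_count and freelist_count before and after an incremental vacuum.

```python
import os
import sqlite3
import tempfile


def incremental_vacuum_demo(path):
    """Return ((pages, free_pages) before, (pages, free_pages) after) an incremental vacuum."""
    # isolation_level=None puts the connection in autocommit mode.
    conn = sqlite3.connect(path, isolation_level=None)

    def pragma(name):
        # Drain the cursor so the statement is fully finished before moving on.
        return conn.execute("PRAGMA " + name).fetchall()[0][0]

    # Must be set before the first table is created in a new file.
    conn.execute("PRAGMA auto_vacuum = 2").fetchall()  # 2 = INCREMENTAL
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB)")

    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO kv VALUES (?, ?)",
        ((i, b"x" * 500) for i in range(2000)),
    )
    conn.execute("COMMIT")

    # In incremental mode the freed pages stay on the freelist inside the file.
    conn.execute("DELETE FROM kv")
    before = (pragma("page_count"), pragma("freelist_count"))

    # The pragma does its work one page per result row, so fetch every row
    # to make it run to completion.
    conn.execute("PRAGMA incremental_vacuum").fetchall()
    after = (pragma("page_count"), pragma("freelist_count"))

    conn.close()
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(incremental_vacuum_demo(os.path.join(d, "demo.sqlite")))
```

The file only shrinks when the incremental vacuum actually runs; until then the free pages, and whatever bytes they held, stay inside the file.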

If you are worried about forensic recovery of cleared KV pairs, then I think at a minimum you would need to set the secure_delete pragma. Assuming your filesystem zeroes blocks on disk that have been truncated off of a file, wholly unused pages that are vacuumed off should be wiped clean. However, range clears are also likely to start or end in the middle of a page and clear only part of it. In such pages, the deleted record bytes will still be present but unreferenced until a new record uses the space, so without further modification the removed bytes will still exist for forensic discovery.

I have to point out, however, that our SQLite source is actually not vanilla; it’s been modified a good bit, and it is possible that in our modifications we weakened or broke the secure_delete contract.
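For stock SQLite (again, not the modified copy inside FoundationDB), the partial-page residue is easy to see with a small experiment. The sketch below uses Python's built-in sqlite3 module; the marker value, table layout, and function name are made up for illustration. It deletes one row from a page that still holds other rows and then searches the raw file bytes for the deleted value.

```python
import os
import sqlite3
import tempfile

# Hypothetical value standing in for a cleared KV pair.
MARKER = b"forensic-marker-0123456789-abcdefghij"


def marker_survives(secure_delete, vacuum):
    """Return True if MARKER is still in the raw file after its row is deleted."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.sqlite")
        conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode

        # Set the pragma explicitly, since some builds change the default.
        setting = "ON" if secure_delete else "OFF"
        conn.execute("PRAGMA secure_delete = " + setting).fetchall()

        conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB)")
        for i in range(10):
            value = MARKER if i == 5 else b"filler-%d" % i
            conn.execute("INSERT INTO kv VALUES (?, ?)", (i, value))

        # The page keeps nine live rows, so it is not freed as a whole.
        conn.execute("DELETE FROM kv WHERE k = 5")
        if vacuum:
            conn.execute("VACUUM")  # full rebuild of the file
        conn.close()

        with open(path, "rb") as f:
            return MARKER in f.read()


if __name__ == "__main__":
    print("plain delete  :", marker_survives(secure_delete=False, vacuum=False))
    print("secure_delete :", marker_survives(secure_delete=True, vacuum=False))
    print("delete+VACUUM :", marker_survives(secure_delete=False, vacuum=True))
```

This only inspects the bytes of the database file itself. It says nothing about what the filesystem or the drive keeps underneath, and nothing about whether FoundationDB's modified SQLite behaves the same way.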

:ok_hand: Thanks for the details, I think I have enough for now. Essentially, yes, the question was around how long relatively easy-to-retrieve traces of data remain on disk.

I guess the obvious question I missed: is the new Redwood engine far enough along that it is reasonable to ask how it handles data deletion and length of time deleted items remain easily recoverable?

Currently in Redwood, deleted data or parts of it could remain recoverable indefinitely because:

  • Free pages within a Redwood file are not returned to the filesystem due to the additional IO necessary to maintain the required metadata to do this.
  • Within a page, deleted records are initially marked as deleted with a single bit change, though a later rebuild of the page contents (to reclaim freed bytes or for other reasons) will remove the deleted record bytes.
  • Since pages in Redwood are never modified in-place, multiple copies of a page with deleted record bytes can exist in pages retained as part of the retained version interval or in pages that are in the free list.
  • Redwood is a B+Tree (SQLite is a B-Tree) so records above the leaf level are not user KV pairs but rather minimal length boundary keys used for traversal decisions. This means that given the user KV pairs
    Aaaa -> 1342
    Bbbb -> 213
    Cccc -> 12344
    then the string B could be used as a minimal boundary key and would exist in an internal page. Deleting the record with a key of Bbbb would not remove that boundary key, so part of the deleted key is still recoverable. This is a simple example, but the same thing can happen with longer shared prefixes that cross page boundaries: after removing all records with a certain shared prefix, that shared prefix plus one byte is still present as a boundary key in an internal BTree page.
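The boundary key point can be illustrated with a toy model. This is not Redwood's code or on-disk format; the function names and the two-leaf layout are assumptions made up for the example. It picks the shortest prefix of the right-hand key that still sorts after the left-hand key, then shows that deleting the record from the leaf leaves the boundary key in the internal page untouched.

```python
def shortest_separator(left: bytes, right: bytes) -> bytes:
    """Shortest prefix s of `right` such that left < s <= right."""
    if not left < right:
        raise ValueError("keys must be strictly ordered")
    # Length of the prefix shared by both keys.
    n = 0
    while n < len(left) and n < len(right) and left[n] == right[n]:
        n += 1
    # Shared prefix plus one byte of the right-hand key.
    return right[: n + 1]


def delete_and_inspect():
    """Toy two-leaf tree: delete Bbbb from its leaf, return the internal page."""
    leaf_pages = [[b"Aaaa"], [b"Bbbb", b"Cccc"]]
    # The internal page holds only the boundary between the two leaves.
    internal_page = [shortest_separator(leaf_pages[0][-1], leaf_pages[1][0])]

    leaf_pages[1].remove(b"Bbbb")  # the user record is gone from the leaf

    # Nothing in a plain delete touches the internal page.
    return leaf_pages, internal_page


if __name__ == "__main__":
    print(shortest_separator(b"Aaaa", b"Bbbb"))
    print(shortest_separator(b"user/123/a", b"user/124/a"))
    print(delete_and_inspect())
```

With longer keys the surviving boundary key is the shared prefix plus one byte, which is the case described in the last bullet.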

It would be possible to add options to always rebuild page contents when a record is being deleted, and to explicitly write all zeroes to freed pages. However, this would not solve the boundary key “leak” described above.

I also want to point out that overwriting a block on an SSD would not normally overwrite the same physical storage that held the old data, nor would it erase the old copy until some unspecified point in the future, which depends on the drive’s internal logic.

Is this a fixed design constraint for Redwood, or something that would be relaxed later on?
There are a lot of real scenarios we have come across where extra data got created due to unforeseen situations/bugs, which gets cleaned up later.

Even if there was a manual compaction command/tool, it might be useful.

There are several ways that file shrinking support could be added, with various performance tradeoffs, and likely an on-disk format change. A format change just means a migration to the new storage engine format must be done, but that is a process which already has to work smoothly so that current production deployments can migrate to Redwood without service interruption.

For a manual compaction, such as after a temporary space usage increase, there is actually already a way to make this happen: for each host’s storage processes (or, for clusters with a small host count, for each storage process), exclude the storage process(es), wait for data distribution to move data away from them, then re-include the storage process(es). This will effectively rebuild compact BTrees of all of the data.
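As a rough sketch of that procedure, the helper below builds the fdbcli invocations for one host's storage processes. The addresses are hypothetical, the helper names are made up, and it only prints the commands unless dry_run is turned off. Whether exclude waits for the data to move before returning is something to confirm for your version, so check cluster status before running the include step.

```python
import shlex
import subprocess


def compaction_commands(addresses):
    """Build the exclude / re-include fdbcli invocations for one host's processes."""
    targets = " ".join(addresses)
    return [
        # Step 1: ask data distribution to move data off these processes.
        ["fdbcli", "--exec", "exclude " + targets],
        # Step 2: once the data has moved away, let the processes rejoin.
        ["fdbcli", "--exec", "include " + targets],
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical storage process addresses on a single host.
    run(compaction_commands(["10.0.0.1:4500", "10.0.0.1:4501"]))
```

Working through one host (or one process) at a time keeps the amount of data in flight bounded, in the same way as a maintenance-driven exclusion.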

What I’ve just described is essentially a storage engine migration from one storage engine to the same storage engine, as the steps given are how storage engine migration already works. It’s just that currently the actual storage engine migration logic does not let you do this. It will be supported in some form, possibly as noted here: https://github.com/apple/foundationdb/issues/2640 or some other way.

A compaction tool is possible but not trivial, as it must work on a live storage engine. It is possible we would create one if there is a strong enough need for it. The storage engine self-migration method is not particularly IO or network efficient compared to a custom tool, but on the other hand it is already written and its performance impact is known: it is essentially the same as a maintenance-driven host/process exclusion.

Thanks for the detailed reply.

Fantastic reply; thank you for the detail.