Data retention after deleting a key range using SSD engine

mikerhodes · June 18, 2020, 4:06pm

A slightly random question here. After deleting a key range within FDB, I’d like to understand how long the data remains on disk. By my understanding, the SSD engine is a pretty vanilla SQLite database.

By default, from what I can see from the SQLite documentation, SQLite will leave data on disk until it reuses the pages. Related to this, SQLite has the secure_delete pragma to overwrite deleted data with zeros, https://www.sqlite.org/pragma.html#pragma_secure_delete, and it also states that one could VACUUM to gain similar benefits in terms of clearing old data.
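To make the stock SQLite behaviour described above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It says nothing about FoundationDB's modified SQLite copy; the table name, marker value, and helper name are all made up for illustration. It inserts a row, deletes it, and then scans the raw database file for the deleted bytes under three settings.

```python
import os
import sqlite3
import tempfile

MARKER = "FORENSIC-MARKER-0123456789"  # hypothetical sensitive value


def marker_survives(mode: str) -> bool:
    """Insert then delete a row; report whether its bytes are still in the file.

    mode is "plain" (secure_delete off), "secure" (secure_delete on),
    or "vacuum" (secure_delete off, followed by a full VACUUM).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        conn = sqlite3.connect(path)
        # Set the pragma explicitly: some SQLite builds default to ON.
        setting = "ON" if mode == "secure" else "OFF"
        conn.execute("PRAGMA secure_delete = " + setting).fetchall()
        conn.execute("CREATE TABLE kv (k TEXT, v TEXT)")
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)",
            [("a", "keep-1"), ("b", MARKER), ("c", "keep-2")],
        )
        conn.commit()
        conn.execute("DELETE FROM kv WHERE k = 'b'")
        conn.commit()
        if mode == "vacuum":
            # Rebuilds the whole database file from the live rows only.
            conn.execute("VACUUM")
        conn.close()
        # Scan the raw file for the deleted value's bytes.
        with open(path, "rb") as f:
            return MARKER.encode() in f.read()
```

Calling `marker_survives("plain")`, `marker_survives("secure")`, and `marker_survives("vacuum")` compares the three cases. This only inspects the database file itself; it does not account for journal files, filesystem behaviour, or what the drive does underneath.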

Looking through the FoundationDB code at https://github.com/apple/foundationdb/blob/master/fdbserver/KeyValueStoreSQLite.actor.cpp, what I see is:

I can’t see that the secure_delete PRAGMA is set.
I don’t think there is a VACUUM after each write.

This seems like a pretty sensible default way to work, frankly.

I can see a lot of code related to “spring cleaning” in KeyValueStoreSQLite.actor.cpp, however, which I can’t really follow sadly. Perhaps this is key here as it appears to be concerned with periodic VACUUMing of the database. The auto_vacuum=2 set on creating database files seems to mostly be about enabling a more efficient incremental vacuum.

So my basic question is whether I’m reading the code right about the behaviour that FoundationDB is setting up on its files. Also, what does the spring clean code do, and how do the various knobs mentioned in the code affect it? Broadly, there are two main scenarios I am considering:

Where we delete a key range but never touch the key ranges stored in that SQLite file again (modulo rebalance operations I guess).
Where we are regularly doing writes to key ranges that end up in that SQLite file.

Or am I totally barking up the wrong tree here? Anyway, thanks for making your way through the post and any light you can shed here.

tuk · June 18, 2020, 4:51pm

A few of your questions are answered in the discussions below.

SteavedHams · June 18, 2020, 8:33pm

Auto vacuuming enables incrementally relocating free pages to the end of the page file so the space can be truncated to exclude those pages. In contrast, a full vacuum cycle rebuilds the btree to use a minimal space footprint, which effectively rewrites all pages so no previously unreferenced page content will be included in the result.
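As a rough illustration of the incremental side of this, the sketch below uses stock SQLite through Python's sqlite3 module, not FoundationDB's modified copy or its spring-cleaning code. The table name, row sizes, and function name are arbitrary assumptions. It sets auto_vacuum=2 before creating any tables, deletes most rows, and then runs an incremental vacuum so the freed pages are truncated off the end of the file.

```python
import os
import sqlite3
import tempfile


def incremental_vacuum_demo() -> dict:
    """Return freelist and file-size figures before and after incremental vacuum."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        conn = sqlite3.connect(path)
        # auto_vacuum must be chosen before the first table is created.
        conn.execute("PRAGMA auto_vacuum = 2").fetchall()
        conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v BLOB)")
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)",
            ((i, b"x" * 2000) for i in range(500)),
        )
        conn.commit()
        conn.execute("DELETE FROM kv WHERE k >= 100")
        conn.commit()
        # In incremental mode the freed pages stay in the file for now.
        free_before = conn.execute("PRAGMA freelist_count").fetchone()[0]
        size_before = os.path.getsize(path)
        # The pragma is stepped like a query, so fetchall() drives it to completion.
        conn.execute("PRAGMA incremental_vacuum").fetchall()
        free_after = conn.execute("PRAGMA freelist_count").fetchone()[0]
        conn.close()
        size_after = os.path.getsize(path)
        return {
            "free_before": free_before,
            "free_after": free_after,
            "size_before": size_before,
            "size_after": size_after,
        }
```

Note that this only shrinks the file. Whether the truncated blocks are actually erased is up to the filesystem and the drive, and pages that remain in the file are not rewritten.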

If you are worried about forensic recovery of cleared KV pairs, then I think at a minimum you would need to set the secure_delete pragma. Assuming your filesystem zeroes blocks on disk that have been truncated off of a file, wholly unused pages that are vacuumed off the end of the file should be wiped clean. However, range clears are also likely to start or end in the middle of a page and clear only part of it. In such pages, the deleted record bytes will still be present but unreferenced until a new record uses the space, so without further modification the removed bytes will still exist for forensic discovery.

I have to point out, however, that our SQLite source is actually not vanilla, it’s been modified a good bit, and it is possible that in our modifications we weakened or broke the secure_delete contract.

mikerhodes · June 19, 2020, 1:22pm

Thanks for the details, I think I have enough for now. Essentially, yes, the question was around how long relatively easy-to-retrieve traces of data remain on disk.

I guess the obvious question I missed: is the new Redwood engine far enough along that it is reasonable to ask how it handles data deletion and length of time deleted items remain easily recoverable?

SteavedHams · June 19, 2020, 9:19pm

Currently in Redwood, deleted data or parts of it could remain recoverable indefinitely because

Free pages within a Redwood file are not returned to the filesystem due to the additional IO necessary to maintain the required metadata to do this.
Within a page, deleted records are initially marked as deleted with a single bit change, though a later rebuild of the page contents (to reclaim freed bytes or for other reasons) will remove the deleted record bytes.
Since pages in Redwood are never modified in-place, multiple copies of a page with deleted record bytes can exist in pages retained as part of the retained version interval or in pages that are in the free list.
Redwood is a B+Tree (SQLite is a B-Tree) so records above the leaf level are not user KV pairs but rather minimal length boundary keys used for traversal decisions. This means that given the user KV pairs
Aaaa -> 1342
Bbbb -> 213
Cccc -> 12344
then the string B could be used as a minimal boundary key, it would exist in an internal page, and deleting the record with a key of Bbbb would not remove the boundary key so part of the deleted key is still recoverable. This is a simple example but it can also happen with longer shared prefixes that cross page boundaries, where after removing all records with a certain shared prefix that shared prefix plus one byte is still present as a boundary key in an internal BTree page.
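The boundary key effect can be sketched with a toy model. This is not Redwood's code; the function name and the two-leaf layout are assumptions made purely for illustration. The boundary between two adjacent leaf pages is taken to be the shortest prefix of the right page's first key that still sorts after the left page's last key, which is the shared prefix plus one byte.

```python
def minimal_boundary_key(left: bytes, right: bytes) -> bytes:
    """Shortest prefix of `right` that sorts strictly after `left`."""
    if not left < right:
        raise ValueError("left must sort strictly before right")
    for i in range(len(right)):
        # First position where right diverges from (or outruns) left.
        if i >= len(left) or right[i] != left[i]:
            return right[: i + 1]
    raise AssertionError("unreachable when left < right")


# Toy layout: two leaf pages, split between Aaaa and Bbbb.
leaf_pages = [[b"Aaaa"], [b"Bbbb", b"Cccc"]]
boundary = minimal_boundary_key(leaf_pages[0][-1], leaf_pages[1][0])

# Deleting the record only touches the leaf page in this model;
# the internal page still holds `boundary`, a prefix of the deleted key.
leaf_pages[1].remove(b"Bbbb")
```

With a longer shared prefix, for example the hypothetical keys `user:alice:email` and `user:alice:phone` landing on either side of a page split, the same function yields `user:alice:p`, which would outlive the deletion of every `user:alice:` record.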

It would be possible to add options to always rebuild page contents when a record is being deleted, and to explicitly write all zeroes to freed pages. However, this would not solve the boundary key “leak” described above.

I also want to point out that overwriting a block on an SSD would not normally overwrite the same physical storage that held the old data, nor would it erase the old copy until some unspecified point in the future, which depends on the drive’s internal logic.

gaurav · June 20, 2020, 3:51am

Is this a fixed design constraint for Redwood, or something that would be relaxed later on?
There are a lot of real scenarios we have come across where extra data got created due to unforeseen situations/bugs, which gets cleaned up later.

Even if there was a manual compaction command/tool, it might be useful.

SteavedHams · June 20, 2020, 5:20am

There are several ways that file shrinking support could be added, with various performance tradeoffs, and likely an on-disk format change. A format change just means a migration to the new storage engine format must be done, but that is a process which already has to work smoothly so that current production deployments can migrate to Redwood without service interruption.

For a manual compaction, such as after a temporary space usage increase, there is actually already a way to make this happen: for each host’s storage processes (or, for clusters with a small host count, for each storage process), exclude the storage process(es), wait for data distribution to move data away from the host, then re-include the storage process(es). This will effectively rebuild compact BTrees of all of the data.
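The exclude and re-include cycle above could be scripted roughly as follows. This is a sketch under assumptions: the process addresses are hypothetical, it relies on fdbcli's `exclude` and `include` commands run via `--exec`, and it assumes `exclude` blocks until data has been moved off the process, which should be verified for the FoundationDB version in use. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def rebuild_commands(addresses: List[str]) -> List[List[str]]:
    """Build one exclude/include pair of fdbcli invocations per address."""
    commands = []
    for addr in addresses:
        # Assumption: exclude returns only once data has moved away.
        commands.append(["fdbcli", "--exec", f"exclude {addr}"])
        commands.append(["fdbcli", "--exec", f"include {addr}"])
    return commands


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them one at a time when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Example with made-up addresses:
# run(rebuild_commands(["10.0.0.1:4500", "10.0.0.2:4500"]))
```

Processing one address (or one host's addresses) at a time keeps the amount of data in flight bounded, at the cost of a longer overall run.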

What I’ve just described is essentially a storage engine migration from one storage engine to the same storage engine, as the steps given are how storage engine migration already works. It’s just that currently the actual storage engine migration logic does not let you do this. It will be supported in some form, possibly as noted in issue #2640 on apple/foundationdb (“Support for storage engine parameters in configuration string”), or in some other way.

A compaction tool is possible but not trivial as it must work on a live storage engine. It is possible we would create one if there is a strong enough need for it. The storage engine self migration method is not particularly IO or network efficient compared to a custom tool, but on the other hand it is essentially already written and its performance impact is known, which is essentially the same as a maintenance-driven host/process exclusion.

gaurav · June 20, 2020, 11:49am

Thanks for the detailed reply.

mikerhodes · June 22, 2020, 12:07pm

Fantastic reply; thank you for the detail.
