I’m considering FDB for the following use case:
- A key/value store with short binary keys and moderately-sized values (<1KB)
- Operations are mostly point queries by key (read and upsert)
- Key/value pairs expire and are removed after some time (weeks/months). Every update logically bumps the TTL for the record.
1 and 2 seem to fit very naturally into FDB. The interesting part is how expiration should work.
For expiration, it’s not critical that keys are deleted immediately when they expire; we’re more concerned with the overall database size and ensuring that the overhead from expired data isn’t a large fraction of the total size.
Here are two approaches that occurred to me:
Approach A: a background process iterates all keys and clears the expired ones
We store an expiration timestamp inside each value. Every value update also needs to replace the expiration with [now + expiration].
We have a background expiration process (or processes) which continually iterate through all the keys using range reads. When this process locates keys that have expired, it deletes them.
Approach B: organize keys together by when they expire
There are a few different ways to go about this; here’s one way.
Suppose we want 1 month expiry. We can group keys by expiration day. We’ll need a level of indirection to go from the original key to the appropriate day-prefixed key:
2018-05-01:key1 -> val1 2018-05-01:key3 -> val3 ... 2018-05-02:key2 -> val2 ... key1 -> 2018-05-01 key2 -> 2018-05-02 key3 -> 2018-05-01 ...
- For every update, if the expiration day isn’t [today + 1 month], then we need to move the key to the current expiration date (i.e., create a new key for [now + 1 month] with the value; clear the old date-prefixed key; update the date associated with the key).
- To delete old data, do a range query to get all the expired indirection keys and delete them one-by-one. Then do a single range clear to delete the expired keys/vals.
Though now that I’ve written that out, it sort of seems like B is doing all the work of A over and over again as it moves the keys to later expiration days (our key sets are fairly stable). Perhaps there’s some other way to do less work will still being able to delete the bulk of the data with single range clear.
Do either of these make sense? Is there another, better approach to consider? What are the performance implications of deleting lots of keys one-at-a-time? What about space overhead/reclamation?