Designing an expiring key/value store

(I previously asked about key expiry here – this is a different, but related, use case.)

I’m considering FDB for a use case with the following properties:

  • Simple key-value storage
  • Keys are 8-byte hashes
  • Values are large (mean 5k) binary blobs
  • Each key/value pair expires after T time has passed (perhaps 24h)
  • Keys are never deleted apart from expiration
  • Writes to existing keys must be rejected
  • The rate of writes is fairly high (10k/sec)
  • The rate of reads is low (500/sec)
  • Most reads are either for nonexistent keys or for recently written keys

Option A would be to store the values with the expiry keys:

(timestamp, key) -> value
(key) -> timestamp

Option B is to store the values with the primary keys:

(timestamp, key) -> ""
(key) -> value
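For concreteness, here is one way the two layouts might be encoded (a rough Python sketch using raw byte concatenation in place of FDB's tuple layer; the `EXPIRY`/`PRIMARY` prefixes and the unix-seconds timestamp are my assumptions, and option B packs the timestamp alongside the value so reads can check expiry):

```python
import struct

EXPIRY = b"e"   # hypothetical subspace prefix for the expiry index
PRIMARY = b"p"  # hypothetical subspace prefix for the primary records

def encode_option_a(ts: int, key: bytes, value: bytes):
    """Option A: the value lives under (timestamp, key)."""
    return [
        (EXPIRY + struct.pack(">Q", ts) + key, value),   # (ts, key) -> value
        (PRIMARY + key, struct.pack(">Q", ts)),          # key -> ts
    ]

def encode_option_b(ts: int, key: bytes, value: bytes):
    """Option B: the value lives under the primary key."""
    return [
        (EXPIRY + struct.pack(">Q", ts) + key, b""),     # (ts, key) -> ""
        (PRIMARY + key, struct.pack(">Q", ts) + value),  # key -> ts + value
    ]
```

Big-endian packing keeps the expiry index sorted by time, so everything older than a cutoff is one contiguous range.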

I was leaning toward A because then the data that expires at the same time should be sequential, so I would think that writes and deletes would be more efficient. But on the other hand, does this concentrate lots of writes on the same FDB process?

With B, reads only have to consult the primary key, not the expiry key, but given the high write:read ratio I don’t care about optimizing for reads. (Note that writes have to consult the PK in either A or B because of the need to detect and reject duplicate keys.)

Should I go with A or B or something else entirely?

Is FoundationDB a good fit for this use case? I think the common wisdom is that B-tree based DBs are best for read-heavy workloads and you want an LSM for write-heavy workloads, but it seems that FDB should be okay for this given enough machines with fast SSDs.

Do you care about keys appearing expired during the seconds to minutes between the exact moment of expiration and the time when some garbage collection process can clean those records up? If you don’t, option B (or option B with the key spread around the timestamp to spread out writes) seems OK; depending on the rate of rejected writes, though, you might be pulling down a lot of 5k blobs.

If so, you’d need an efficient way to go from key to timestamp, which eliminates option B.

I don’t claim any particular expertise, but given your description, I would try to do something like:

'k', key -> value
't', key -> timestamp
's', key[0], timestamp, key[1:] -> empty

To read, you pull the key from the k and t prefixes; if the timestamp is old, you can return nothing.

To write, you check for the key in the t space and can abort or overwrite the old value as appropriate. Writing fills out all three keys, with the s space splitting the key on either side of the timestamp to spread the writes out across processes.
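A rough in-memory simulation of that read/write path (a plain Python dict stands in for an FDB transaction; the unix-seconds `now` parameter and the 24 h TTL are my assumptions):

```python
import struct

TTL = 24 * 3600  # assumed expiry window, in seconds
db = {}          # dict standing in for an FDB transaction

def pack_ts(ts: int) -> bytes:
    return struct.pack(">Q", ts)

def write(key: bytes, value: bytes, now: int) -> bool:
    # Duplicate detection: consult the 't' space first.
    old = db.get(b"t" + key)
    if old is not None and now - struct.unpack(">Q", old)[0] < TTL:
        return False  # key exists and has not yet expired: reject
    db[b"k" + key] = value
    db[b"t" + key] = pack_ts(now)
    # 's' space: split the key around the timestamp to spread writes.
    db[b"s" + key[:1] + pack_ts(now) + key[1:]] = b""
    return True

def read(key: bytes, now: int):
    ts = db.get(b"t" + key)
    if ts is None or now - struct.unpack(">Q", ts)[0] >= TTL:
        return None  # missing, or expired but not yet garbage-collected
    return db.get(b"k" + key)
```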

Garbage collection can proceed in parallel by searching each of the 256 subspaces of s for timestamps that have expired. Expiration then becomes just a performance/space optimization: reads and writes still function and efficiently implement your expiration even if the garbage collector never runs.
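The garbage collector could then walk each of the 256 s-subspaces independently, along these lines (same dict stand-in as above; a real implementation would use range reads and range clears per transaction):

```python
import struct

def gc_bucket(db: dict, bucket: int, cutoff: int) -> None:
    """Delete every record in s-subspace `bucket` whose timestamp < cutoff."""
    prefix = b"s" + bytes([bucket])
    for skey in [k for k in db if k.startswith(prefix)]:
        ts = struct.unpack(">Q", skey[2:10])[0]  # layout: 's', key[0], ts, key[1:]
        if ts < cutoff:
            key = skey[1:2] + skey[10:]  # reassemble the original 8-byte key
            db.pop(b"k" + key, None)
            db.pop(b"t" + key, None)
            del db[skey]
```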

(On further consideration, maybe your option A with the key splitting trick would be fine?)


I was planning on checking the timestamp and returning “does not exist” for expired-but-not-deleted keys.

(I didn’t write it down, but I was planning on storing the timestamp with the value in option B; it’s necessary to have the timestamp available for all reads.)

Thanks for the key spreading idea – that’s clever.

I feel like A plus key-spreading is the best contender given what I’m assuming and what you wrote. B is even simpler, though – I wonder what the performance hit of not consolidating values-to-delete-together is. I’ll probably have to benchmark.

Thanks for the feedback!

What happens if the system clock of the garbage-collecting process shifts far into the future, or the past? It would either instantly wipe all newly inserted keys or never garbage-collect them. Even with NTP properly set up, a single mistake by someone or something on one of the physical hosts for your processes could wipe all data. (I recently observed some event log entries on one of my servers dated December 12th, 2018, and have NO idea how they got there.)

EDIT: and the reverse is also true: if the clock of one of the processes that writes keys has drifted into the future, you’ll end up with a batch of keys that stick around far longer than 24h (in my example above, they would have lasted a few months).

Using versionstamps, or read/commit versions, to extrapolate a notion of “time elapsed since” could maybe help solve this (as long as you don’t need to-the-second precision but “about 24h, more or less”).
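Since FDB advances commit versions at a target rate of roughly 1,000,000 per second, one could approximate elapsed time from a version delta instead of a wall clock. A sketch of that idea (the constant and helper names are mine, and the precision is only "about 24 h" — recoveries and load can stretch or compress the rate somewhat):

```python
VERSIONS_PER_SECOND = 1_000_000  # FDB's target rate of version advancement

def seconds_between(write_version: int, read_version: int) -> float:
    """Approximate wall-clock seconds elapsed between two database versions."""
    return (read_version - write_version) / VERSIONS_PER_SECOND

def is_expired(write_version: int, read_version: int,
               ttl_seconds: int = 24 * 3600) -> bool:
    # Compare the extrapolated elapsed time against the TTL.
    return seconds_between(write_version, read_version) >= ttl_seconds
```

Here the write would record the commit version (e.g. via a versionstamped value) and the read would compare it against the transaction's read version.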

FDB can do okay at this. You’re looking to drive ~50MB/s of write load, which I’ve seen FDB do.

Your workload description does hit the exact checkboxes of:

  • Significantly more writes than reads
  • All keys are independent and uncorrelated
  • Ordering of keys is unimportant

That largely means most of FDB’s features aren’t actually interesting or helpful for you. If you’re already running a Cassandra/HBase/Riak/etc.-style database internally, then this use case would fit well on one of those. But either way, you should probably just optimize for whatever is operationally the most sane for you to do, so if you’re already mainly running FoundationDB, then put the use case on FDB for the operational sanity.

I would recommend spreading the load across the key space. Most ordered databases struggle when every key you write is strictly greater than every key previously written. FDB less so than others, but it still makes trying to evenly distribute load across the cluster rather difficult.

Sometimes one can push garbage collection work to clients. Assuming your keys are well distributed, if instead of looking up a key in 't' to test its existence, you do a prefix range read of the first 3 bytes of the key in 't', then you’ll have a range read cover every key in the database about five times a day (so expected 20% overhead for storing to-be-expired values). If you write your client to also asynchronously expire other keys in the database when it does a write, then you don’t need a second garbage collection process.
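A sketch of that client-driven variant: on each write, the duplicate check becomes a 3-byte-prefix range read in 't', and the client piggybacks deletion of any expired keys that read happens to surface (dict stand-in for the database again; the TTL and layout details are assumptions carried over from the earlier scheme):

```python
import struct

TTL = 24 * 3600  # assumed expiry window, in seconds

def write_with_cleanup(db: dict, key: bytes, value: bytes, now: int) -> bool:
    prefix = b"t" + key[:3]
    exists = False
    # Range-read every 't' entry sharing the first 3 bytes of the key.
    for tkey in [k for k in db if k.startswith(prefix)]:
        ts = struct.unpack(">Q", db[tkey])[0]
        if now - ts >= TTL:
            # Opportunistically collect expired neighbors.
            stale = tkey[1:]
            db.pop(b"t" + stale, None)
            db.pop(b"k" + stale, None)
        elif tkey == b"t" + key:
            exists = True
    if exists:
        return False  # reject duplicate
    db[b"k" + key] = value
    db[b"t" + key] = struct.pack(">Q", now)
    return True
```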

(By the birthday paradox, you’d expect two keys to land in the same one of the 256^3 buckets once you reach about sqrt(256^3) ≈ 4096 values.)

I think this still doesn’t protect you from significant time drift between servers. I think you could see a large version jump if a recovery moved the master from a machine behind in time to one ahead in time. Repeated recoveries would accelerate the passage of time, though realistically probably only by O(minutes).