Considerations for number of Key-Vals in the cluster

Hi,

I am planning to build a metric time-series layer on top of FDB, and this layer could potentially create a few trillion rows in the system. So I wanted to check if there are any aspects I should consider before starting out?

Most of the data will be cold, and I will try to model it so that a coarse time_bucket is prefixed to the key (in addition to the timestamp being suffixed), to produce ranges that become immutable once the time_bucket for that range has aged out. The goal is to ensure that data that has gone cold does not incur any further churn. Something like:

coarse_time_bucket/series_key/timestamp -> values
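
For what it's worth, here is a minimal sketch of that key layout using the Python bindings and the tuple layer. The directory name, API version, bucket width, and series naming are assumptions for illustration, not part of any existing layer.

```python
import time

import fdb

fdb.api_version(630)  # assumption: use whatever API version your cluster runs
db = fdb.open()

# Illustrative directory for the metrics layer; the name 'metrics' is made up.
metrics = fdb.directory.create_or_open(db, ('metrics',))

BUCKET_SECONDS = 86_400  # one-day coarse buckets, purely an assumption

@fdb.transactional
def write_point(tr, series_key, ts, value):
    # Coarse bucket first so old ranges stop receiving writes once their
    # bucket has passed; timestamp last so points within a series stay in
    # time order and can be read back with a single range read.
    coarse_bucket = int(ts) // BUCKET_SECONDS
    key = metrics.pack((coarse_bucket, series_key, int(ts)))
    tr[key] = fdb.tuple.pack((value,))

write_point(db, 'cpu.host42.user', time.time(), 0.37)
```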

Is there any limitation on the absolute number of KV pairs that can be stored in the system? Assuming there is enough disk in the cluster to hold the KV pairs, are there any other resource requirements that grow in proportion to the number of KVs (like an in-memory map holding the locations of keys, etc.)?


thanks,
gaurav

Hm, well, a little. The primary metric we usually use when evaluating database size is how many key and value bytes are in the database rather than the number of keys. That said, there are a few things, like the byte sample used by data distribution, that will perform somewhat worse for a cluster of a given byte size as the number of keys grows. There is also some per-key overhead at the B-tree level that you will have to pay as well.

The other thing that might not be obvious is that if your keys share a common prefix (which it sounds like they do), then you might end up “wasting” space on the common prefix (as the storage layer does not do any prefix compression).
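
To give a feel for why that matters, a quick back-of-envelope calculation of how much of each KV pair the repeated prefix takes up (every byte count below is a made-up assumption, not a measurement):

```python
# Rough estimate of the uncompressed prefix share in one KV pair.
bucket_bytes = 5        # packed coarse_time_bucket (assumed)
series_key_bytes = 40   # raw series name plus tags (assumed)
timestamp_bytes = 9     # packed 64-bit timestamp (assumed)
value_bytes = 9         # one packed double (assumed)

key_bytes = bucket_bytes + series_key_bytes + timestamp_bytes
per_pair = key_bytes + value_bytes
prefix_fraction = (bucket_bytes + series_key_bytes) / per_pair
print(f"{per_pair} bytes per KV pair, {prefix_fraction:.0%} of it repeated prefix")
```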

Having said that, I would probably think about this mostly in terms of bytes rather than the number of keys, and maybe estimate the overhead factor to be slightly higher. (On a triple-replicated cluster, you can expect something like 4x storage overhead: one factor for each of the three replicas, plus roughly one more for the per-key overhead.)
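
To make that concrete, a rough sizing sketch under that ~4x rule of thumb; the pair count and bytes-per-pair are assumptions you would replace with your own estimates:

```python
# Very rough cluster sizing: 3 replicas plus ~1x per-key/B-tree overhead.
num_pairs = 1_000_000_000_000   # "a few trillion rows" lower bound (assumed)
bytes_per_pair = 60             # packed key + value before replication (assumed)
overhead_factor = 4             # 3 replicas + ~1x per-key overhead

logical_bytes = num_pairs * bytes_per_pair
disk_bytes = logical_bytes * overhead_factor
print(f"logical: {logical_bytes / 1e12:.0f} TB, on disk: {disk_bytes / 1e12:.0f} TB")
```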

Thanks for the reply. I plan to add a level of indirection that maps each series_key to a long id, to overcome the repetition overhead.
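
Something along these lines, as a sketch of that indirection with the Python bindings; the subspace names are hypothetical, and a real layer would probably prefer a high-contention allocation scheme over a single counter key:

```python
import fdb

fdb.api_version(630)  # assumption
db = fdb.open()

# Hypothetical subspaces for the id indirection; names are illustrative only.
ids = fdb.Subspace(('series_id',))       # series_key -> id
names = fdb.Subspace(('series_name',))   # id -> series_key (reverse lookup)
counter = fdb.Subspace(('series_seq',)).pack(())

@fdb.transactional
def get_or_create_series_id(tr, series_key):
    existing = tr[ids.pack((series_key,))]
    if existing.present():
        return fdb.tuple.unpack(existing)[0]
    # Allocate the next id; a single counter key is fine for a sketch, but it
    # will conflict under concurrent series creation.
    current = tr[counter]
    next_id = fdb.tuple.unpack(current)[0] + 1 if current.present() else 1
    tr[counter] = fdb.tuple.pack((next_id,))
    tr[ids.pack((series_key,))] = fdb.tuple.pack((next_id,))
    tr[names.pack((next_id,))] = fdb.tuple.pack((series_key,))
    return next_id

series_id = get_or_create_series_id(db, 'cpu.host42.user')
```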

I was checking here primarily whether there are any in-memory data structures that grow linearly (and have non-trivial cost) as the number of key-values grows.