I have read many posts on how to count the keys in a key range, including the Java code from 2018 that used offsets to skip through key values. After reading more of the documentation, that does not seem like the right way to do it.
We are using the Rust bindings, foundationdb-rs.
The plan now is to simply set a LIMIT (say 100,000), fetch the first chunk, and record its first and last keys. Then use the last key of each result to start the next chunk. With get_streams we can simply loop through the keys and aggregate the counts. Using a LIMIT keeps each transaction under the 5s timeout, and with threads we can have several chunks being read simultaneously. A sketch of the loop follows below.
We do not use the Record Layer; this is direct FDB access using our own keys.
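Roughly, here is the loop we have in mind. This is only a sketch using the crate's get_ranges streaming read (which is what I meant by get_streams above); the helper name count_range and the chunk size are just illustrative, and it assumes the client network has already been initialized (e.g. via foundationdb::boot()) with an open Database:

```rust
use foundationdb::{Database, FdbError, KeySelector, RangeOption};
use futures::TryStreamExt;

const LIMIT: usize = 100_000;

/// Count the keys in [begin, end) in LIMIT-sized chunks, one transaction
/// per chunk so no single transaction runs anywhere near the 5s limit.
async fn count_range(db: &Database, begin: Vec<u8>, end: Vec<u8>) -> Result<u64, FdbError> {
    let mut total = 0u64;
    let mut cursor = begin;
    loop {
        let trx = db.create_trx()?;
        let mut opt = RangeOption::from((
            KeySelector::first_greater_or_equal(cursor.clone()),
            KeySelector::first_greater_or_equal(end.clone()),
        ));
        opt.limit = Some(LIMIT); // caps the rows returned by this chunk

        // Stream the chunk's batches, counting rows and remembering the
        // last key seen so the next chunk can resume after it.
        let mut seen = 0usize;
        let mut stream = trx.get_ranges(opt, true); // snapshot read
        while let Some(batch) = stream.try_next().await? {
            seen += batch.len();
            if let Some(kv) = batch.last() {
                cursor = kv.key().to_vec();
            }
        }

        total += seen as u64;
        if seen < LIMIT {
            return Ok(total); // range exhausted
        }
        cursor.push(0x00); // next possible key strictly after the last one
    }
}
```

One transaction per chunk keeps every read short, and the snapshot flag avoids taking read-conflict ranges on the whole scan.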
Yes, that will work. Note that you'll be downloading the entire key-value pairs, not only the count. Also note that you may miss KVs if another process is writing to a range you have already scanned.
Since you mentioned parallelism, I would also recommend checking out "get range split points". It will provide you with a list of boundary keys such that each sub-range is roughly the requested size (the chunk size should be over 3-5 MB). For example, if you ask for [A, B), it will give you the following array:
A
A00149
A14934
A35831
A48571
B
Then you can use these to create multiple ranges, such as A-A00149, A00149-A14934, …, A48571-B.
This is especially useful if the keys are not distributed evenly.
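In the Rust bindings that could look roughly like the following. This is a sketch, not a definitive implementation: split_ranges is a made-up helper name, and I'm assuming each returned boundary key dereferences to its raw bytes (check your bindings version for the exact return types):

```rust
use foundationdb::{Database, FdbError};

/// Turn the boundary keys for [begin, end) into a list of sub-ranges
/// of roughly `chunk_size` bytes each.
async fn split_ranges(
    db: &Database,
    begin: &[u8],
    end: &[u8],
    chunk_size: i64,
) -> Result<Vec<(Vec<u8>, Vec<u8>)>, FdbError> {
    let trx = db.create_trx()?;
    // Boundary keys for [begin, end), including both endpoints.
    let boundaries = trx.get_range_split_points(begin, end, chunk_size).await?;
    let keys: Vec<Vec<u8>> = boundaries.iter().map(|k| k.to_vec()).collect();
    // Pair consecutive boundaries: [A, A00149), [A00149, A14934), ...
    Ok(keys
        .windows(2)
        .map(|w| (w[0].clone(), w[1].clone()))
        .collect())
}
```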
get_range_split_points is indeed really useful for getting a list of boundary keys, and it is one of the latest methods I added to the Rust bindings. There is also get_estimated_range_size_bytes, which can be useful.
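For example, a small sketch of the latter (estimate_size is just an illustrative wrapper):

```rust
use foundationdb::{Database, FdbError};

/// Rough on-disk size of [begin, end), e.g. to pick a chunk_size for
/// get_range_split_points. The value comes from storage-server sampling,
/// so treat it as approximate by design.
async fn estimate_size(db: &Database, begin: &[u8], end: &[u8]) -> Result<i64, FdbError> {
    let trx = db.create_trx()?;
    trx.get_estimated_range_size_bytes(begin, end).await
}
```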
In my company, when we know that we will need to count bytes, records, or rows, we wrap some statistics logic in a dedicated subspace. The statistics are maintained through atomic operations, much like the Record Layer does.
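A minimal sketch of the idea (the stats key name is made up): every write to the data subspace also bumps a counter key with an atomic ADD, so the count is maintained without read-modify-write conflicts.

```rust
use foundationdb::{options::MutationType, Database, FdbResult};

async fn insert_with_count(db: &Database, key: &[u8], value: &[u8]) -> FdbResult<()> {
    let trx = db.create_trx()?;
    trx.set(key, value);
    // ADD interprets the operand as a little-endian integer and is
    // conflict-free, unlike reading, incrementing, and writing back.
    trx.atomic_op(b"stats/record_count", &1i64.to_le_bytes(), MutationType::Add);
    trx.commit().await?;
    Ok(())
}
```

Reading the count back is then a single get of the counter key, decoded as a little-endian i64.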
Great to know. I see that I am on version 0.8.0, and it looks like this was added in 0.10.0; if the later version works with our version of FoundationDB, I will upgrade. We need the count for exactly this purpose: to split a set of keys for parallel processing. We will try this.
If your KV pairs are on average the same size (or you are using size as your measure), and you do not have an exact batch-size requirement, I'd recommend using the output of get_range_split_points directly. With small datasets, the small amount of unevenness will not matter; with large datasets, it will average out to roughly equal.
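To tie the thread together, here is a rough sketch of the fan-out, assuming a tokio runtime plus the illustrative split_ranges and count_range helpers sketched earlier (anyhow is used only to keep the error types short):

```rust
use std::sync::Arc;
use foundationdb::Database;

async fn parallel_count(db: Arc<Database>, begin: &[u8], end: &[u8]) -> anyhow::Result<u64> {
    // ~5 MB chunks, per the sizing advice above.
    let ranges = split_ranges(&db, begin, end, 5_000_000).await?;
    let mut tasks = Vec::new();
    for (lo, hi) in ranges {
        let db = db.clone();
        // One task per sub-range; each task runs its own short transactions.
        tasks.push(tokio::spawn(async move { count_range(&db, lo, hi).await }));
    }
    let mut total = 0u64;
    for task in tasks {
        total += task.await??; // first ? is the join error, second the FdbError
    }
    Ok(total)
}
```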