Getting the number of key/value pairs

(Michael) #1

Sorry if this is something obvious but i have been looking for a way to quickly get the total number of keys / values in the database. I know status shows total size and utilization of the database but is there a simple way to get the total number of key value pairs?

Limiting the cardinality of a key range
Missing API for getting just the count of a key range?
(Evan Tschannen) #2

I will start by saying we do not have an easy way of getting this information.

To get an exact number, your best bet is to count the number of key value pairs as you insert and delete them using atomic operations. You may want to have multiple keys that are counting the key value pairs in different ranges to avoid having one really hot key in your database.

For an estimate, your best bet is probably to use the locality API to get shard boundaries You could read a few key value pairs from each shard to estimate the average size of a key value pair. You could then divide the size estimate by the estimated average shard size. I think we have a simple python script that implements this approach that I might be able to track down.

(Michael) #3

Thanks for the advice, ill look at doing that for exact counts moving forward. We are still testing so exact counts are not yet required. I was able to get an accurate enough count using the shard boundaries given knowledge of how the keys were generated.

(Alec Grieser) #4

I will second the first part of Evan’s response, which is that keeping track of the count using our atomic mutation API is probably your best bet in the long run, as it will be much faster to query that information (and more accurate). We make use of them a little in our multi-map example, but you can transactionally update the “count” key while adding keys and values to your data (and subtract it when you remove things), and that should work pretty well. As Evan points out, you may need to have multiple keys that you spread the load around on (and read from all of them and sum), but my guess is that even if you need to do that, that will wind up serving you better in the long run.

(Christophe Chevalier) #5

If you really need to brute force your way in counting the number of keys and don’t need an exact result (one example is a tool that can inspect the content of any database), then there is a way to do it using key selectors and offset, THOUGH THIS CAN BE VERY SLOW IN SOME CASES!

You can find an example of this here. Basically, it does a GetKey(CURSOR + offset) using a pre-defined “window size” which is added as an offset. If the key returned is still in the range we want to count, we continue scanning. If it is outside, then we reach the final stage where we can simply count the remaning key/values themselves using a final GetRange.

The advantage is that it can count large chunks without having to download all keys/values over the wire, (except the very last chunk). The disadvantage is that the count will not be transactionally accurate: you need multiple transactions to count, and the content of the keyspace can change while you are counting it. (hence the name EstimateCountAsync in the method in the code linked above).

But there are some pathological cases that can make this VERY SLOW:

  • Ranges with a lot of very small keys, this can take a while to traverse because of the window size.
  • Ranges with large keys are also slow to traverse, because I suspect the database as to manually scan all the keys anyway.

The Key Selector resolution alogorithm used by the database does not like large offsets, and is probably O(N) internally. This means that you cannot go crazy with the window size, and will probably be stuck at a certain speed of keys counted per second anyway. Will not fly with millions or billions of keys.

Like explained above, if this is a required feature of your application or algorithm, you should bake the counting inside the Layer itself, using a dedicated counter key and atomic increments. If you want something generic for a tool, then you can use this technique if you know you are not in the situations above.

(David Scherer) #6

I’ll add that

(1) it would be possible for a storage engine to offer O(log N) counting and key selector offsets (this would require extensions to the IKeyValueStore interface. it would be more straightforward if we move to a MVCC interface to storage)

(2) if an application or layer needs these operations (or related ones like picking a key uniform randomly or determining the “rank” of a value in some index) to be fast for some specific data it can make them reasonably efficient via (somewhat tricky) data modeling. There used to be a released python layer RankedSet that demonstrated this; I can’t remember how good the implementation was. Basically the approach is to store, besides, the data, a number of increasingly sparse samples of the data (say, 1/100 of keys, then 1/10000 of keys, etc) and with each of these samples an accurate, atomically incremented counter of the number of keys between it and the next sampled key at the same level. When adding or removing data you update all these (with careful use of snapshot reads and manual key ranges to reduce conflicts), and when you want to count or offset from a particular point. You can “count” whatever metric or metrics you want rather than FoundationDB k/v pairs specifically.

Missing API for getting just the count of a key range?
(Sam Pullara) #7

Here is my brute force solution that is still not transactional of course:

(Amirouche) #8

Is there an implementation we can have a look at?

Thanks in advance!

(Wolf Dan) #9

I think he refers to this link (not sure tho)

(Amirouche) #10

That link is dead.

Here is another implementation of the rankedset for c#.

(Christophe Chevalier) #11

I tried searching for the original files as well, but was unable to find them on any of my drives (I had to delete a lot of stuff in 2015…). I found the same link as above but it’s dead and did not seem to get a copy of it :confused:

When I ported the code, there was not much in term of unit testing, and this is all that I got to work with:

So… yeah make with it what you want, but I don’t think this has been more than a proof of concept. Also it predates a lot of newer features, so there’s probably better ways to do this nowadays?