FoundationDB

Missing API for getting just the count of a key range?


(Cloudspeech) #1

I started looking into foundationDB more closely, sofar it’s quite exciting!

Unless I missed something (entirely possible), however, it seems the fdb API / bindings don’t
offer a way to just get a count of the number of keys in a key range, not the actual keys.
It seems I would have to iterate over all the keys to obtain that count, which will produce
network traffic proportional to their combined size, or not?

In the context of the ‘simple indices’ data modelling recipe, such counts could be used to report how many items are indexed under a given dimension, e.g. think e-commerce and number of products with given category tuple slot such as ‘fashion’.

Since keys are lexicographically ordered, counts would seem to
be efficiently computable server-side.

And since counts can be
a compact estimator for purposes of expected network bandwidth before transmitting
a large keyrange, or for query planning - e.g. start matching using the lowest-count criterion, then
intersect with the next one etc. - they seem to be broadly useful.

Is there a deep reason why there are unavailable? Could they be added cheaply?
Do others see the same need?

Any insights greatly appreciated!


(Richard Applebaum) #2

I think it depends on the Language you are using,

I’ve been playing around with Chris Lattner’s Python/Swift Interoperability Playground:

Chris admits that the syntax in currently a little bit kludgey, but is working on improvements. Here’s an example of priming the FDB Class Scheduling database:


(Jay Kominek) #3

In your example of categorized products, your layer on top of FDB could keep track of that count in its own key as other records are modified. Then all you’d need to fetch would be a single key.

You could extend that to infinite data types by binning them up algorithmically. Integers and reals could have the count stored in floor(value / 1000.0), or whatever else works.


(Alec Grieser) #4

So, the main reason to not have a “getRangeCount” API is that, as the storage servers are implemented, there isn’t actually a better way (at the storage server level) to count the number of keys in a range without just iterating over all of them and counting (essentially what the client has to implement now). I’m not sure how much I buy this argument myself, but in some sense, by not having the API, it means that we can discourage users who maybe don’t realize the cost of doing the operation from naïvely making that call and generating way more load on the cluster than they expect. (Our KeySelector API, if one gives it large offsets, has essentially the same problem, so the API comes with warning labels letting people know that it shouldn’t be used with large offsets.) Granted, this function could help reduce network traffic as the “count” function is now effectively pushed down, so maybe it’s worth doing (or maybe some more general server-side function functionality is actually what is needed).

Other arguments against it:

  • If you wanted to do this the “optimal” way–where each storage server knows how many nodes are between any two keys, then I think the only way to do it is to have each page keep track of how many nodes are in each child tree. This is additional space and bookkeeping that has to be updated with every write, which could be significant write amplification (essentially the same as a COW B-tree). (If you only wanted an estimate, the storage server could sample keys in a way similar to the way shard sizes are estimated, I suppose…)
  • Implementing a system that gives the right answer in the presence of outstanding uncommitted mutations within a transaction seems hard. (In other words, this operation might not support reading your writes.) You would either need to just ignore local writes, throw an error if a local write existed that was (or might be) in the counted range, or send all (relevant) mutations in the request so that the storage server could account for them when counting. It could also live outside the context of a transaction, but that would put it at odds with essentially all other parts of our API.

So, what’s the actually good way to implement counting the number of keys? Well, it doesn’t work for arbitrary ranges, but for doing things like counting the number of keys in ranges that you know you will want to query (e.g., indices so you can, as suggested, choose the most selective filter when query planning), you can use atomic operations! In particular, you will want to use the ADD mutation. So, for example, if you wanted to implement keeping track of the size of multiple indices, you could do this by keeping each index in its own subspace, and then using a separate subspace (with maybe one key per index) with one key each indicating the count of that index. Then whenever you update the index, you also (transactionally) issue an atomic add to that key.* Then when you need to get the size, it’s just a single-key read, which is pretty fast. (In this index example, you can actually get the sizes of all indices with a single range read, which is nice.)

I believe that this covers most of the shortcomings of a more general “getRangeCount” API:

  • Queries require looking at a single key, not (implicitly) running a scan over an arbitrarily large amount of data.
  • Write amplification and extra space is lower than if each page kept bookkeeping information about how many keys were in its children.
  • If there are local writes, the RYW cache will be able to combine local updates with a value read from the database to produce the correct result.

There are a few downsides:

  • At high volumes, it can create hot-keys on the counter keys. (This can be somewhat alleviated by splitting up the work to multiple keys. If you are write-heavy, then you want to load balance your writes and then read all keys. If you are read-heavy, then you want to load balance your reads and write to all. If you are somewhere in the middle, it’s possible you might need to have a lattice and have all writes go to a row in the lattice and all reads go to a column in the lattice.)
  • It is now on the application- or layer-developer to keep this bookkeeping around and in sync (including possibly building a count key from existing data).
  • It only supports getting the right answer for ranges you know (or think) you will need to count often. (You can get around this if you are okay with (once every so often) doing a big expensive scan to count something you didn’t expect to need to count and sometimes building a count from existing data if you were initially wrong about what needs counting.)

But even with the limitations, I think it probably covers most users’ needs.


* If you want to be strictly correct, you will need to check to see if the key exists before you write to the index, which you may or may not know without having to read the index entry. If you just want to be roughly correct, you can (probably) get away with just incrementing the count key every time you do a set to the index and decrementing it every time you do a (single-key) clear.


(Cloudspeech) #5

How is the range.count implemented underlyingly? That’s the key question. Since other language bindings use the C binding, as far as I understood, they would probably have no choice but to do enumeration of ranges, thus falling back to the problem I outlined. If you can demonstrate that is not the case for the Swift binding, I would be very interested, of course.


(Christophe Chevalier) #6

You can see an explanation in this thread => Getting the number of key/value pairs

.


(Cloudspeech) #7

Wow, thanks a lot for your in-depth answer!

I think I can live with range-counts-as-additional-data suggestion, since indeed the applications I have in mind exhibit ranges I know I will want to query.

As the main reason for the missing "getRangeCount” API you mention storage servers - are these synonymous with storage engines? If so, would the ongoing discussion for pluggable additional engines benefit from also considering efficient range counts as a criterion?


(Cloudspeech) #8

But this thread doesn’t mention Swift at all, and is about all keys in a data base, isn’t it?


(Christophe Chevalier) #9

Swift, Java, Python and .NET all use the same underlying C API, so the same algorithm apply on all of them. The Swift binding probably implements a similar solution than the others.

You can count a subset of the database by using subspaces. I think the JAVA example from @spullara at the bottom counts all keys in the db, while the .NET code can count inside subspaces, but this is trivial to add.


(Cloudspeech) #10

Before this gets out of hand due to misunderstandings accumulating - it has already been clarified earlier in this thread that there is no efficient API call in the underlying C API, for good reasons related to the design and architecture of FDB. So, no need to speculate further.


(Alec Grieser) #11

So, the storage server and storage engine aren’t quite synonymous. A storage server is a process that provides endpoints to allow for storing and retrieving key value ranges. All FDB reads go directly to the storage servers, and all writes eventually go to them, though the pipeline is a little more involved.

The storage engine is the underlying structure that a storage server sits on top of. I don’t think it would be too incorrect to think of a storage server as a process with a storage engine and a network interface. (The code for a storage server is in storageserver.actor.cpp. The storage engine sits behind an IKeyValueStore. Creating a new storage engine usually involves creating a new implementation of the IKeyValueStore interface and plumbing it through so that storage servers can actually use it.)

If being able to get key counts for arbitrary regions was super important, I could see the argument for making a new storage engine that kept that bookkeeping around. Of course, it would only actually be useful if the endpoint were somehow exposed. If one were already making a new storage engine, I could see the logic in modifying your design to be more accommodating to such queries. One would still have to grapple with the fact that there is write amplification on the order of a COW B-tree (at least, in the implementation I have in my head) if done trivially, which I think is somewhat fundamental to the problem. You could probably do something like keep a separate disk-backed datastructure that was something like a finite number of keys with the number of keys and values between signposts. (And also the RYW issue can’t be solved by a niftier storage engine.)


(Amirouche) #12

(The topic suggestion tool of this forum works very well, I started typing a question and I ended up here)

So, if understand correctly range count is not implemented because storage server would do the same as the client ie. fetch all keys with their associated value and count them. The workaround is that it’s up to the layer-application to create counts using atomic operations.

That would mean that instance when executing the following pseudo SQL query:

SELECT * FROM message WHERE 456 < created_at < 789 AND user=42

The query planner can not choose the optimal index that is on created_at because I can only precompute the total count of keys in that index and not the range count of the particular slice of keys I need to fetch. Both indices will have the same count by storing more data as metadata of the index subspace.

AFAIK one way to workaround that is to split the index of created_at into several sub-indices like with elasticsearch (!) which leads anyway to more “pressure” application layer side.

I am thinking out loud, but splitting manually is painful, maybe there is probabilistic-based solution for that problem that could automate splitings and querying even if the range count is an approximation.

Also, maybe in the above example query, a compound (created_at, user) index using a space filling curve like morton code can do the job.


(Amirouche) #13

This reply in another topic is helpful Getting the number of key/value pairs it propose to manually build statistic but automate it a little bit. Also it seems like it also useful for doing a random key lookup.


(Alec Grieser) #14

I think that there is a claim here that if you had an index on created_at, then you couldn’t use it to find the number of results between two arbitrary points without a more sophisticated data structure. That is correct. That reply you linked to points towards a more sophisticated data structure that you might implement to do so, though.

You could use a space filling curve to satisfy those two predicates in your query. If you had a regular one on (user, created_at), then it’s just a scan from (42, 457) to (42, 789), though, which is definitely easier. This only works if you need equality on the user ID and don’t have to do any range scans (which might be the case for some use cases).