So, the main reason not to have a “getRangeCount” API is that, as the storage servers are currently implemented, there isn’t actually a better way (at the storage server level) to count the number of keys in a range than to iterate over all of them and count (essentially what the client has to implement now). I’m not sure how much I buy this argument myself, but in some sense, by not having the API, we can discourage users who may not realize the cost of the operation from naïvely making that call and generating far more load on the cluster than they expect. (Our KeySelector API has essentially the same problem if one gives it large offsets, so it comes with warning labels telling people not to use it that way.) Granted, such a function could reduce network traffic, since the “count” would effectively be pushed down to the server, so maybe it’s worth doing (or maybe some more general server-side execution functionality is actually what is needed).
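For reference, the client-side workaround is just a streaming range read that counts as it goes. A quick sketch using the Python bindings (the API version and key names here are placeholders):

```python
import fdb

fdb.api_version(710)  # use the API version matching your cluster
db = fdb.open()

@fdb.transactional
def count_range(tr, begin, end):
    # Streams every key-value pair in [begin, end) back to the client just
    # to count them: exactly the load the missing API would otherwise hide.
    # Large ranges will also run into the five-second transaction limit.
    return sum(1 for _ in tr.get_range(begin, end,
                                       streaming_mode=fdb.StreamingMode.want_all))
```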
Other arguments against it:
- If you wanted to do this the “optimal” way, where each storage server knows how many keys lie between any two keys, then I think the only way to do it is for each B-tree page to keep track of how many keys are in each child subtree. That is additional space and bookkeeping that has to be updated on every write, which could mean significant write amplification (essentially the same as a COW B-tree). (If you only wanted an estimate, the storage server could sample keys in a way similar to how shard sizes are estimated, I suppose…)
- Implementing a system that gives the right answer in the presence of outstanding uncommitted mutations within a transaction seems hard. (In other words, this operation might not support reading your writes.) You would need to either ignore local writes, throw an error if a local write existed that was (or might be) in the counted range, or ship all (relevant) mutations with the request so that the storage server could account for them when counting. The operation could also live outside the context of a transaction, but that would put it at odds with essentially all other parts of our API.
So, what’s the actually good way to count the number of keys? Well, it doesn’t work for arbitrary ranges, but for ranges that you know you will want to query (e.g., indices, so you can, as suggested, choose the most selective filter when query planning), you can use atomic operations! In particular, you will want the ADD mutation. For example, to keep track of the sizes of multiple indices, you could keep each index in its own subspace and maintain a separate counter subspace with one key per index holding that index’s count. Then whenever you update the index, you also (transactionally) issue an atomic add to the corresponding counter key.* When you need the size, it’s just a single-key read, which is pretty fast. (In this index example, you can actually get the sizes of all indices with a single range read, which is nice.)
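For concreteness, here’s a minimal sketch of that pattern in the Python bindings (the subspace names, tuple layout, and API version are all just illustrative):

```python
import struct
import fdb

fdb.api_version(710)  # use the API version matching your cluster
db = fdb.open()

index = fdb.Subspace(('idx', 'by_zipcode'))  # hypothetical index subspace
counts = fdb.Subspace(('idx_counts',))       # one counter key per index

ONE = struct.pack('<q', 1)  # atomic ADD takes a little-endian integer

@fdb.transactional
def add_index_entry(tr, zipcode, primary_key):
    # Update the index and its counter in the same transaction.
    tr[index.pack((zipcode, primary_key))] = b''
    tr.add(counts.pack(('by_zipcode',)), ONE)

@fdb.transactional
def index_size(tr):
    # Reading the size is now a single-key read.
    raw = tr[counts.pack(('by_zipcode',))]
    return struct.unpack('<q', bytes(raw))[0] if raw.present() else 0
```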
I believe that this covers most of the shortcomings of a more general “getRangeCount” API:
- Queries require looking at a single key, not (implicitly) running a scan over an arbitrarily large amount of data.
- Write amplification and extra space are lower than if each page kept bookkeeping information about how many keys were in its children.
- If there are local writes, the read-your-writes (RYW) cache can combine the local atomic updates with the value read from the database to produce the correct result.
There are a few downsides:
- At high volumes, it can create hot keys on the counters. (This can be somewhat alleviated by splitting the counter across multiple keys; see the sketch after this list. If you are write-heavy, you want to load-balance your writes across the shards and have each read sum all of them. If you are read-heavy, you want to load-balance your reads and have each write update all of the keys. If you are somewhere in the middle, you might need a lattice, with each write going to a row of the lattice and each read going to a column.)
- It is now on the application or layer developer to keep this bookkeeping around and in sync (including possibly building a count key from existing data).
- It only gives the right answer for ranges you know (or suspect) you will need to count often. (You can get around this if you are okay with occasionally doing a big, expensive scan to count something you didn’t expect to need counted, and sometimes building a count key from existing data if you were initially wrong about what needs counting.)
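To illustrate the hot-key mitigation from the first bullet, here’s a write-heavy-oriented sketch (the shard count and subspace names are arbitrary): each write picks one shard at random, and each read sums all of them.

```python
import random
import struct
import fdb

fdb.api_version(710)  # use the API version matching your cluster
db = fdb.open()

NUM_SHARDS = 16  # a tuning knob, not a magic number
shards = fdb.Subspace(('idx_counts', 'by_zipcode'))

@fdb.transactional
def increment(tr, delta=1):
    # Write-heavy strategy: spread writes across shards so no one key is hot.
    shard = random.randrange(NUM_SHARDS)
    tr.add(shards.pack((shard,)), struct.pack('<q', delta))

@fdb.transactional
def read_count(tr):
    # The read side pays instead: sum every shard with one small range read.
    return sum(struct.unpack('<q', v)[0] for k, v in tr[shards.range()])
```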
But even with the limitations, I think it probably covers most users’ needs.
* If you want to be strictly correct, you will need to check whether the key already exists before you write to the index, which you may not know without reading the index entry. If you just want to be roughly correct, you can (probably) get away with incrementing the count key every time you set an index entry and decrementing it every time you do a (single-key) clear.
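A rough sketch of that strictly-correct variant (reusing the hypothetical subspaces from above):

```python
import struct
import fdb

fdb.api_version(710)  # use the API version matching your cluster
db = fdb.open()

index = fdb.Subspace(('idx', 'by_zipcode'))
counts = fdb.Subspace(('idx_counts',))

@fdb.transactional
def set_entry(tr, zipcode, primary_key):
    key = index.pack((zipcode, primary_key))
    # Only bump the counter if the entry is genuinely new, which costs
    # a read of the index entry first.
    if not tr[key].present():
        tr.add(counts.pack(('by_zipcode',)), struct.pack('<q', 1))
    tr[key] = b''

@fdb.transactional
def clear_entry(tr, zipcode, primary_key):
    key = index.pack((zipcode, primary_key))
    # Likewise, only decrement if there was actually something to clear.
    if tr[key].present():
        tr.add(counts.pack(('by_zipcode',)), struct.pack('<q', -1))
    del tr[key]
```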