Secondary indexing approaches

One of FoundationDB’s original claims to fame was the ability to use it as an SQL database (via FoundationDB’s acquisition of Akiban). Have there been any good writeups on strategies one can employ to efficiently layer indexes on top of FDB, or on top of key/value databases in general (TiDB and CockroachDB use similar approaches)?

I can see different ways of using a key/value store as an index. For example, one could index an integer column as a series of tuples with empty values:

  ["price", 100, "record345"] => ''
  ["price", 150, "record567"] => ''

Then to find all records with price = 100, one simply does a prefix search on price:100:, which effectively gets you all the record IDs in order.
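
To make the layout concrete, here is a minimal sketch of writing and querying such an index with the Python bindings and the tuple layer; the “price” prefix and record IDs are just illustrative, and error handling is omitted.

  # Sketch of the index shape above using the FDB tuple layer (Python bindings).
  import fdb

  fdb.api_version(630)
  db = fdb.open()


  @fdb.transactional
  def index_price(tr, record_id, price):
      # Empty value: all the information lives in the key.
      tr[fdb.tuple.pack(("price", price, record_id))] = b""


  @fdb.transactional
  def records_with_price(tr, price):
      # Prefix scan over ("price", price, *): record IDs come back in order.
      r = fdb.tuple.range(("price", price))
      return [fdb.tuple.unpack(kv.key)[-1] for kv in tr.get_range(r.start, r.stop)]


  index_price(db, "record345", 100)
  index_price(db, "record567", 150)
  print(records_with_price(db, 100))  # ['record345']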

This gets trickier if you want to combine multiple indexes in a query. For example, let’s say we have category and price:

  ["price", 100, "record345"] => ''
  ["price", 150, "record567"] => ''
  ["category", "appliances", "record345"] => ''
  ["category", "toys", "record876"] => ''

A query such as price = 100 and category = toys is harder. If we think there are fewer records matching the price, we can scan all the keys returned for the range price:100: and concurrently scan all the keys returned for the range category:toys:, and do a kind of sort-merge join to intersect the two streaming sets. The downside is that if the “left” side of this intersection is big, it can scan for a long time before hitting the first category:toys: record.

An optimization would be to first find the lowest and highest keys for each “column”: for price:100: and category:toys:, the lowest keys would be price:100:record345 and category:toys:record876 respectively. Since record876 > record345, record876 is the larger of the two minimums, and we can start both range scans there. We can stop at the lowest of the two high keys. However, this still performs badly if there’s little overlap between the two result sets. I imagine you could go further by collecting statistics about the correlations between column values.
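
Generalized, that seeking idea becomes a leapfrog-style intersection of two sorted ID streams. Here is a minimal in-memory sketch; against FDB, each “seek forward to >= target” would be a small range read starting at ("price", 100, target) or ("category", "toys", target) rather than a bisect over a list.

  # Leapfrog-style intersection of two sorted streams of record IDs (in-memory sketch).
  import bisect


  def intersect_sorted(a, b):
      """Intersect two ascending lists by repeatedly seeking the lagging side
      forward to the other side's current candidate."""
      out, i, j = [], 0, 0
      while i < len(a) and j < len(b):
          if a[i] == b[j]:
              out.append(a[i])
              i += 1
              j += 1
          elif a[i] < b[j]:
              # Seek a forward to the first element >= b[j] instead of stepping one by one.
              i = bisect.bisect_left(a, b[j], i)
          else:
              j = bisect.bisect_left(b, a[i], j)
      return out


  price_100 = ["record100", "record345", "record900"]
  category_toys = ["record345", "record876", "record900"]
  print(intersect_sorted(price_100, category_toys))  # ['record345', 'record900']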

Are there any good papers on strategies to do this better?

The jargon you are looking for to start your literature search is probably “index intersection query”. Maybe also “compressed bitmap indices”.

Of course, if you are designing for that specific query, you can just make a compound index on both fields: ["price_category", 100, "appliances", "record345"].

Thanks. I know a lot about database theory, but FDB introduces constraints that affect what kind of strategies you would use. For example, the fact that FDB is always remote means that every key/range access has a much higher penalty than, say, node-local B-tree indexes.

To get good performance, you need to have many concurrent reads in flight at the same time. Cardinality statistics will help you know how many reads to do in each range of keys that represents the index: a high-cardinality range would require lots of reads, a low-cardinality range just one. Split into non-overlapping chunks as appropriate. This is pretty easy since the primary keys at the end of the index key are already sorted for you within each index prefix.
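
As an illustration, here is a rough sketch (Python bindings) of reading one index prefix in several concurrent chunks; the hard-coded split points stand in for the cardinality statistics. For brevity each chunk reads via the Database handle in its own transaction; a real implementation would share a single read version so all chunks see a consistent snapshot.

  # Read the ("price", 100) index prefix in concurrent, non-overlapping chunks.
  from concurrent.futures import ThreadPoolExecutor

  import fdb

  fdb.api_version(630)
  db = fdb.open()

  # Hypothetical split points; in practice these come from stored statistics.
  prefix = fdb.tuple.range(("price", 100))
  splits = [
      prefix.start,
      fdb.tuple.pack(("price", 100, "record400000")),
      fdb.tuple.pack(("price", 100, "record800000")),
      prefix.stop,
  ]
  chunks = list(zip(splits, splits[1:]))


  def read_chunk(bounds):
      begin, end = bounds
      # Database-level read: each call runs in its own transaction.
      return [fdb.tuple.unpack(kv.key)[-1] for kv in db.get_range(begin, end)]


  with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
      record_ids = [rid for part in pool.map(read_chunk, chunks) for rid in part]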

I wrote a basic version just now in Node that does the actual calculation of the index intersection by assigning each primary key an offset for a bitmap using a counter and a hash table (an actual hash table, not a JavaScript object :slight_smile: ). Then I intersected them with Roaring Bitmap and output the result into a Uint32Array. The data itself was two large buffers of 16-byte keys that you could imagine representing keys in FoundationDB.

With two indexes, one with 1 million entries and the other with 500k, the entire process took ~500ms, or 333ns per processed index entry. A C++ implementation with codegen could do better, but an interpreted model would probably be worse due to cache misses. The majority of the time in this example is taken up by the hash table lookups (100ns to check a key, 200ns to write a key).
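
For reference, a rough Python analogue of that experiment might look like the following, assuming the pyroaring package for the Roaring bitmaps; the dict plus counter plays the role of the hash table handing out bitmap offsets.

  # Dictionary-encode 16-byte primary keys to dense integers, build one Roaring
  # bitmap per index, then intersect. Assumes the pyroaring package is installed.
  import os

  from pyroaring import BitMap

  next_id = 0
  offsets = {}  # primary key bytes -> dense integer offset


  def encode(pk: bytes) -> int:
      global next_id
      off = offsets.get(pk)
      if off is None:
          off = offsets[pk] = next_id
          next_id += 1
      return off


  # Stand-ins for the primary keys read back from two index ranges in FDB.
  common = [os.urandom(16) for _ in range(100_000)]
  price_keys = common + [os.urandom(16) for _ in range(900_000)]
  category_keys = common + [os.urandom(16) for _ in range(400_000)]

  price_bm = BitMap(encode(pk) for pk in price_keys)
  category_bm = BitMap(encode(pk) for pk in category_keys)

  matches = price_bm & category_bm  # index intersection
  print(len(matches))               # ~100_000 shared primary keys

  # Map the surviving offsets back to primary keys if needed.
  by_offset = {off: pk for pk, off in offsets.items()}
  result_pks = [by_offset[off] for off in matches]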

Reading 1.5M keys from FoundationDB would take quite a while, and you’d eventually start running up against the transaction size limit. The keys alone, without any overhead, for that specific example query would be 24 MB.

I’ve played with compressed bitmaps for indexing, but one issue I’ve had is that they assume that primary keys are sequential integers with no (or very few) gaps. This does not work well with scenarios where UUIDs (or even Versionstamps) are required.

I could add a “shadow” primary key that is an integer just for the purpose of indexing, but then I shift the problem to the write side: a sequential global id creates potential conflicts, and it also means you have to successfully talk to the db before knowing the id of new items, while UUIDs can be generated in advance.

Are there any good strategies to still get the benefits of compressed bitmaps while keeping (random) UUIDs or very sparse sequence numbers like Versionstamps? Right now I’ve had to resort to doing a combination of merge sorts on parallel index range reads, finishing up with an optional filter on the raw documents for non-indexed fields, which is a bit wasteful in terms of network bandwidth.

If you only need it in the context of a single query, you can just build the bitmap at runtime. You can dictionary encode keys using a hash table and use an integer counter variable to hand out indexes in the bitmap.

This can be parallelized by reading non-overlapping ranges of keys. The resulting bitmaps are not comparable to each other in that case though without extra work.

If you have a reasonable number of “root” entities that only compare and join to their children, you could create and update an individual counter for each of the root entities. Then you would be able to create the bitmap without the dictionary-encoding step for each root entity.

One approach to using compressed bitmaps to index mutable collections (over FDB or elsewhere) is to keep two index partitions: one compressed bitmap and one “standard” (value, pk) secondary index, such that each row is in exactly one of them. Insertions and updates go into the standard index, a background process rebuilds the bitmaps “slowly” so as to keep the standard index small, and reads look at both. (You might or might not actually do an index intersection query for the second part of a read; you also have the option of doing a nested loop intersection where you just scan the smallest index and look up the rows in question.) Either way you are only reading a fraction of rows from the (expensive to read) standard index and only updating the (expensive to write) compressed bitmap index at a fraction of your actual write rate. And you only need to assign the bitmap index IDs as you rebuild the bitmaps, so the concurrency problems with that are minimal.
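
A rough sketch of the read path for that scheme, where load_bitmap, scan_delta_index, and fetch_row are hypothetical helpers standing in for the corresponding FDB reads, and id_to_pk maps bitmap offsets back to primary keys:

  # Read path for the two-partition scheme: the bulk of the answer comes from the
  # rebuilt compressed bitmaps, recent writes from the small (value, pk) index.
  from functools import reduce


  def query_eq_and(db, conditions, id_to_pk):
      """conditions: non-empty list of (column, value) pairs ANDed together."""
      # 1) Bitmap partition: intersect the bitmaps built by the background job.
      bitmaps = [load_bitmap(db, col, val) for col, val in conditions]  # hypothetical helper
      hits = reduce(lambda a, b: a & b, bitmaps)
      result = {id_to_pk[offset] for offset in hits}

      # 2) Delta partition: nested-loop style, scan the (value, pk) index for one
      #    condition and check the remaining conditions on the fetched rows.
      first_col, first_val = conditions[0]
      for pk in scan_delta_index(db, first_col, first_val):  # hypothetical helper
          row = fetch_row(db, pk)  # hypothetical helper
          if all(row.get(col) == val for col, val in conditions[1:]):
              result.add(pk)
      return result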

Another approach is to try to update the compressed bitmaps in place. As Christophe says, you need to have an efficient way of allocating reasonably compact identifiers without causing transaction conflicts. The directory layer’s allocator doesn’t meet all the requirements but I think that something that does is doable. If this turns out to be an important approach, it would probably be worth adding some atomic operations for updating compressed bitmap blocks.

Whether you are doing this or something less sophisticated, of course you need to make sure you keep parallel queries outstanding to hide network latency.