One of FoundationDB’s original claims to fame was the ability to use it as an SQL database (via FoundationDB’s acquisition of Akiban). Have there been any good writeups on strategies one can employ to efficiently layer indexes on top of FDB, or on top of key/value databases in general (TiDB and CockroachDB use similar approaches)?
I can see different ways of using a key/value store as an index. For example, one could index an integer column as a series of tuples with empty values:
["price", 100, "record345"] => ''
["price", 150, "record567"] => ''
Then to find all records with price = 100, one simply does a prefix search on price:100:, which effectively gets you all the matching record IDs in order.
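As a minimal sketch of that encoding and scan, here is a toy Python version that uses a sorted list of plain tuples as a stand-in for the ordered key space (Python’s tuple comparison gives the same lexicographic order that FDB’s tuple layer preserves when packing tuples into byte strings); prefix_scan is a made-up helper modeling a range read:

    import bisect

    # Toy ordered key space: a sorted list of tuples standing in for the
    # packed index keys. The values are all empty, so only keys are kept.
    index = sorted([
        ("price", 100, "record345"),
        ("price", 150, "record567"),
    ])

    def prefix_scan(keys, prefix):
        """Yield every key beginning with `prefix`, in order (a range read)."""
        start = bisect.bisect_left(keys, prefix)
        for key in keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            yield key

    # All record IDs with price = 100, in index order:
    print([key[-1] for key in prefix_scan(index, ("price", 100))])
    # -> ['record345']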
This gets trickier if you want to combine multiple indexes in a query. For example, let’s say we have category and price:
["price", 100, "record345"] => ''
["price", 150, "record567"] => ''
["category", "appliances", "record345"] => ''
["category", "toys", "record876"] => ''
Now a query such as price = 100 and category = toys requires combining the two indexes. If we think there are fewer records matching the price, we can scan all the keys returned for the range price:100: and concurrently scan all the keys returned for the range category:toys:, and do a kind of sort-merge join to intersect the two streaming sets. The downside is that if the “left” side of this intersection is big, it can scan for a long time before hitting the first category:toys: record.
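A sketch of that streaming intersection, assuming each side is an iterator over record IDs in ascending order (i.e. the last element of each key from the two range scans); the ID streams in the usage example are hypothetical:

    def intersect_sorted(left, right):
        """Merge-intersect two ascending streams of record IDs.
        Each iterator is advanced at most one step at a time, so memory
        stays constant no matter how large the scanned ranges are."""
        left, right = iter(left), iter(right)
        l, r = next(left, None), next(right, None)
        while l is not None and r is not None:
            if l == r:
                yield l
                l, r = next(left, None), next(right, None)
            elif l < r:
                l = next(left, None)   # left side lags: step it forward
            else:
                r = next(right, None)  # right side lags: step it forward

    # Hypothetical ID streams produced by two concurrent range scans:
    price_ids = iter(["record100", "record345", "record900"])
    category_ids = iter(["record345", "record500", "record900"])
    print(list(intersect_sorted(price_ids, category_ids)))
    # -> ['record345', 'record900']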
An optimization would be to first find the lowest and highest keys for each “column” range: the lowest key in the price:100: range is price:100:record345, and the lowest in the category:toys: range is category:toys:record876. Since record876 > record345, record876 is the larger of the two minimum record IDs, so we can start both range scans there; likewise, we can stop at the lowest of the two high keys. However, this search still performs badly if there’s little overlap between the two result sets. I imagine you can go further in optimizing this by collecting statistics about the correlations between column values.
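Applied repeatedly rather than only once at the start, this seeking strategy is essentially a zig-zag merge join; Veldhuizen’s “Leapfrog Triejoin” paper (ICDT 2014) generalizes it to n-way intersections with worst-case-optimal guarantees. A rough sketch over sorted ID lists, where seek stands in for a KV range read starting at a given key:

    import bisect

    def seek(ids, target):
        """First id >= target in a sorted list, or None (models a KV seek)."""
        i = bisect.bisect_left(ids, target)
        return ids[i] if i < len(ids) else None

    def zigzag_intersect(a_ids, b_ids):
        """Intersect two sorted ID lists by having the lagging side seek
        directly to the leading side's current key, so long stretches with
        no overlap cost one seek instead of being scanned key by key."""
        if not a_ids or not b_ids:
            return
        a, b = a_ids[0], b_ids[0]
        while a is not None and b is not None:
            if a == b:
                yield a
                # Step past the match: a + "\x00" is the smallest string
                # greater than a (mirroring FDB's key successor), then
                # re-align the other side to the new position.
                a = seek(a_ids, a + "\x00")
                if a is not None:
                    b = seek(b_ids, a)
            elif a < b:
                a = seek(a_ids, b)
            else:
                b = seek(b_ids, a)

    # Hypothetical ID lists from the two index ranges:
    a = ["record100", "record345", "record900"]
    b = ["record345", "record500", "record900"]
    print(list(zigzag_intersect(a, b)))
    # -> ['record345', 'record900']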
Are there any good papers on strategies to do this better?