FoundationDB

Any chance of server-side Boolean operations like intersection?


(Cloudspeech) #1

http://tylerstroud.com/2014/11/18/storing-and-querying-objects-in-redis/ is a short treatise on the query expressivity of Redis, a popular K-V store.

For my purposes here, it convincingly argues for server-side Boolean operations over sorted keyrange results to allow efficient querying by multiple criteria.

Intersection is probably the most important such operation, followed by union and set difference.

A query paraphrased as “Give me all products under the Men’s trousers subsection of the category index and/intersected with the price index ranging from 30 - 100 Euro” might serve as a made-up e-commerce example to illustrate the utility of the idea.

If we follow the ‘simple indices’ data modelling pattern of FDB, with keys like

index/criterion1/object_id_1 = ‘’

index/criterionN/object_id_k = ‘’

it seems we want some kind of intersectingIterator for the most important Boolean operation. Arguments would be an array of RangeResults and some way to specify a tuple position where the key subpart to use for set intersection can be found (e.g. the last position in the sketched index keys above).

Or we make intersectingIterator self-composable: intersectingIterator(RangeResult, tuplePos) -> RangeResult.

The argument for doing this server-side would be to drastically reduce the number of intermediate results that need to be transmitted otherwise, i.e. when doing Boolean operations client-side.

Underlyingly, something like https://roaringbitmap.org/ might be used to implement this efficiently.

What do people think - is this truly needed, can it be efficiently emulated with the existing API, if not, what are the chances this will be added?


(Cloudspeech) #2

Addendum: Apache Accumulo turns out to have an ‘intersecting iterator’ with much of the same motivation, described here: https://accumulo.apache.org/docs/2.0/getting-started/table_design.


(Richard Applebaum) #3

Doesn’t the fact that the FDB database and any indexes that you create are ordered k/v pairs, largely mitigate the issue you describe?

As to the boolean operations, couldn’t they be implemented as FDB layers on the underlying db & indexes, to provide the server-side efficiency you desire?


(Christophe Chevalier) #4

Here is an experimental layer that uses compressed bitmap for indexes:

I think it uses some variations of a Word-Aligned Hybrid bitmap encoding to represent indexes. It hasn’t been tested in production. And looking back at the code it is not very efficient and could probably be optimized a lot!

The bitmap format is documented here but any other encoding (CONCISE, Roaring, …) would probably work the same.

We also use a more traditional indexing solution like describe above, and use Merge Sort on multiple GetRange to perform union or intersection (easy to do with an already ordered k/v store). It works well but we only have up to a few tens of thousands of documents in the collections, so I’m not sure it would scale to more (millions or billions) because you would probably have to read too many data.

There are some gotchas when doing merge sort, when you do queries on range of values: (VALUE, ID) yields a list of sorted ids for a single value (ex: VALUE=42) but not for a range of values (40 <= VALUE < 50), so you will have to do an intermediate sort on the complete range read, before piping the results into a Merge Sort. This can be the limiting factor for a large number of documents.

Compressed bitmaps can solve this to some degree… but I don’t have enough experience with them to tell you.


(Alec Grieser) #5

I think set intersection across multiple keys is probably outside the scope of the FDB key-value store, though it could very well be done (perhaps using a bit map, perhaps doing a merge operation across sorted lists) by a layer. Some of the other answers on this thread go into that a little more, but it seems to me like you would be heavily influenced by things like your data model when implementing that kind of thing, which is after the end of where the key-value store’s API.

I’m fairly confident that this could be implemented in a way that is efficient enough for many use cases, but it would look something like issue iterators for the different ranges you want to combine, and then perform the combination (intersection, union, set difference, etc.) locally.

If you needed something fancier, a not-so-unreasonable way to implement this might be some kind of generic server-side code execution API, though adding that would be quite a bit of work. It also wouldn’t help you terribly much in the case that two key ranges exist on different shards. (For example, Redis, to my knowledge, requires that all data used by a single sever-side Lua script evaluation, live on a single host.) So I think you would want something that, for example, executes the join on the server if all of the data are on a single machine and executes it locally if some are on one server and some on another. You could get most of the way there by using the same execution engine locally as is used on the fdbserver instance, but some of the details get a little involved. (It might also require falling back to doing things locally in the presence of outstanding writes in your transaction if you want to maintain RYW.)


(Christophe Chevalier) #6

One thing that could help and be somewhat easier to implement, is some logic that would allow get_range queries to only return part of a key and/or value, but using some conditional logic. Maybe even filter out the result if the part does not match some expected byte sequence.

One use case would be to allow the equivalent of a TABLE SCAN to do ad-hoc indexing on documents stored with some sort of binary encoding (JSON, protobuf, JSONB, etc…). Currently, the application must fetch the entire document, parse it and check a few bytes from the “City” field (for example), to then discard the document if it does not match. Repeat this a few thousand/million times, and the query will timeout.

If the get_range query can extract only part of the document, then we are returning less data on the on wire, and if it is able to also filter out results based on a comparison with a byte literal, we could probably make it even faster.

Some encodings are maybe to complex to express, like parsing text JSON is probably too cpu intensive on a range query, but some binary formats (I’m thinking of Postgresql’s JSONB) allow fast random access to fields without parsing the entire document and could be a good match for this. There is also other formats like how most SQL databases lay out records on the disk, that are very easy to consume (most fields, except for varchars or nullable entries, are at a fixed offset from the start of the row).

Now the question is how to express such query?

One extreme solution would be to teach the storage nodes about JSON, protobuf, JSONB, Cap’n Proto, MessagePack, and so on…

  • This would be a huge task and there will always be new formats and variants that people will want to support.
  • For the keys, the Tuple encoding seems to be frequently used, so this would be a safe bet, but nothing prevents you from using a different binary format which may use variable-length encoding.

Another extreme solution would be to have some sort of virtual machine that can run bytecode and execute complex logic on the storage nodes.

  • To avoid reinventing the wheel, we could reuse something like luaJIT, or even more complex JIT like .NET Core/RyuJit or any embeddable JVM ?
  • I’m not convinced this could be done easily, even for the fact that fdb process are single-threaded and having foreing user code running on this thread could completely stall the process (a simple “while(true) { …}” with a borked exit condition would deadlock the process).

I’m not sure what would be the best solution to, at the same time, reduce the number of bytes sent over the wire for large reads, and also project some business logic closer to the data itself without impacting the safety of the storage nodes and deadlocking the single thread. Maybe having dedicated threads to execute user “stored procedures” outside the main event loop, but this would probably break a lot of invariants that the current code relies on (serialized execution, no locks, etc…)

Are there any similar features that have been implemented into other distributed database? I’m sure they would be faced with similar issues when executing foreign code without the database storage server itself, and had to take some measures to protect the integrity of data?


(Cloudspeech) #7

I don’t see a logical connection between set cardinality and set orderedness. Besides, search engines and databases have use ordered data structures for intersection and related operations for decades, and still this is one of the dominant factors affecting performance.

As to the second suggestion, I guess I don’t quite follow: those layers are not server-side, as far as I can see, so the network bandwidth issue remains unsolved, or not?


(Cloudspeech) #8

Thanks for all the other explanations, pointers to implemented layers and suggestions here - a lot to think about!

A gut reaction: if layers are an unavoidable part of an efficient solution here, but one also desires the nice generality of server-side programming a la Apache Accumulo - I for one would strongly prefer to see LuaJIT here - why not try to have both? Run additional client(s) in each machine of each FDB cluster, implement ‘server-side’ programming-as-layers there while profiting from data locality and machine-local network speed (e.g. Unix domain sockets), and use intra-cluster messaging between those clients to coordinate who does what. It’s nothing but a hazy sketch at this point, of course.


(Alec Grieser) #9

I think there are a few more details on what such a scheme would look like in this thread: Coprocessors or modules

It also sounds like as a constructive proof that you could do that that Wavefront already have such a system set up. It is probably the easiest way to do something like what you’ve described without a new API to push down operations.