Basically, to implement many fancy, fast features, you need something like coprocessors. They need to run on the server so you reduce the amount of data transferred between client and server. Since the db is a foundation, it needs functions that run locally on the data, and to be fast those functions need to be in plain C/C++.
This could then be used for any feature: layers, auto TTL, HyperLogLog, everything.
I think some of the primitives you need to roll this yourself are in place, though piecing them together might not be easy.
The language bindings have a locality API that allows you to discover the current network addresses of the storage servers responsible for storing a particular key (and another method that lets you find shard boundaries).
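For concreteness, here's roughly what that looks like in the Python bindings (the two methods I mean are fdb.locality.get_boundary_keys and fdb.locality.get_addresses_for_key):

```python
import fdb

fdb.api_version(710)
db = fdb.open()

# Shard boundaries over a key range: each yielded key starts a new shard.
boundaries = list(fdb.locality.get_boundary_keys(db, b'', b'\xff'))

@fdb.transactional
def addresses_for(tr, key):
    # Network addresses of the storage servers currently holding this key.
    return fdb.locality.get_addresses_for_key(tr, key).wait()

print(boundaries[:5])
print(addresses_for(db, b'some-key'))  # e.g. ['10.0.0.1:4500', ...]
```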
Given that, you could physically colocate client processes with servers, and implement your own routing to send expensive queries to the right machines. You’d have to deal with the fact that data can (and frequently does) move around. It wouldn’t be the easiest thing, but it’s definitely possible.
I could see a world in which some kinds of predicates or filters could be pushed down through the API itself and executed at the server or even in the storage engine, but I suspect that’s a long way off.
The originally envisioned way to do this is to co-locate layers on the same machines as fdbserver, but running in separate processes and using the FDB client as usual. The layer then uses the locality API to determine where data is located and direct requests to its own processes in the appropriate place. If it doesn’t get that exactly right it will hurt performance a bit but the semantics will be fine.
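A minimal sketch of that routing decision in Python, where handle_locally and forward_to_peer are hypothetical stand-ins for whatever RPC the layer uses between its own processes (and the local-IP detection is simplified):

```python
import socket
import fdb

fdb.api_version(710)
db = fdb.open()

# Simplified: in practice you'd enumerate all local interface addresses.
LOCAL_IPS = {socket.gethostbyname(socket.gethostname()), '127.0.0.1'}

@fdb.transactional
def is_key_local(tr, key):
    # A key is "local" if one of the storage servers holding its shard
    # runs on this machine. Addresses look like '10.0.0.1:4500'.
    addrs = fdb.locality.get_addresses_for_key(tr, key).wait()
    for a in addrs:
        a = a.decode() if isinstance(a, bytes) else a
        if a.split(':')[0] in LOCAL_IPS:
            return True
    return False

def route(key, request):
    # Hypothetical dispatch between the layer's own processes.
    if is_key_local(db, key):
        return handle_locally(request)
    return forward_to_peer(key, request)
```

Since data moves around, a local answer is only a hint; getting it wrong costs performance, not correctness, exactly as described above.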
fdb’s client can then be optimized to use shared memory rather than network communication in this scenario. It should be just as fast as compiling coprocessors into fdbserver, and you get all the benefits of process (and container) isolation and keep abstraction layers intact.
I’m open to discussion of other approaches, though. After many years of FDB usage I’m not aware of anyone actually doing this in production, so maybe there is something about it that is too hard. Maybe we just need a nice OSS framework for this architecture?
See TiKV (Rust, roughly an FDB clone) and TiDB (Go, MySQL on top). They have coprocessors at both layers.
A SELECT COUNT(*) can be split into multiple COUNT(*) queries, one per range, each computed locally with no data transfer, serialization, or deserialization between processes.
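A sketch of that split-and-aggregate shape in Python, using the shard boundaries from the locality API. Here the partial counts run in-process just to show the structure; a real layer would dispatch each span to the worker colocated with that shard, and large spans would also need chunking to stay under the 5-second transaction limit:

```python
import fdb

fdb.api_version(710)
db = fdb.open()

def split_at_shards(db, begin, end):
    # Cut [begin, end) at shard boundaries so each piece lives on one shard.
    points = [begin] + list(fdb.locality.get_boundary_keys(db, begin, end)) + [end]
    return [(lo, hi) for lo, hi in zip(points, points[1:]) if lo < hi]

@fdb.transactional
def count_span(tr, lo, hi):
    # In a real layer this runs inside the worker colocated with the shard.
    return sum(1 for _ in tr.get_range(lo, hi))

def count(db, begin, end):
    # The cheap top-level aggregation: sum the per-shard partial counts.
    return sum(count_span(db, lo, hi) for lo, hi in split_at_shards(db, begin, end))

print(count(db, b'', b'\xff'))
```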
In the end you'll probably need both: a function on the local data to compute as much as possible locally, and a function on top to aggregate. Aggregation can be CPU/RAM intensive, so you may not want a single fdb process (single core) to become a bottleneck when combining results from multiple processes.
Examples: TiDB (aggregation) over TiKV (local data + function); Elasticsearch (data nodes query local data, query nodes aggregate across all data nodes); Facebook's feed uses the same design with leaf/aggregator nodes (a big perf boost from node specialization); MemSQL has leaf/aggregator nodes; etc.
Maybe you could implement functions in the C client to act as the aggregator, but you still need the local functions.
Your examples should be implementable very efficiently with the architecture I described. And you can write them in any language and threading model rather than having to conform to the relatively spartan environment of fdbserver.
We are essentially doing coprocessors today with the locality API: run a process on the same machine, query for shards (that's relatively fast), query storage server IPs for each shard and filter to those on the same machine (that's slow for very large clusters), then, since there's replication, pick a machine with some deterministic logic so that each shard has exactly one processor working on it, and do the work.
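The deterministic pick can be as simple as rendezvous hashing over a shard's replicas; a sketch, assuming shard keys and addresses are byte strings:

```python
import hashlib

def owner_for_shard(shard_begin_key, replica_addresses):
    # Rendezvous hashing: every worker ranks a shard's replicas by
    # hash(shard key + address) and the smallest wins, so all workers
    # agree on exactly one owner per shard with no coordination
    # (assuming they see the same shard map).
    return min(
        replica_addresses,
        key=lambda addr: hashlib.sha1(shard_begin_key + addr).digest(),
    )

# e.g. owner_for_shard(b'users/0042', [b'10.0.0.1:4500', b'10.0.0.2:4500'])
```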
Probably not. Hooks into writes might be helpful, but for us the majority of the efficiency gain is in colocating the process on the same machine.
Well, there's just not an easy way to get storage server IPs for shards in a single iteration, so you end up doing it repeatedly (and in a controlled fashion so as not to nuke the stateless processes).
If you wouldn’t mind dumping a bit more information into an issue on GitHub, that’d be great. This sounds like the sort of thing anyone would run into when trying to do a co-located process for a layer, and thus should probably be addressed sometime.