Hello! Thought I’d share something I’ve been hacking on: to get my feet wet with FDB, I made a layer that implements the bones of the ZooKeeper 3.4.6 API: https://github.com/pH14/fdb-zk
I can’t say I’d recommend using it yet, but I’m soliciting feedback for next steps / design choices, and of course, contributors are welcome
The ideal goal is to offer a drop-in replacement for ZooKeeper on top of FDB that can support much higher read/write volume and data size than ZK. ZK’s API is relatively simple, so it seemed like a good opportunity to show off FDB’s flexibility, and for me to learn a bit more about each system. I think it’d be valuable if you could spin up FDB, fire up Curator, and voila: scalable lock server. Or further lean into layers, and be able to replace multiple systems, like both Mongo and that-ZK-cluster-you-need-for-other-system-X, with just FDB.
That said, let the tome commence:
K/V modeling
ZNodes on the DirectoryLayer
In terms of key/value design, a ZK node is largely modeled on top of the DirectoryLayer, which maps quite well to the problem. The DL reduces key size (at the cost of more reads), and already enforces invariants that align with the ZK API – e.g. it won’t let you create a subpath unless the parent directory already exists (no mkdir -p). The fact that the DL’s prefixes are assigned per-path also means that the full keyspace will be used, even for large directory trees. This reduces locality, but for the access patterns I’ve contemplated, it’s a potential win for distributing load.
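To make that concrete, here’s a minimal sketch of creating a znode’s backing directory with the Java bindings (illustrative only – the API version and path are placeholders, not the actual fdb-zk code):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.directory.DirectoryLayer;
import com.apple.foundationdb.directory.DirectorySubspace;

import java.util.Arrays;

public class CreateZNodeSketch {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(610);
        try (Database db = fdb.open()) {
            DirectoryLayer dir = DirectoryLayer.getDefault();
            // create() throws if the directory already exists, which lines up
            // with ZK's create semantics (a znode can only be created once)
            DirectorySubspace node = db.run(tr ->
                dir.create(tr, Arrays.asList("locks", "some-id")).join());
            System.out.println("assigned prefix: " + Arrays.toString(node.getKey()));
        }
    }
}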
ZNode stats
For an individual ZK node, the DirectoryLayer-assigned subspace is then used to store a variety of keys that map to node values. All of the stat entries, such as creation txn id, last-modified txn id, data size, etc. each have their own key within the node’s subspace: directoryLayer(zk_node_path):stat_key_name --> stat_key_value. Fetching the full stat is then a simple range scan: directoryLayer(zk_node_path):stat_key*, and individual stat entries can be fetched directly.
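A rough sketch of that layout with the Java bindings (the "stat" / "czxid" names here are placeholders, not fdb-zk’s exact schema):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;

import java.util.List;

public class StatSketch {
    // One key per stat field: directoryLayer(zk_node_path):("stat", field) --> value
    static byte[] statKey(DirectorySubspace node, String field) {
        return node.pack(Tuple.from("stat", field));
    }

    // Fetching the full stat is a single range read over the "stat" prefix
    static List<KeyValue> readFullStat(Database db, DirectorySubspace node) {
        return db.read(tr -> tr.getRange(node.range(Tuple.from("stat"))).asList().join());
    }
}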
ZNode data
The node’s data is broken into 100kB chunks, which are pushed into a keyspace that boils down to directoryLayer(zk_node_path):data:blockN --> byte array. Fortunately, ZK has a pretty low default data-size cap, so we don’t have to do any shenanigans to avoid hitting FDB’s max-data-size-per-transaction limit.
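A sketch of the chunked write under those assumptions (again illustrative – the "data"/blockN key shape is a placeholder):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;

import java.util.Arrays;

public class DataChunkSketch {
    private static final int BLOCK_SIZE = 100_000; // ~100kB per value

    // Writes node data as directoryLayer(zk_node_path):("data", blockN) --> chunk.
    // ZK's default ~1MB data cap keeps the whole write comfortably under FDB's
    // 10MB per-transaction limit, so a single transaction suffices.
    static void writeData(Database db, DirectorySubspace node, byte[] data) {
        db.run(tr -> {
            for (int block = 0; block * BLOCK_SIZE < data.length; block++) {
                int start = block * BLOCK_SIZE;
                int end = Math.min(start + BLOCK_SIZE, data.length);
                tr.set(node.pack(Tuple.from("data", block)),
                       Arrays.copyOfRange(data, start, end));
            }
            return null;
        });
    }
}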
Global Transaction ID
One of the signature aspects of ZK is its global transaction id, which supplies a total ordering to all writes. To replace this, we use FDB’s versionstamp and write this value in every place that uses a ZK global id. While still monotonic (with caveats, more on this below), this differs from ZK, which begins at txn id 1 and increments by 1 each time. This unfortunately means it’s not super easy to run fdb-zk against ZK’s tests, which look for +1-incremented ids.
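For example, stamping a node’s last-modified txn id might look roughly like this (a sketch assuming API version 520+, where values can carry a versionstamp offset; the key/field names are placeholders):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class ZxidSketch {
    // Writes the 10-byte commit versionstamp wherever ZK would write a zxid
    static void stampLastModified(Database db, DirectorySubspace node) {
        db.run(tr -> {
            byte[] key = node.pack(Tuple.from("stat", "mzxid"));
            // The placeholder versionstamp is filled in by FDB at commit time
            byte[] value = Tuple.from(Versionstamp.incomplete()).packWithVersionstamp();
            tr.mutate(MutationType.SET_VERSIONSTAMPED_VALUE, key, value);
            return null;
        });
    }
}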
This is enough to get the basic CRUD operations on ZK nodes going, and some things fall into place really nicely. Fetching child nodes, for instance, is pretty trivial on top of the DirectoryLayer, since it already tracks the children of a given path.
Watches
Implementing ZK’s watches requires a bit more trickery: if someone leaves an on-creation watch on node /locks/some-id, how do we know what prefix the DirectoryLayer is going to assign for that node? We don’t, so instead we leave a watch on a key, e.g. directoryLayer(watch-subspace) : /locks/some-id : on-create, and when the node is created, the layer sets the value of this key to the current versionstamp, triggering the FDB watch (note: we use a versionstamp so we can guarantee the value is different each time, since rewriting the same value to a key won’t trigger a watch). Watches on child-creation and deletion are also supported.
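Roughly, the two halves of that look like this in the Java bindings (a sketch; the key shape and names are placeholders):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

import java.util.concurrent.CompletableFuture;

public class WatchSketch {
    // Watcher side: watch a well-known key derived from the ZK path, since the
    // DirectoryLayer prefix for a not-yet-created node isn't knowable in advance
    static CompletableFuture<Void> watchCreation(Database db, DirectorySubspace watchSubspace,
                                                 String zkPath) {
        return db.run(tr -> tr.watch(watchSubspace.pack(Tuple.from(zkPath, "on-create"))));
    }

    // Creator side: bump the watch key with a versionstamped value so the stored
    // value always changes, since FDB watches only fire when the value changes
    static void signalCreation(Transaction tr, DirectorySubspace watchSubspace, String zkPath) {
        tr.mutate(MutationType.SET_VERSIONSTAMPED_VALUE,
                  watchSubspace.pack(Tuple.from(zkPath, "on-create")),
                  Tuple.from(Versionstamp.incomplete()).packWithVersionstamp());
    }
}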
Client Sessions
Session handling is a bit of a WIP right now. The thought was to borrow a tip from the Record Layer, which does some cool stuff like leaving cursors inside FDB itself, and store all session information + heartbeat timestamps in another subspace. That’s what’s done right now, but it needs more TLC and testing. The rough schema is something like: directoryLayer(session_subspace) : client_id --> heartbeat timestamp, where client_id is derived from yet another versionstamp.
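Something like this, as a sketch (clientId is shown as a plain long for brevity; in fdb-zk it comes from a versionstamp, and the timestamp encoding is a placeholder):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;

public class SessionSketch {
    // directoryLayer(session_subspace) : client_id --> heartbeat timestamp
    static void heartbeat(Database db, DirectorySubspace sessions, long clientId) {
        db.run(tr -> {
            tr.set(sessions.pack(Tuple.from(clientId)),
                   Tuple.from(System.currentTimeMillis()).pack());
            return null;
        });
    }
}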
Actually running it
For now, the layer exists as a co-hosted process next to an application, and the application can continue to use its preferred ZK client pointed at localhost. This allows it to work in a language-agnostic, client-only manner, similar to the design of the Document Layer. If you wanted, the process could also live as an FDB coprocessor, and clients could hit that directly instead.
┌──────────────────────┐ ┌──────────────────────┐
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ Application │ │ │ │ Application │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ ZooKeeper │ │ ZooKeeper │
│ protocol │ │ protocol │
│ │ │ │ │ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌──────────────────┐ │ │ ┌──────────────────┐ │
│ │ fdb-zk service │ │ │ │ fdb-zk service │ │
│ └──────────────────┘ │ │ └──────────────────┘ │
└──────────────────────┘ └──────────────────────┘
│ │
FDB ops FDB ops
│ │
▼ ▼
┌───────────────────────────────────────────────────┐
│ FoundationDB │
└───────────────────────────────────────────────────┘
The layer sits behind a janky shim of the actual ZooKeeper server, which handles all request/response IO and translates calls into FDB operations. At the moment, this is definitely the weakest component, and the thing I’ve spent the least time on. The ZK server isn’t terribly heavyweight, but there’s no reason to run much of anything except the bare-essential network IO and an FDB client. The flip side: it meant not having to write ZK protocol network code… yet
Overall, I’ve gotten some Curator libs to sometimes work on it, but I’m missing something in the network / session handling code right now that can cause disconnects. I’ve also had issues running actual ZK locally, so I’m clearly doing something foolish ^^
Of course, if you’re aiming to replace ZK, a system whose purpose is simplicity and stability, you’d want a lot of testing.
Open questions (the interesting part!)
Ephemeral nodes
There are some elements of ZK functionality that are hard to emulate in a purely stateless-client environment. A key example of this is ephemeral nodes: given that clients can die at any time, who becomes responsible for cleaning up ephemeral nodes that are supposed to expire at client disconnect?
Spitballing ideas:
- Clients periodically scan the session subspace, look for ephemeral nodes whose sessions are past their prime, and remove them
- Deploy a dedicated fdb-zk client whose job is to handle central-server-like operations
- Deploy an fdb-zk layer coprocess on the FDB nodes themselves, responsible for central-server ops on their local key subspaces
- Do everything at read-time – if a client is asked to create a node that already exists, check the sessions table to see whether the owning session should have expired; if so, clean up and create the new node, if not, throw an error (a rough sketch of this option follows below)
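To illustrate that last option, a very hand-wavy sketch (the ephemeralOwner stat key, session schema, and timeout handling here are all hypothetical):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.directory.DirectorySubspace;
import com.apple.foundationdb.tuple.Tuple;

public class ReadTimeExpirySketch {
    // Before surfacing a "node exists" error on create, check whether the existing
    // ephemeral node's owning session has gone stale; if so, reclaim it in the same
    // transaction so the caller can retry the create.
    static boolean reclaimIfExpired(Database db, DirectorySubspace sessions,
                                    DirectorySubspace node, long sessionTimeoutMs) {
        return db.run(tr -> {
            byte[] owner = tr.get(node.pack(Tuple.from("stat", "ephemeralOwner"))).join();
            if (owner == null) {
                return false; // not ephemeral, nothing to reclaim
            }
            long ownerSession = Tuple.fromBytes(owner).getLong(0);
            byte[] heartbeat = tr.get(sessions.pack(Tuple.from(ownerSession))).join();
            long lastSeen = heartbeat == null ? 0L : Tuple.fromBytes(heartbeat).getLong(0);
            if (System.currentTimeMillis() - lastSeen > sessionTimeoutMs) {
                node.remove(tr).join(); // drops the directory entry and its contents
                return true;
            }
            return false; // owner still looks alive: report node-exists to the client
        });
    }
}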
Viability of versionstamps
Using versionstamps to replace ZK’s transaction id feels nice at first, but given the discussions in VersionStamp uniqueness and monotonicity, I’m not sure they’re truly a viable option. Trusting that the versionstamp will always be increasing is an important aspect of the design. Having a forever-monotonic id would be a really powerful construct.
ACLs
Full confession: despite trying to mimic ZK, I actually know very little about ZooKeeper in practice. I’m not familiar with how the ACLs are used, nor how we’d want to store & enforce them.
Watch Guarantees
ZK states this about watches (from https://zookeeper.apache.org/doc/r3.3.3/zookeeperProgrammers.html#ch_zkWatches):
Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensures that everything is dispatched in order.
A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.
I haven’t thought too much about these guarantees or looked into how they’re implemented, so I’m not sure using FDB’s watches the way fdb-zk currently does maintains the same semantics.
Misc. feedback on using FDB / Java bindings
So far I’m pretty happy with how quickly this has moved along, given that it’s 1x engineer working on it on the side, and a lot of that credit goes to the data model of FDB and the existence of the DirectoryLayer. Of course, there were a few things that came up that caused some confusion. I’m semi-hesitant to include them here since I don’t want to distract from the questions above, but, hey, here goes:
- Versionstamps need more documentation if they’re to be a first-class feature. The interplay between the commit version, transaction batch order, user bytes, and 80-bit vs 96-bit flavors is a lot to unravel. I leaned heavily on some existing forum posts like VersionStamp vs CommittedVersion, and it doesn’t seem too terrible to distill the learnings there down into some concise usage guides / API docs.
- Little-endian-only atomic operations were a surprise. All of the byte serialization was written using Guava, which uses big-endian for integer <–> byte array conversion, so this was a painful realization (see the little sketch at the end of this list). Would there be interest in supporting big-endian ADD / MIN / MAX natively as well? That would be enormously helpful for many projects that already live in a big-endian world.
- Handling versionstamps / commit versions in the Java bindings gets tricky. There have been a few posts on this, but returning a CompletableFuture that only completes after the transaction does can require a bit of contorting when that value is relevant to other aspects of the transaction / application operation.
- You can’t cancel a transaction from within fdb.run(tr -> lambda) without throwing an exception, since it’ll try to commit once the function completes, and that commit fails if the transaction has already been cancelled.
- Unit testing wasn’t super simple. I pushed the whole app into its own directory subspace, so every key is actually prefixed by directoryLayer(fdb-zk). Right now the tests blow away the whole directory and start over each time, but they could ideally reassign the root directory to directoryLayer(fdb-zk-test). An embedded test server would be nice.
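For reference on the endianness point above, the encoding the atomic ops expect looks like this (a sketch; counterKey is whatever key you’re accumulating into):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.MutationType;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AtomicAddSketch {
    // FDB's ADD / MIN / MAX mutations interpret their operands as little-endian
    // integers, so the delta has to be encoded little-endian even if the rest of
    // the app (e.g. Guava's Longs.toByteArray) lives in a big-endian world
    static void add(Database db, byte[] counterKey, long delta) {
        byte[] param = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(delta)
                .array();
        db.run(tr -> {
            tr.mutate(MutationType.ADD, counterKey, param);
            return null;
        });
    }
}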
Phew! That was a lot. Let me know if you have any thoughts on the design choices so far and ideas for tackling the open questions (and further questions I might be missing!). I am neither a ZK nor an FDB expert, so I’m sure there’s a lot in my blind spots. If this is interesting to anyone… PRs welcome