For the past few months, I’ve been working on a Record Layer-like crate (library) for Rust.
Thanks to the help provided by @MMcM and @alloc, the design of some of the foundational types and traits (“interfaces” in Java/Go) is now done. While there is still a lot of work ahead to make the Rust library usable, I’ve opened the repository and the documentation.
I wanted to share what I have learned thus far, as I think it can be helpful to others who might be embarking on writing layers on top of FoundationDB. This post is written from a beginner’s perspective. It is my view that understanding these ideas will help in developing more functional-style code on top of FoundationDB, irrespective of the programming language that you might be using.
I would really appreciate corrections and feedback from FoundationDB and Record Layer developers on this post.
Motivation
When developing layers on top of FoundationDB, after learning about the APIs and working through the tutorials, we will discover two things about FoundationDB that are different from other databases.
In other systems, the concurrency issue would typically be partitioned into application side concurrency and database side concurrency.
- Application side concurrency would be handled by the application or web framework along with an ORM.
- Database side concurrency would be dependent on the database that we would be using. For example, with PostgreSQL, we would use a connection pooler like PgBouncer and, based on our workload, increase the max_connections parameter.
In the case of FoundationDB, the binding layer authors would have done most of the heavy lifting required to integrate the C library with the language specific async runtime for handling concurrency. However, the responsibility for working around FoundationDB limitations falls on layers that use the binding layer.
Most of the information contained in this post is about the infrastructure and design that the Record Layer uses to work around the FoundationDB limitations.
There are three main limitations that we will need to keep in mind.
Record Layer provides well designed APIs to handle the 5s transaction time limit and to ensure that the value size limit of 100KB cannot be exceeded.
It does not provide APIs to check the key size limit of 10KB, nor does it provide APIs to check the 10MB mutation size limit. You will need to benchmark your applications to ensure that you are not exceeding these limits. In any case, you need to be aware of performance considerations with regard to key and value size.
On the topic of performance, we should also aggressively exploit any opportunity that might exist to do pipelining within a transaction. I won’t touch on pipelining in this post. However, I would like to briefly mention that we would need to use some form of sub-task mechanism within the async runtime to implement it (in Tokio/Rust it would be a JoinSet).
Design
Cursor
Every major database has a notion of a Cursor (Here I am linking to IBM IMS database and COBOL just to make the point that the idea of cursor goes back to some of the earliest databases, much before the advent of relational databases).
Today, as application developers, we seldom encounter a database cursor directly in our code, so it might be foreign to us. However, cursors are still present and used when writing stored procedures for relational databases.
The idea of a cursor is very generic and can be used at different levels of abstraction within a database system. In fact, at the lowest level, the SQLite B-tree used by FoundationDB storage servers to store and retrieve key-values also uses cursors. (See struct BtCursor in this file).
The first important aspect of a cursor is that it is an abstraction that can be composed. Primitive cursors can be composed to build more complex and useful cursors.
Java RecordLayer uses this to provide multiple types of cursors. In Rust RecordLayer, at the moment we only have a primitive KeyValueCursor, but we have the infrastructure to implement additional types of cursors in the future.
The Record Layer implements a cursor API which, like normal database cursors, returns some value.
The second important aspect that is specific to Record Layer cursor API implementation is its ability to be aware of the FoundationDB limitations.
In the APIs the terminology of out-of-band error is used to indicate error conditions that might occur due to FoundationDB limitations.
The third important aspect of the Cursor API is its support for continuations. A continuation can be used to rebuild the cursor state across transactions.
Here are links to the documentation for
Once we have a composable cursor that is aware of out-of-band limits and returns a continuation in addition to the cursor value, the process of working around FoundationDB limitations becomes tractable and embeddable within the type system.
We can now write our code easily and confidently because we will get errors at compile time in case we are not correctly handling the out-of-band limit conditions.
In summary, the following are the key aspects of Record Layer cursors.
- Composability of cursors.
- Ability to handle out-of-band limit conditions by leveraging the type system.
- Continuations which can be used to recreate the cursor state in a different transaction.
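To make these three aspects concrete, here is a minimal sketch in Rust. It is not the API of either Record Layer: the names CursorResult, MapCursor and VecCursor, the variant names of NoNextReason, and the synchronous next method are all assumptions made for illustration (a real cursor would be async and would read from FoundationDB rather than from a Vec).

```rust
/// Opaque bytes from which a cursor can be rebuilt in a later transaction.
pub type Continuation = Vec<u8>;

/// Why a cursor stopped. Variant names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NoNextReason {
    /// No more data in the underlying range.
    SourceExhausted,
    /// An out-of-band limit (here, a key-value scan budget) was hit.
    ScanLimitReached,
}

/// Each step yields a value or a stop reason, always with a continuation.
#[derive(Debug, PartialEq)]
pub enum CursorResult<T> {
    Next { value: T, continuation: Continuation },
    NoNext { reason: NoNextReason, continuation: Continuation },
}

/// Hypothetical synchronous cursor trait (a real one would be async).
pub trait Cursor: Sized {
    type Item;
    fn next(&mut self) -> CursorResult<Self::Item>;

    /// Composition: build a new cursor that transforms each value.
    fn map<U, F: FnMut(Self::Item) -> U>(self, f: F) -> MapCursor<Self, F> {
        MapCursor { inner: self, f }
    }
}

pub struct MapCursor<C, F> {
    inner: C,
    f: F,
}

impl<C: Cursor, U, F: FnMut(C::Item) -> U> Cursor for MapCursor<C, F> {
    type Item = U;
    fn next(&mut self) -> CursorResult<U> {
        // The match has to cover both variants, so forgetting the limit
        // case is a compile error rather than a runtime surprise.
        match self.inner.next() {
            CursorResult::Next { value, continuation } => CursorResult::Next {
                value: (self.f)(value),
                continuation,
            },
            CursorResult::NoNext { reason, continuation } => {
                CursorResult::NoNext { reason, continuation }
            }
        }
    }
}

/// Primitive cursor over an in-memory Vec, standing in for a range read.
pub struct VecCursor {
    items: Vec<u32>,
    pos: usize,
    budget: usize,
}

impl VecCursor {
    /// `continuation` holds the position to resume from, if any.
    pub fn new(items: Vec<u32>, continuation: Option<&Continuation>, budget: usize) -> Self {
        let pos = match continuation {
            Some(bytes) => {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(bytes); // panics on a malformed continuation
                u64::from_be_bytes(buf) as usize
            }
            None => 0,
        };
        VecCursor { items, pos, budget }
    }

    fn continuation(&self) -> Continuation {
        (self.pos as u64).to_be_bytes().to_vec()
    }
}

impl Cursor for VecCursor {
    type Item = u32;
    fn next(&mut self) -> CursorResult<u32> {
        if self.pos >= self.items.len() {
            let continuation = self.continuation();
            return CursorResult::NoNext { reason: NoNextReason::SourceExhausted, continuation };
        }
        if self.budget == 0 {
            let continuation = self.continuation();
            return CursorResult::NoNext { reason: NoNextReason::ScanLimitReached, continuation };
        }
        let value = self.items[self.pos];
        self.pos += 1;
        self.budget -= 1;
        CursorResult::Next { value, continuation: self.continuation() }
    }
}
```

When the budget runs out, the caller receives a continuation and can build a fresh cursor from it, which is the shape of "resume in a new transaction" in this toy setting.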
Now that I’ve motivated the need for the Record Layer cursor, we can look into its implementation.
The only primitive that we have to read data from FoundationDB is a range read. All reads are built on top of this primitive, including our cursors.
So, in order to build our cursor, we need a way to combine range reads and FoundationDB limits. Record Layer does this using the ScanLimiter type.
There are three types of scan limiters within the Record Layer.
- TimeScanLimiter (Java, Rust)
- ByteScanLimiter (Java, Rust)
- RecordScanLimiter (Java), KeyValueScanLimiter (Rust). Rust RecordLayer does not implement a semantic equivalent of Java RecordScanLimiter.
Of the three scan limiters, the most important one is the TimeScanLimiter, as it helps us work around the 5s transaction time limit.
When the cursor is asked to produce a value, there can be two situations.
- On success (Java, Rust), it produces a value and a continuation.
- On error, a NoNextReason (Java, Rust) is provided along with a continuation.
You will notice that there is a correspondence between the NoNextReason and scan limiter types.
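The correspondence between limiters and stop reasons can be sketched roughly as follows. This is only an illustration under stated assumptions: ScanLimits, its check and record methods, and the variant names are hypothetical and do not mirror the Java or Rust Record Layer types, and the current time is passed in explicitly so the behaviour is easy to reason about.

```rust
use std::time::{Duration, Instant};

/// Stop reasons, one per limiter. Variant names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NoNextReason {
    TimeLimitReached,
    ByteLimitReached,
    KeyValueLimitReached,
}

/// Hypothetical bundle of the three limiters described above.
pub struct ScanLimits {
    started: Instant,
    time_limit: Duration,
    bytes_left: usize,
    keyvalues_left: usize,
}

impl ScanLimits {
    pub fn new(
        started: Instant,
        time_limit: Duration,
        byte_limit: usize,
        keyvalue_limit: usize,
    ) -> Self {
        ScanLimits {
            started,
            time_limit,
            bytes_left: byte_limit,
            keyvalues_left: keyvalue_limit,
        }
    }

    /// Called before reading the next key-value. Returns the reason the scan
    /// must stop, or None if it may continue.
    pub fn check(&self, now: Instant) -> Option<NoNextReason> {
        if now.saturating_duration_since(self.started) >= self.time_limit {
            Some(NoNextReason::TimeLimitReached)
        } else if self.bytes_left == 0 {
            Some(NoNextReason::ByteLimitReached)
        } else if self.keyvalues_left == 0 {
            Some(NoNextReason::KeyValueLimitReached)
        } else {
            None
        }
    }

    /// Called after a key-value of `size` bytes has been scanned.
    pub fn record(&mut self, size: usize) {
        self.bytes_left = self.bytes_left.saturating_sub(size);
        self.keyvalues_left = self.keyvalues_left.saturating_sub(1);
    }
}
```

A cursor built this way would call check before each read and turn a Some(reason) into its "no next" result, together with a continuation.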
At the lowest level, range reads in FoundationDB are specified using KeySelectors. The documentation for KeySelectors mentions: “Note that the way the key selectors are resolved is somewhat non-intuitive, so users who wish to use a key selector other than the default ones described below should probably consult that documentation before proceeding.”
Correctly constructing KeySelectors in the presence of continuations to build a range read future can be very tricky. To avoid burdening users with this complexity, Record Layer provides two very useful types that are used in KeyValueCursorBuilder (Java, Rust).
These two types provide much easier-to-use APIs for working with ranges.
Just to highlight how subtle the problem with KeySelector and continuations can be, in Rust RecordLayer we have 158 integration tests to exhaustively test this code path. An important side benefit of this approach is that the “Key selectors with large offsets are slow” limitation is naturally taken care of.
Since a continuation encapsulates the state of a cursor so that it can be reconstructed at a later time, we need a mechanism to serialize continuation state. Java RecordLayer uses protobuf for this. In Rust RecordLayer we use Avro.
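To give a feel for the range-plus-continuation problem, here is a small sketch that uses an in-memory BTreeMap in place of FoundationDB. The function scan_forward and its signature are hypothetical. The continuation is simply the last key returned, and the exclusive lower bound plays roughly the role that a "first greater than" key selector plays in a real range read, while the inclusive lower bound corresponds to "first greater or equal". Reverse scans, which are where much of the subtlety lies, are not covered.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

pub type Store = BTreeMap<Vec<u8>, Vec<u8>>;
pub type Page = Vec<(Vec<u8>, Vec<u8>)>;

/// Reads up to `limit` pairs from the key range [begin, end).
/// If `continuation` is given, it must be a key previously returned from the
/// same range, and the scan resumes strictly after it.
/// Returns the page together with the continuation to use next time.
pub fn scan_forward(
    store: &Store,
    begin: &[u8],
    end: &[u8],
    continuation: Option<&[u8]>,
    limit: usize,
) -> (Page, Option<Vec<u8>>) {
    let lower = match continuation {
        // Resume after the last key that was already returned.
        Some(last) => Bound::Excluded(last.to_vec()),
        // First read: start at the beginning of the range, inclusive.
        None => Bound::Included(begin.to_vec()),
    };
    let upper = Bound::Excluded(end.to_vec());

    let page: Page = store
        .range((lower, upper))
        .take(limit)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();

    // If nothing was read, keep the old continuation so that the caller
    // never accidentally restarts from the beginning of the range.
    let next = page
        .last()
        .map(|(k, _)| k.clone())
        .or_else(|| continuation.map(|c| c.to_vec()));
    (page, next)
}
```

Each call stands in for one transaction: the caller stores the returned continuation and passes it to the next call.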
Split Helper
While the cursor helps us address the read side limitations of FoundationDB, the split helper helps us address the write side limitations with an intuitive API.
Here is the documentation of the split helper (Java, Rust).
The split helper provides an API that splits a value byte array into 100KB chunks across multiple keys for writes and deletes. These chunks are then reassembled correctly for reads. Since we already have the cursor abstraction available to us, we can use it for reads.
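The splitting and reassembly idea can be sketched as below, again with a BTreeMap standing in for the database. The functions save, load and delete, the (key, chunk index) key layout, and the chunk numbering starting at 0 are assumptions for illustration and are not the layout used by either Record Layer. The chunk size is a parameter so the behaviour is easy to try with small values.

```rust
use std::collections::BTreeMap;

/// Chunk size matching the 100KB figure mentioned above.
pub const CHUNK_SIZE: usize = 100_000;

/// Keys are (primary key, chunk index); this layout is hypothetical.
pub type Store = BTreeMap<(Vec<u8>, u32), Vec<u8>>;

/// Removes every chunk stored under `key`.
pub fn delete(store: &mut Store, key: &[u8]) {
    store.retain(|(k, _), _| k.as_slice() != key);
}

/// Writes `value` under `key`, split into chunks of at most `chunk_size`
/// bytes. `chunk_size` must be greater than zero.
pub fn save(store: &mut Store, key: &[u8], value: &[u8], chunk_size: usize) {
    // Clear old chunks first, so a shorter value leaves no stale tail.
    delete(store, key);
    if value.is_empty() {
        // Keep one empty chunk so an empty value differs from "missing".
        store.insert((key.to_vec(), 0), Vec::new());
        return;
    }
    for (i, chunk) in value.chunks(chunk_size).enumerate() {
        store.insert((key.to_vec(), i as u32), chunk.to_vec());
    }
}

/// Reassembles the value stored under `key`, or None if nothing is stored.
pub fn load(store: &Store, key: &[u8]) -> Option<Vec<u8>> {
    let range = (key.to_vec(), 0u32)..=(key.to_vec(), u32::MAX);
    // The range scan returns chunks in index order, so concatenating them
    // in iteration order restores the original bytes.
    let mut chunks = store.range(range).peekable();
    chunks.peek()?;
    Some(chunks.flat_map(|(_, v)| v.iter().copied()).collect())
}
```

In a real layer the read path would go through a cursor over the key's subspace rather than a direct map lookup, and CHUNK_SIZE would be passed as the chunk size.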
The split helper API also introduces the RecordVersion type (Java, Rust).
When we write the value byte array using the split helper, we also store the versionstamp using RecordVersion at the tuple offset of -1 of the key subspace.
Having versionstamp information allows us to reason about the order of writes within a transaction and also about order of transactions within a cluster. In the event we need to debug an issue, having this information available with us can be very helpful (we can also do something similar for our secondary indexes).
In the Rust RecordVersion implementation, we also track the “incarnation version”, so the ordering can be preserved as the value moves between FoundationDB clusters.
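One way to picture the ordering is a struct whose fields are compared from most significant to least significant. The sketch below is a guess at the shape, not the actual RecordVersion definition: the field names, the u64 incarnation, the 10-byte global version and the u16 local version are all assumptions made for illustration. It relies on the fact that a derived Ord in Rust compares fields in declaration order.

```rust
/// Hypothetical version type. The derived ordering compares the incarnation
/// first, then the cluster-assigned global version, then the
/// transaction-local version, because that is the field declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct RecordVersion {
    /// Bumped when data moves to a different cluster (assumed semantics).
    incarnation: u64,
    /// Bytes assigned at commit time; compared lexicographically.
    global_version: [u8; 10],
    /// Orders writes made within a single transaction.
    local_version: u16,
}

impl RecordVersion {
    pub fn new(incarnation: u64, global_version: [u8; 10], local_version: u16) -> Self {
        RecordVersion {
            incarnation,
            global_version,
            local_version,
        }
    }
}
```

With this shape, two writes in the same transaction differ only in the local version, writes in different transactions on one cluster differ in the global version, and a higher incarnation sorts after everything written before the move.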
RecordContext
The RecordContext type (Java, Rust) stores transaction specific state information. By centralizing transaction specific state information within one type, the rest of the APIs in the Record Layer can be written in a functional style.
We can see examples of this functional approach in the design of cursor and split helper APIs.
In the cursor API, we only specify a time limit. The cursor API is not really aware of the “5s” limit. Depending on when the cursor is created, its time limit could vary.
But we still need to track the elapsed time for our transaction, so the time limit for our cursors can be correctly specified. This information is maintained in record context.
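As a rough sketch of this bookkeeping, a context can remember when the transaction started and hand out whatever time budget is left. The RecordContext shown here is a stand-in, not the real type: the cursor_time_limit method, the safety margin parameter and the explicit now argument are assumptions for illustration.

```rust
use std::time::{Duration, Instant};

/// The 5s transaction time limit discussed above.
pub const TRANSACTION_TIME_LIMIT: Duration = Duration::from_secs(5);

/// Hypothetical context holding transaction specific state.
pub struct RecordContext {
    /// When the transaction began.
    started: Instant,
}

impl RecordContext {
    pub fn new(started: Instant) -> Self {
        RecordContext { started }
    }

    /// Time budget for a cursor created at `now`: whatever is left of the
    /// transaction limit, minus a safety margin, never below zero.
    pub fn cursor_time_limit(&self, now: Instant, margin: Duration) -> Duration {
        let elapsed = now.saturating_duration_since(self.started);
        TRANSACTION_TIME_LIMIT
            .saturating_sub(elapsed)
            .saturating_sub(margin)
    }
}
```

The cursor only ever sees the resulting Duration, which is what keeps it unaware of the 5s figure.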
Similarly, with the split helper and record version, we need a way to maintain the cluster incarnation version and the transaction local version. Once again, this is done using record context.
In essence, record context is what enables us to write our cursor, split helper and other record layer APIs in a functional style.
Conclusion
I’ve covered a lot of information in this post and I hope it will be helpful in developing a deeper understanding of the Record Layer.
I also wanted to have the design written up in this form so that we have something to point new Record Layer contributors to.
Once again, I would really appreciate any feedback and corrections.