Record Layer Design Questions

I am currently exploring the design of the Record Layer from the ground up.

Rather than create multiple threads with my questions, I am hoping to create one thread.

My hope is that discussions in this thread will be helpful to me for my Rust layer and also for future layer developers as they try to build layers leveraging ideas from the Record Layer.

In the EndpointType type, there are two variants named TREE_START and TREE_END. I was wondering if there was any specific reason for naming the variants with the TREE_ prefix?

Did FoundationDB in the past have a limit on the number of key-values that could be read in a transaction?

The reason I ask is that the Record Layer includes RecordScanLimiter as an “out-of-band” limiter.

While I understand the need for ByteScanLimiter and TimeScanLimiter given FoundationDB design limitations, I was wondering if there was a reason for RecordScanLimiter to exist as an out-of-band limiter?

What TREE_ needs to indicate is that it’s something independent of the range keys passed to the API, and, moreover, the lowest / highest such thing, in context. And since that context might be an index or the primary extent, or really any other subtree in the record store, something neutral is required.


Records (or “rows”) scanned is a pretty good proxy for the amount of work done, particularly by a query. The current design, based partly on the underlying limitations of the key-value store, but also on a requirement to support many tenants with pretty high throughput, is that you must either limit the work done in that way, or else license the system to stop and pick up again later with even lower per-transaction thresholds.

Specifically, the number of keys scanned is a pretty good indicator of poor index definition / selection, that might still slip in under a byte limit (all the index entries / records are small) or time limit (you can get a lot read in a few seconds).
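To make the idea concrete, here is a minimal sketch of an out-of-band record scan limiter. The names and shape are mine, not the actual Record Layer API: the cursor asks permission before each scan, and once the budget is exhausted it stops and hands back a continuation instead of a value.

```rust
// Hypothetical sketch, not the actual Record Layer API: an out-of-band
// limiter that budgets the number of records a cursor may scan.
pub struct RecordScanLimiter {
    remaining: u64,
}

impl RecordScanLimiter {
    pub fn new(limit: u64) -> Self {
        RecordScanLimiter { remaining: limit }
    }

    /// Try to consume one record scan. Returns false once the budget is
    /// exhausted; the cursor then stops and returns a continuation
    /// instead of a value, so the caller can resume in a new transaction.
    pub fn try_record_scan(&mut self) -> bool {
        if self.remaining == 0 {
            return false;
        }
        self.remaining -= 1;
        true
    }
}
```

Because the budget is in records rather than bytes or time, it catches exactly the poor-index case described above: lots of tiny entries scanned quickly.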


For the setLow and setHigh methods on an instance of the KeyValueCursor.Builder class, is there a scenario wherein we might set the argument lowEndpoint or highEndpoint to EndpointType.CONTINUATION?

Even though the API allows for this possibility, I was unable to think of a scenario where this could be useful, so I wanted to check whether I was missing some use case.

I think you’re right. That endpoint type emerges in the building of the key-value cursor by KeyValueCursor.Builder.setContinuation. It’s really kind of internal, but that’s hard to do without having two enums, which seemed like more work than it was worth.


In the design of CursorLimitManager, there is a notion of a “free initial pass”.

I was wondering if the reason the Record Layer needs to provide a free initial pass is that the serialized forms of RecordCursorStartContinuation and RecordCursorEndContinuation are the same.

As a consequence of the “free initial pass”, if a continuation is returned, it can be assumed that only a RecordCursorEndContinuation would be returned.

In my design, I am explicitly modeling the notion of a start continuation and end continuation that can be returned by the equivalent of RecordCursorResult<T>. The way I am thinking about this problem is that suppose a cursor gets created, and an out-of-band limit gets triggered even before the first value can be read and processed. In such a case, a RecordCursorStartContinuation would be returned.

The API user can then pass the RecordCursorStartContinuation in a new transaction and then proceed to read values.

Similarly, when the cursor has been exhausted, a RecordCursorEndContinuation would get returned, and if the API user tries to do a setContinuation using a RecordCursorEndContinuation value, the onNext method would keep returning RecordCursorEndContinuation.

In such a scenario, is there a need to provide “free initial pass”?

If a scan returns a RecordCursorStartContinuation, then it didn’t make any progress at all, correct? What if it did that every time? The caller would be stuck. This is the situation the free pass tries to avoid. It might well be that nothing happens only sometimes, in which case there is (slow) progress overall. But we did not feel confident in that.
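The behavior described here can be sketched as a toy model (not the real CursorLimitManager): even if the out-of-band budget is already exhausted, the first scan of a cursor’s lifetime is always permitted, so every resumption makes at least one step of progress and the caller can never be stuck at the start forever.

```rust
// Toy model of the "free initial pass", not the real CursorLimitManager.
pub struct CursorLimitManager {
    records_remaining: u64,
    used_initial_pass: bool,
}

impl CursorLimitManager {
    pub fn new(record_limit: u64) -> Self {
        CursorLimitManager {
            records_remaining: record_limit,
            used_initial_pass: false,
        }
    }

    pub fn try_record_scan(&mut self) -> bool {
        if self.records_remaining > 0 {
            self.records_remaining -= 1;
            self.used_initial_pass = true;
            true
        } else if !self.used_initial_pass {
            // Free initial pass: permit exactly one scan so the cursor
            // cannot return a start continuation every time.
            self.used_initial_pass = true;
            true
        } else {
            false
        }
    }
}
```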

The serialization is a little confusing, I agree. But, when necessary, they can be disambiguated with another bit of information. That is what the first_exhausted field in a UnionContinuation is about. It’s perfectly legal, perhaps even a regular occurrence, for one branch of a union to be at the start because all the values returned so far came from the other branch(es). To reach this state, the free pass might have been used on that branch to see the values that it then decided not to return: this is fundamental to unions; they do more work when there are continuations than is representable and so that therefore needs to be repeated when called again. But the top-level continuation is never still at the start position.

Thanks for the reply! :slight_smile: That’s correct. Here are my tests for this. In my implementation, I’m referring to RecordCursorStartContinuation as key_value_continuation_v0_begin_marker_bytes. I am thinking of a continuation as a pointer/marker that exists between two key values. So, a range would consist of the number of keys covered by the range plus two (the begin marker and end marker). An empty range would consist of only a begin marker and end marker.

As you can see from the tests, when the begin marker bytes are passed, the range that gets created is no different from the range created when no continuation is passed.
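In rough Rust terms (the names and the marker model are mine), the resolution I’m describing looks like this: a begin marker behaves exactly like passing no continuation, while any other continuation resumes just after the last key returned.

```rust
// Sketch of my assumed marker model, not the Record Layer's API: resolve a
// continuation into the begin key of the next scan range.
pub fn begin_key(low: &[u8], continuation: Option<&[u8]>, begin_marker: &[u8]) -> Vec<u8> {
    match continuation {
        // No continuation: scan from the range's low key.
        None => low.to_vec(),
        // Begin marker: identical to passing no continuation at all.
        Some(c) if c == begin_marker => low.to_vec(),
        // Otherwise: resume strictly after the last key returned, by
        // appending 0x00, the smallest possible byte-string extension.
        Some(last_key) => {
            let mut k = last_key.to_vec();
            k.push(0x00);
            k
        }
    }
}
```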

AFAICT, this would happen if an out-of-band limit got triggered every time in a specific code path. At the compute layer, since the goal is to have bounded transaction and query cost, I feel it’s the responsibility of the API user to have properly analyzed the cost of the query.

In the worst-case scenario, the user-facing web services API would get rate-limited or return an error, in which case the developer can figure out what really went wrong and why the query was getting stuck.

It took me a while to figure it out :slight_smile: The continuation/cursor abstraction and effective management of out-of-band limits seem to be very important ideas that layer developers need to be aware of. I am glad I invested the time to understand this design pattern. Once my implementation is complete, I am hoping to give a talk about this and my implementation at an upcoming meetup.

I’ve not yet gotten to the stage of understanding UnionContinuation abstraction. Thanks again for answering my questions. I really appreciate it! :slight_smile:

@MMcM I was wondering if you might be able to share the motivation for inclusiveNull and exclusiveNull tests in KeyValueCursorTest.java.

In what context would a user of the cursor API attempt to build such a cursor?

Since the design of my cursor is very similar to the Record Layer’s cursor design, I was able to create a similar test case in Rust, but I was unable to understand the motivation for it when writing the test case.

If I am not mistaken, that block of tests was added to capture how the API behaved and to guard against regressions, as opposed to all the cases it covers being genuinely useful.

In particular, exclusive high null is kind of weird. It might be equally reasonable, in something new, to give an error for this case. Some of the time it generates a malformed Range, that is, one in which the begin is not <= end. The Java binding doesn’t complain; you just don’t get any data.

Null for low is easy. In fact, it doesn’t much differ from the empty byte string. Null for high is trying to represent something that doesn’t have such a concrete representation, since there is no (inherent) limit on the length of a key.

Another possibility is that the code is trying too hard to have a meaningful distinction between inclusive and exclusive in this case. They differ by a single point; but that point does not exist in any real enumeration of key-value pairs.


In terms of API design, you have a Subspace, meaning a key prefix understood as an interval, and a byte string monad. Some possibilities:

  1. Do away with endpoint types; begin is always inclusive and end is always exclusive. To get exclusive begin or inclusive end, use the byte string successor function. The None value represents the successor of the subspace prefix key itself (when converted to a Range by the key value cursor implementation).
  2. Accept both inclusive and exclusive endpoint types with None for that.
  3. Accept inclusive; exclusive is an error, on the grounds that every possible key is included.
  4. The other way around, on the grounds that the successor key is outside the subspace and so cannot be included.
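Option 1’s byte string successor operations could be sketched like this. The prefix successor mirrors the strinc-style helper in the FDB bindings; the code here is illustrative, not the bindings’ implementation.

```rust
// Illustrative sketch of the two successor operations: the immediate
// successor of a key in byte-string order, and the first key after every
// key sharing a given prefix (strinc-style, as in the FDB bindings).
pub fn key_successor(key: &[u8]) -> Vec<u8> {
    // k followed by 0x00 is the smallest key strictly greater than k.
    let mut k = key.to_vec();
    k.push(0x00);
    k
}

pub fn prefix_successor(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut p = prefix.to_vec();
    while let Some(&last) = p.last() {
        if last == 0xFF {
            // 0xFF cannot be incremented; drop it and carry left.
            p.pop();
        } else {
            *p.last_mut().unwrap() = last + 1;
            return Some(p);
        }
    }
    None // the prefix was all 0xFF bytes: nothing sorts after it
}
```

With these, an exclusive begin at key k is the same as an inclusive begin at key_successor(k), and an inclusive end at k is the same as an exclusive end at key_successor(k).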

While studying the design of the FDBRecordVersion type, I found that the documentation says,

Together, these two versions are combined into a 12 byte array that can be used to impose a total order across all records.

From what I understand, the C API level 10-byte “versionstamp” (the transaction’s commit version (8 bytes) plus the transaction’s batch order (2 bytes)) can give us a total ordering of transactions within a single FDB cluster.

We combine this information with transaction level ordering in the Tuple layer.
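A sketch of how the pieces pack into the 12-byte array: using big-endian encoding makes byte-wise comparison of the arrays agree with (commit version, batch order, local order) ordering.

```rust
// Sketch of packing the 10-byte versionstamp (8-byte commit version plus
// 2-byte batch order) together with a 2-byte in-transaction (local) order
// into a 12-byte array whose lexicographic comparison matches the
// (commit, batch, local) ordering.
pub fn record_version(commit_version: u64, batch_order: u16, local_order: u16) -> [u8; 12] {
    let mut v = [0u8; 12];
    v[0..8].copy_from_slice(&commit_version.to_be_bytes());
    v[8..10].copy_from_slice(&batch_order.to_be_bytes());
    v[10..12].copy_from_slice(&local_order.to_be_bytes());
    v
}
```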

However, from what I can make out, this ordering is not guaranteed to hold when we move records between FDB clusters.

Therefore, wouldn’t we also need to somehow capture “generation” information here in order to have a total ordering of values of FDBRecordVersion type?

Is this information captured elsewhere in the RecordLayer?

There is currently no support for doing this correctly in Record Layer itself. There could be, but it is not required that it be there in the same way as FDBRecordVersion.

One scheme, in fact the one described in our SIGMOD paper, is to have a count of the number of cross-cluster moves. When saving, this is copied into a “system field” of the record, next to the in-cluster version. Then the total ordering is from an index that concatenates that counter with the version. Any record written after a move comes after all records from the previous place. The record data itself does not need to be affected by the move, except for the single “system record” that holds the counter (and similar information about the record store that informs catalogue-like lookups).
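A minimal sketch of that ordering (the u64 counter type is my choice for illustration, not something specified in the paper):

```rust
// Sketch of the scheme described above: a total order across cluster moves,
// obtained by comparing the move counter first, then the in-cluster version.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
pub struct GlobalVersion {
    pub move_count: u64,   // incremented on every cross-cluster move
    pub version: [u8; 12], // FDBRecordVersion-style in-cluster version
}
```

The derived Ord compares fields in declaration order, so any version written after a move sorts after every version from before it.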


As mentioned in the following post

It looks to me like the Record Layer does not track the 10MB limit for transactions and makes the higher-level code responsible for handling the “transaction_too_large (2101)” error. I just want to confirm that this is still the case.

While exceeding the 10MB limit would result in a transaction too large error, in our application, we could inadvertently exceed the 1MB limit, thereby causing a performance issue.

I was wondering if some mechanism exists within Record Layer and FoundationDB that would let us log such transactions, so we can investigate them later on?

The only mechanism that I was able to think of was to set the transaction option SizeLimit to 1MB, but that would fail any transaction larger than 1MB. Is there some option that would allow us to log transactions larger than 1MB, but not fail them?

Yes, that’s right: the Record Layer does not independently track the size of transactions, and instead it will bubble up the transaction_too_large error (as an FDBStoreTransactionSizeException). The main reason for this is that the calculation of transaction size is a little delicate and includes FDB-internal data (like conflict ranges) that can be somewhat difficult to reason about, so replicating the exact logic from FDB in a layer above can be error-prone.

That being said, the FDBRecordContext class does expose a method, getApproximateTransactionSize(), which asks the underlying transaction approximately how big the transaction already is. (It’s approximate because mutations and conflict ranges are coalesced by the FDB client (so, for example, if you set a key multiple times, only the final set is actually sent to the FDB server), and the approximate transaction size method can, for efficiency of calculation, overestimate the final transaction size by not taking future updates into account. It also returns a future; it makes no network calls, though it does end up enqueuing a task on the FDB client’s task queue, so it can be delayed if the FDB client is too busy.) That can be used as a sort of upper bound on transaction size, so if you are doing a bunch of mutations where atomicity is flexible, you can use the transaction size to guard the size of each individual transaction. The Record Layer itself actually uses this during index builds, for example: fdb-record-layer/IndexingBase.java at 238a4f27d672082c122be2f402d698ca2f292dba · FoundationDB/fdb-record-layer · GitHub
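As an illustration of that pattern, here is a control-flow sketch of size-guarded batching. The per-item sizes are passed in directly as a stand-in for writing an item and then probing the transaction’s approximate size (via something like getApproximateTransactionSize()); nothing here is a real Record Layer or FDB API.

```rust
// Hypothetical sketch of size-guarded batching when atomicity across items
// is flexible. `soft_limit` would be something like 1 MiB, well under the
// 10 MB hard limit. Returns the groups of items that would share a
// transaction.
pub fn batch_by_size(item_sizes: &[u64], soft_limit: u64) -> Vec<Vec<u64>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    let mut approx_size = 0u64;
    for &size in item_sizes {
        current.push(size);
        approx_size += size; // in reality: probe the transaction after the write
        if approx_size >= soft_limit {
            // "Commit" the current transaction and start a fresh one.
            batches.push(std::mem::take(&mut current));
            approx_size = 0;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```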

In terms of logging large transactions, fdbserver will already log any large transactions with the LargeTransaction trace event: foundationdb/CommitProxyServer.actor.cpp at dd0036f09cac804dbfc8a5b2ee87881c3cdfad55 · apple/foundationdb · GitHub By default, it logs if the transaction exceeds 2 MiB, though you can adjust that knob via command line arguments to fdbserver: foundationdb/Knobs.cpp at dd0036f09cac804dbfc8a5b2ee87881c3cdfad55 · apple/foundationdb · GitHub

There are also client-side logs, which use the same knob, but they have a bit more useful information, including the transaction’s debug ID: https://github.com/apple/foundationdb/blob/dd0036f09cac804dbfc8a5b2ee87881c3cdfad55/fdbclient/NativeAPI.actor.cpp#L6668 So, assuming you have client trace logs set up (configured by FDBDatabaseFactory::setTrace) and are setting the transaction ID to something that you can then use to correlate with your own logs (via FDBRecordContextConfig::setTransactionId), you should be able to look for large transactions. Note that changing the limit on the client-side logs is difficult, because client knobs aren’t really exposed, unlike server knobs.


Thanks @alloc for the reply and Happy New Year 2023! :slight_smile:

As I was exploring the design of a Rust equivalent of the FDBRawRecord type, I noticed that it might be beneficial to have a simple schema/type system for specifying the semantics of a tuple.

What got me thinking in this direction was that not all tuple forms would be “valid” in all circumstances, and it would be better to disallow the creation of invalid tuple forms or, when invalid tuple forms are read from the database, to catch that error as early as possible.

For example:

  • Having a value of null (unit) type or a value of Versionstamp type does not really make sense for a primary key tuple. null is used to indicate a “missing value”, and a primary key for a missing value is a bit weird.

  • For indexes, we can have a null value. But it might be beneficial to know the nature of the null: is it a missing value of an Integer, ByteArray, String, etc.?

I’ve therefore created two types, TupleSchemaElement and TupleSchema, and a simple function to check the validity of a tuple against its schema.

I was wondering if this design was previously explored by the Record Layer team. If so, can you please let me know what pros and cons were discussed?
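For reference, a stripped-down sketch of the kind of check I mean. The value enum is a simplified stand-in for real tuple values, and the schema records what type a null “would have been”:

```rust
// Stripped-down sketch of the schema check; not real tuple-layer types.
pub enum TupleValue {
    Null,
    Integer(i64),
    Text(String),
    Bytes(Vec<u8>),
}

pub enum TupleSchemaElement {
    Integer,
    Text,
    Bytes,
    MaybeInteger, // Integer, or a null standing in for a missing Integer
}

// A tuple conforms to a schema if it has the same arity and every element
// matches the corresponding schema element.
pub fn conforms(schema: &[TupleSchemaElement], tuple: &[TupleValue]) -> bool {
    schema.len() == tuple.len()
        && schema.iter().zip(tuple).all(|(s, v)| {
            matches!(
                (s, v),
                (TupleSchemaElement::Integer, TupleValue::Integer(_))
                    | (TupleSchemaElement::Text, TupleValue::Text(_))
                    | (TupleSchemaElement::Bytes, TupleValue::Bytes(_))
                    | (TupleSchemaElement::MaybeInteger, TupleValue::Integer(_))
                    | (TupleSchemaElement::MaybeInteger, TupleValue::Null)
            )
        })
}
```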

A couple of observations on your examples, which I realize are not central to your feature. First, there are subtleties, but it might be useful to have versionstamps in primary keys, even though the Record Layer does not support it today. Second, the Record Layer does support null as a primary key element, as well as an indexed value. This is less about having “unknown” information there and more about null being a distinguished member of every type domain. This turns out to sometimes be easier to deal with than special-casing the empty string or zero. It doesn’t come for free, though: we had to introduce null stand-ins and the distinction between unique and non-unique nulls, which, interestingly enough, Postgres 15 added to its SQL.

To your main point, yes, I think that is a good idea. The closest thing you will find today is ...cascades.typing.Type.Record, which is used by the Cascades planner to reason about both records and their materialized view index projections.

I think that there are a few reasons why we haven’t had it in the mainline yet.

  • We don’t treat streams of index values as first-class results. This is a mistake; you’ll find a bunch of PR comments referring to this. For covering index scans, we squeeze the values the index does have into the original record structure, which is particularly odd when it comes to repeated nested fields, because we only populate the one(s) that are in the index. Scans should be streams of records in the type sense above, not in the strict Protobuf sense, although that is an important special case.
  • Java does not have first-class typed tuples, other, perhaps, than Pair<x,y>.
  • Java does have just-good-enough reflection that you can postpone reasoning about things until pretty late at runtime.

I don’t know what you are thinking about for your runtime execution. But I wonder whether you might want to make sure that your tuple schema is compatible with native tuples. As far as I know, the Rust FDB bindings do not have serde-style serialization and deserialization of tuples into corresponding structs. But they could, I think, along with prost-generated Protobuf.

Record Layer has some modest type safety along these lines with typed record stores, but they are sometimes challenging to work with because Java generics sometimes are.