Record Layer Design Questions

I am currently exploring the design of the Record Layer from the ground up.

Rather than create multiple threads with my questions, I am hoping to create one thread.

My hope is that discussions in this thread will be helpful to me for my Rust layer and also for future layer developers as they try to build layers leveraging ideas from the Record Layer.

In the EndpointType type, there are two variants that are named TREE_START and TREE_END. I was wondering if there was any specific reason for naming the variants with the prefix of TREE_?

Did FoundationDB in the past have a limit on the number of key-values that could be read in a transaction?

The reason I ask is that the Record Layer includes a RecordScanLimiter as an “out-of-band” limiter.

While I understand the need for ByteScanLimiter and TimeScanLimiter given FoundationDB's design limitations, I was wondering if there was a reason for RecordScanLimiter to exist as an out-of-band limiter?

What TREE_ needs to indicate is that it’s something independent of the range keys passed to the API, and, moreover, the lowest / highest such thing, in context. And since that context might be an index or the primary extent, or really any other subtree in the record store, something neutral is required.
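
To make the distinction concrete, here is a minimal Rust sketch of how a layer might resolve endpoints against a subtree. It is not the Record Layer's implementation: the `low_bound` and `high_bound` helpers, the raw-byte key encoding, and the `0xFF` end sentinel are all assumptions made for illustration, and the two range variants are included only for contrast. The point is that `TreeStart` and `TreeEnd` ignore the caller's key entirely and depend only on the subtree being scanned.

```rust
/// Hypothetical mirror of an endpoint-type enum, reduced to four variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndpointType {
    TreeStart,
    TreeEnd,
    RangeInclusive,
    RangeExclusive,
}

fn concat(prefix: &[u8], suffix: &[u8]) -> Vec<u8> {
    let mut out = prefix.to_vec();
    out.extend_from_slice(suffix);
    out
}

/// First key strictly after `key` in byte order.
fn key_after(mut key: Vec<u8>) -> Vec<u8> {
    key.push(0x00);
    key
}

/// Inclusive begin key of the scan. `key` is ignored for `TreeStart`.
/// Treating `TreeEnd` as an invalid low endpoint is a choice of this sketch.
pub fn low_bound(subtree_prefix: &[u8], key: &[u8], endpoint: EndpointType) -> Option<Vec<u8>> {
    match endpoint {
        EndpointType::TreeStart => Some(subtree_prefix.to_vec()),
        EndpointType::RangeInclusive => Some(concat(subtree_prefix, key)),
        EndpointType::RangeExclusive => Some(key_after(concat(subtree_prefix, key))),
        EndpointType::TreeEnd => None,
    }
}

/// Exclusive end key of the scan. `key` is ignored for `TreeEnd`.
pub fn high_bound(subtree_prefix: &[u8], key: &[u8], endpoint: EndpointType) -> Option<Vec<u8>> {
    match endpoint {
        // Assumption: no key in the subtree has a suffix starting with 0xFF.
        EndpointType::TreeEnd => Some(concat(subtree_prefix, &[0xFF])),
        EndpointType::RangeInclusive => Some(key_after(concat(subtree_prefix, key))),
        EndpointType::RangeExclusive => Some(concat(subtree_prefix, key)),
        EndpointType::TreeStart => None,
    }
}
```

The same two calls work whether `subtree_prefix` names an index, the primary record extent, or any other subtree, which is why a neutral name fits.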

Records (or “rows”) scanned is a pretty good proxy for the amount of work done, particularly by a query. The current design, based partly on the underlying limitations of the key-value store, but also on a requirement to support many tenants with pretty high throughput, is that you must either limit the work done in that way, or else license the system to stop and pick up again later with even lower per-transaction thresholds.

Specifically, the number of keys scanned is a pretty good indicator of poor index definition / selection, that might still slip in under a byte limit (all the index entries / records are small) or time limit (you can get a lot read in a few seconds).
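
A small Rust sketch of that last point, under stated assumptions: `ScanLimits`, `StopReason`, `scan`, and `small_entries` are hypothetical names, not Record Layer types, and the limits are arbitrary. It shows a scan over many tiny entries that stays far below a byte budget yet is stopped by the record count.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopReason {
    RecordLimit,
    ByteLimit,
    SourceExhausted,
}

/// Hypothetical per-transaction budgets.
pub struct ScanLimits {
    pub max_records: usize,
    pub max_bytes: usize,
}

/// Scans entries in order and reports how many were read and why the scan stopped.
pub fn scan(entries: &[(Vec<u8>, Vec<u8>)], limits: &ScanLimits) -> (usize, StopReason) {
    let mut records = 0usize;
    let mut bytes = 0usize;
    for (key, value) in entries {
        if records >= limits.max_records {
            return (records, StopReason::RecordLimit);
        }
        if bytes >= limits.max_bytes {
            return (records, StopReason::ByteLimit);
        }
        records += 1;
        bytes += key.len() + value.len();
    }
    (records, StopReason::SourceExhausted)
}

/// Builds `n` tiny entries: a 4-byte key and a 4-byte value each.
pub fn small_entries(n: usize) -> Vec<(Vec<u8>, Vec<u8>)> {
    (0..n)
        .map(|i| ((i as u32).to_be_bytes().to_vec(), vec![0u8; 4]))
        .collect()
}
```

With 10,000 such entries (80,000 bytes in total) and a 1,000,000-byte budget, the byte limit never fires, so only the record limit signals that the query is touching far more keys than expected.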

For the setLow and setHigh methods on an instance of KeyValueCursor.Builder class, is there a scenario wherein we might set the argument lowEndpoint or highEndpoint to EndpointType.CONTINUATION?

Even though the API allows for this possibility, I was unable to think of a scenario where this could be useful. I therefore wanted to check whether I was missing some use case.

I think you’re right. That endpoint type emerges in the building of the key-value cursor by KeyValueCursor.Builder.setContinuation. It’s really kind of internal, but that’s hard to do without having two enums, which seemed like more work than it was worth.
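
For a Rust layer, the single-enum approach could look roughly like the sketch below. Every name here is hypothetical, and the rule that a reverse scan replaces the high endpoint rather than the low one is an assumption of the sketch, not something confirmed in this thread. Callers only ever pass range or tree endpoints; the `Continuation` variant is installed by the builder itself when a continuation is supplied.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndpointType {
    TreeStart,
    TreeEnd,
    RangeInclusive,
    RangeExclusive,
    /// Intended to be set only by the builder, never by callers.
    Continuation,
}

pub type Endpoint = (Vec<u8>, EndpointType);

pub struct KeyValueCursorBuilder {
    low: Endpoint,
    high: Endpoint,
    continuation: Option<Vec<u8>>,
    reverse: bool,
}

impl KeyValueCursorBuilder {
    pub fn new() -> Self {
        Self {
            low: (Vec::new(), EndpointType::TreeStart),
            high: (Vec::new(), EndpointType::TreeEnd),
            continuation: None,
            reverse: false,
        }
    }

    pub fn set_low(mut self, key: &[u8], endpoint: EndpointType) -> Self {
        self.low = (key.to_vec(), endpoint);
        self
    }

    pub fn set_high(mut self, key: &[u8], endpoint: EndpointType) -> Self {
        self.high = (key.to_vec(), endpoint);
        self
    }

    pub fn set_reverse(mut self, reverse: bool) -> Self {
        self.reverse = reverse;
        self
    }

    pub fn set_continuation(mut self, continuation: &[u8]) -> Self {
        self.continuation = Some(continuation.to_vec());
        self
    }

    /// Returns the effective (low, high) endpoints of the scan.
    pub fn build(self) -> (Endpoint, Endpoint) {
        let Self { mut low, mut high, continuation, reverse } = self;
        if let Some(key) = continuation {
            // The continuation replaces whichever endpoint the scan starts from.
            if reverse {
                high = (key, EndpointType::Continuation);
            } else {
                low = (key, EndpointType::Continuation);
            }
        }
        (low, high)
    }
}
```

Splitting this into a public enum without `Continuation` and a private one with it would make the misuse unrepresentable, at the cost of the extra type mentioned above.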

In the design of CursorLimitManager, there is a notion of a “free initial pass”.

I was wondering if the reason the Record Layer needs to provide a free initial pass is that the serialized forms of RecordCursorStartContinuation and RecordCursorEndContinuation are the same?

As a consequence of the “free initial pass”, if a continuation is returned, it can be assumed that only RecordCursorEndContinuation would be returned.

In my design, I am explicitly modeling the notion of a start continuation and an end continuation that can be returned by the equivalent of RecordCursorResult<T>. The way I am thinking about this problem is this: suppose a cursor gets created, and an out-of-band limit gets triggered even before the first value can be read and processed. In such a case, a RecordCursorStartContinuation would be returned.

The API user can then pass the RecordCursorStartContinuation in a new transaction and proceed to read values.

Similarly, when the cursor has been exhausted, a RecordCursorEndContinuation would be returned, and if the API user calls setContinuation with a RecordCursorEndContinuation, the onNext method would keep returning a RecordCursorEndContinuation.

In such a scenario, is there a need to provide “free initial pass”?

If a scan returns a RecordCursorStartContinuation, then it didn’t make any progress at all, correct? What if it did that every time? The caller would be stuck. This is the situation the free pass tries to avoid. It might well be that nothing happens only sometimes, in which case there is (slow) progress overall. But we did not feel confident in that.
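
The progress argument can be sketched in Rust as follows. This is an illustration under assumptions, not the Record Layer's code: the struct below only borrows the `CursorLimitManager` name, `scan_once` is a hypothetical stand-in for one transaction's scan, and having the free pass still count against the budget is a choice of the sketch.

```rust
/// Hypothetical record-count limiter with an optional free initial pass.
pub struct CursorLimitManager {
    records_remaining: usize,
    used_initial_pass: bool,
}

impl CursorLimitManager {
    pub fn new(record_limit: usize, free_initial_pass: bool) -> Self {
        Self {
            records_remaining: record_limit,
            // With no free pass, behave as if it had already been used.
            used_initial_pass: !free_initial_pass,
        }
    }

    /// Returns true if the cursor may read one more record.
    pub fn try_record_scan(&mut self) -> bool {
        if !self.used_initial_pass {
            // The first read always succeeds, even with an exhausted budget.
            self.used_initial_pass = true;
            self.records_remaining = self.records_remaining.saturating_sub(1);
            return true;
        }
        if self.records_remaining == 0 {
            return false;
        }
        self.records_remaining -= 1;
        true
    }
}

/// One "transaction": reads from `resume_at` until the limiter refuses.
/// Returns the rows read and the position to resume from.
pub fn scan_once(
    data: &[u32],
    resume_at: usize,
    limiter: &mut CursorLimitManager,
) -> (Vec<u32>, usize) {
    let mut rows = Vec::new();
    let mut pos = resume_at;
    while pos < data.len() && limiter.try_record_scan() {
        rows.push(data[pos]);
        pos += 1;
    }
    (rows, pos)
}
```

With a budget of zero, the version with the free pass still advances by one record per call, while the version without it returns the same start position every time, which is the stuck caller described above.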

The serialization is a little confusing, I agree. But, when necessary, they can be disambiguated with another bit of information. That is what the first_exhausted field in a UnionContinuation is about. It’s perfectly legal, perhaps even a regular occurrence, for one branch of a union to be at the start because all the values returned so far came from the other branch(es). To reach this state, the free pass might have been used on that branch to see the values that it then decided not to return. This is fundamental to unions: when continuations are involved, they do more work than the continuation can represent, so that work has to be repeated on the next call. But the top-level continuation is never still at the start position.

Thanks for the reply! :slight_smile: That’s correct. Here are my tests for this. In my implementation, I’m referring to RecordCursorStartContinuation as key_value_continuation_v0_begin_marker_bytes. I am thinking of a continuation as a pointer/marker that exists between two key-values. So, a range would consist of the number of keys covered by the range plus two (the begin marker and the end marker). An empty range would consist of only a begin marker and an end marker.

As you can see from the tests, when the begin marker bytes is passed, the range that gets created is no different to a range that gets created if no continuation was passed.
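
The marker model described above can be sketched in Rust like this. The `Continuation` enum, `markers`, and `resume_range` are hypothetical names chosen for illustration and are not the poster's actual implementation; keys are treated as raw bytes, and resuming after a key by appending `0x00` is an assumption of the sketch.

```rust
/// Hypothetical continuation positions for a forward scan.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Continuation {
    /// Before the first key; nothing has been read yet.
    Begin,
    /// Immediately after the given key.
    After(Vec<u8>),
    /// The range is known to be exhausted.
    End,
}

/// Every marker position for a range holding `keys`: one per key plus two.
pub fn markers(keys: &[Vec<u8>]) -> Vec<Continuation> {
    let mut out = vec![Continuation::Begin];
    out.extend(keys.iter().cloned().map(Continuation::After));
    out.push(Continuation::End);
    out
}

/// Computes the `[begin, end)` byte range still to be scanned.
/// Returns `None` when the continuation says nothing is left.
pub fn resume_range(
    begin: &[u8],
    end: &[u8],
    continuation: Option<&Continuation>,
) -> Option<(Vec<u8>, Vec<u8>)> {
    match continuation {
        // A begin marker resumes exactly where a fresh scan would start.
        None | Some(Continuation::Begin) => Some((begin.to_vec(), end.to_vec())),
        Some(Continuation::After(key)) => {
            // Resume at the first key strictly after `key` in byte order.
            let mut next = key.clone();
            next.push(0x00);
            Some((next, end.to_vec()))
        }
        Some(Continuation::End) => None,
    }
}
```

An empty range yields just `[Begin, End]`, and passing `Begin` produces the same range as passing no continuation at all.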

AFAICT, this will happen only if an out-of-band limit gets triggered every time in a specific code path. At the compute layer, since the goal is to have bounded transaction and query cost, I feel it's the responsibility of the API user to have properly analyzed the cost of the query.

In the worst-case scenario, the user-facing web services API would get rate-limited or return an error, in which case the developer can figure out what really went wrong and why the query was getting stuck.

It took me a while to figure it out :slight_smile: The continuation/cursor abstraction and effectively managing out-of-band limits seem to be a very important idea that layer developers need to be aware of. I am glad I invested the time to understand this design pattern. Once my implementation is complete, I am hoping to give a talk about this and my implementation in an upcoming meetup.

I’ve not yet gotten to the stage of understanding the UnionContinuation abstraction. Thanks again for answering my questions. I really appreciate it! :slight_smile: