How the Record Layer Provides APIs to Handle Large Range Scans Longer Than 5 Seconds

Hi,

Based on my understanding of the FDB Java Binding APIs, for a getRange() query that can potentially run over 5 seconds and hit the “transaction too old” exception, I can catch this transaction exception with code 1007 (https://apple.github.io/foundationdb/api-error-codes.html), create a new transaction, and then set the new query range to start from the last key read before the transaction expired. With this mechanism, I can continue the long range scan.
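For reference, here is a minimal sketch of that retry-from-last-key pattern using only the FDB Java bindings. The names scanAll, process, and keyAfter are illustrative placeholders, not FDB APIs (keyAfter would append a 0x00 byte to produce the immediately following key), and exception unwrapping is simplified:

```java
// Sketch: resumable range scan with the plain FDB Java bindings.
void scanAll(Database db, byte[] begin, byte[] end) {
    byte[] resumeFrom = begin;          // where the next transaction starts
    boolean done = false;
    while (!done) {
        try (Transaction tr = db.createTransaction()) {
            try {
                for (KeyValue kv : tr.getRange(
                        KeySelector.firstGreaterOrEqual(resumeFrom),
                        KeySelector.firstGreaterOrEqual(end))) {
                    process(kv);                          // app-specific handling
                    resumeFrom = keyAfter(kv.getKey());   // continuation = last key + 0x00
                }
                done = true;                              // scanned the whole range
            } catch (FDBException e) {
                if (e.getCode() != 1007) {                // 1007 = transaction_too_old
                    throw e;
                }
                // fall through: a fresh transaction resumes from `resumeFrom`
            }
        }
    }
}
```

Since this scan is read-only, no commit is needed; each transaction is simply closed and a new one picks up where the last one left off.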

I am trying to look into the Record Layer Java library to see how it systematically exposes APIs (with continuations) and composable iterators (or cursors), so that I can compose iterator-based pipeline processing for the key-values retrieved from FDB getRange() queries.

In the FAQ section, https://github.com/FoundationDB/fdb-record-layer/blob/master/docs/FAQ.md, I see the statement “All streaming operations in the Record Layer support continuations… so restarting the same operation with a new transaction and a continuation can allow the operation to continue where it left off.”

My questions are:

(1) Does the Record Layer provide APIs that handle “restarting the same operation with a new transaction”? If so, can I get a link to the code in the Record Layer that handles this?

(2) Or will the application need to catch the exception and start a new transaction, along with the “continuation” object retrieved from the failed transaction? If so, what will the application code look like?

(3) With iterator (or cursor) chaining in a processing pipeline, if during one iterator’s processing the transaction becomes too old, the exception gets raised, and the transaction gets restarted (either by the Record Layer or by the application), how can the remaining chained iterators (cursors) be resumed after the transaction restarts? My understanding is that an iterator (and maybe a cursor as well) is tied to a transaction object, with no way to resume from a new transaction.

(1) Does the Record Layer provide APIs that handle “restarting the same operation with a new transaction”? If so, can I get a link to the code in the Record Layer that handles this?

There isn’t a generic API for continuing an arbitrary operation, but all streaming operations individually accept a continuation that causes them to resume streaming at the point captured by the continuation from the previous operation. For example:
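A record scan on an FDBRecordStore accepts a continuation roughly like this (a sketch; the range shown is illustrative, and the exact overloads should be checked against your Record Layer version):

```java
// `continuation` is null on the first call, or the bytes saved from a
// previous cursor's RecordCursorContinuation when resuming.
RecordCursor<FDBStoredRecord<Message>> cursor = recordStore.scanRecords(
        TupleRange.allOf(Tuple.from("my", "prefix")),  // the range to scan
        continuation,
        ScanProperties.FORWARD_SCAN);
```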

Without a continuation, it will start scanning at the beginning of the specified range (or the end, depending on direction). With a continuation, it will resume at the point the continuation represents. It is important that the arguments to the operation be identical from call to call, with the exception of the continuation itself.

(2) Or will the application need to catch the exception and start a new transaction, along with the “continuation” object retrieved from the failed transaction? If so, what will the application code look like?

The continuation isn’t exposed in the exception itself, but all streaming operations return a RecordCursor. Its .onNext() method returns a RecordCursorResult object. This encapsulates the next fetched value (if present), the reason the scan stopped, called the NoNextReason (if no value is present to be fetched), and a RecordCursorContinuation that will continue the scan immediately following the value just returned.
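Consumed in a loop, that looks roughly like the following sketch (process is a placeholder for your own handling; cursor is any RecordCursor from a streaming operation):

```java
RecordCursorResult<FDBStoredRecord<Message>> result = cursor.onNext().join();
byte[] lastContinuation = null;
while (result.hasNext()) {
    process(result.get());                                   // app-specific handling
    lastContinuation = result.getContinuation().toBytes();   // resumes just after this value
    result = cursor.onNext().join();
}
RecordCursor.NoNextReason reason = result.getNoNextReason();
// If reason.isSourceExhausted(), the scan is truly finished;
// otherwise, restart the same operation with `lastContinuation`.
```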

There are two strategies you can take to handle the continuations:

  1. Always keep the continuation around from the most recently returned value. After receiving your exception, restart the operation with that continuation.
  2. Set limits in the ExecuteProperties that restrict the scan to some amount that makes it unlikely to hit the exception, such as limiting the scan to a set number of milliseconds. Then, when the scan ends and the NoNextReason indicates that there are more pending results, use the current continuation to restart the operation.

Here is an example from one of the unit tests using a query along with continuations and limits:
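In that spirit, here is a sketch of strategy 2: restarting a query with a time limit until the source is exhausted. The names openStore, plan, and process are application-specific placeholders, and the method names should be verified against your Record Layer version:

```java
byte[] continuation = null;
boolean done = false;
while (!done) {
    try (FDBRecordContext context = database.openContext()) {
        FDBRecordStore store = openStore(context);       // app-specific setup
        ExecuteProperties props = ExecuteProperties.newBuilder()
                .setTimeLimit(2_000)                     // stay well under 5 seconds
                .build();
        try (RecordCursor<FDBQueriedRecord<Message>> cursor =
                     store.executeQuery(plan, continuation, props)) {
            RecordCursorResult<FDBQueriedRecord<Message>> result = cursor.onNext().join();
            while (result.hasNext()) {
                process(result.get());                   // app-specific handling
                result = cursor.onNext().join();
            }
            continuation = result.getContinuation().toBytes();
            // TIME_LIMIT_REACHED (and other out-of-band reasons) mean "resume";
            // SOURCE_EXHAUSTED means the scan is complete.
            done = result.getNoNextReason().isSourceExhausted();
        }
    }
}
```

Because each pass runs in its own short transaction, no single transaction ever approaches the 5-second limit.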

There is also a cursor designed to make this sort of iteration easier, called the AutoContinuingCursor. Its general principle is that you hand it a generator (a lambda) that takes an FDBRecordContext and a continuation, and it will call it repeatedly, as necessary, to continue your operation. There is an example of using it in the AutoContinuingCursorTest (sorry, it doesn’t have many examples of its use). It also takes advantage of the FDBDatabaseRunner, which helps with retrying failed operations.
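A sketch of its use (the generator signature is an assumption based on the test; openStore is an application-specific placeholder):

```java
try (FDBDatabaseRunner runner = database.newRunner()) {
    RecordCursor<FDBStoredRecord<Message>> cursor = new AutoContinuingCursor<>(
            runner,
            (context, continuation) ->
                    openStore(context)                   // app-specific setup
                            .scanRecords(continuation, ScanProperties.FORWARD_SCAN));
    // Iterate `cursor` as usual; it opens new contexts under the hood
    // and threads the continuation through each generator call.
}
```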

@scgray: thank you very much for the detailed explanation and code illustration! The following is what I learned from looking into record layer code based on the code examples that you showed above:

(1) The exception for “transaction too old” is defined as a retryable exception in FDBExceptions.java.
(2) FDBDatabaseRunner, defined in FDBDatabaseRunnerImpl.java, has a private class called “RunRetriable”, and the whole retry logic is implemented in the method called “handle”. In particular,
https://github.com/FoundationDB/fdb-record-layer/blob/2432438481709b7108eb800d30adbc46544dc2a3/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/FDBDatabaseRunnerImpl.java#L243.
(3) FDBDatabaseRunner’s runAsync, https://github.com/FoundationDB/fdb-record-layer/blob/2432438481709b7108eb800d30adbc46544dc2a3/fdb-record-layer-core/src/main/java/com/apple/foundationdb/record/provider/foundationdb/FDBDatabaseRunnerImpl.java#L305, runs in an “AsyncUtil.whileTrue()” loop in which each iteration creates a new transaction object and invokes the “handle” method described above.

Following your suggestion, it seems that for a range scan, what needs to be specified is the scan limit, with no need for a scan time limit, as the 5-second limit will lead to the transaction-too-old exception, which is a retryable exception and then gets handled automatically by FDBDatabaseRunner (with a configurable maximum number of attempts).

Is the flow that I described above correct for handling the transaction-too-old exception with the Record Layer?

Following the transaction retry logic above, what about my third question raised earlier, on how to handle iterator (or cursor) chaining in a processing pipeline? Let’s use the example of FlatMapPipelinedCursor, which accepts two cursors. Suppose the first cursor encounters a transaction restart internally: (1) does the second cursor share the same transaction object defined in the FoundationDB runner? (2) If so, how is the second cursor forced to restart, given that it is running fine without exception when the first cursor triggers the exception?

Hi @jltz. I think the problem that you’ll hit with the AutoContinuingCursor and its use of the FDBDatabaseRunner is that the retry loop will retry the current transaction on transaction_too_old; it will not move forward to the next point in the continuation. Part of the reason for this is that FDBDatabaseRunner doesn’t know what work you were doing, and if the transaction contained writes, then by hitting that error you would not have committed the writes and, thus, should not be moving forward (perhaps the AutoContinuingCursor could be updated to change that behavior, but technically, you can do writes within its transactions). So, I would suggest always specifying a time limit if you want to use the AutoContinuingCursor, or you can roll your own similar logic that treats transaction_too_old differently.

how to handle iterator (or cursor) chaining in a processing pipeline? Let’s use the example of FlatMapPipelinedCursor, which accepts two cursors

The proper way to create a FlatMapPipelinedCursor is with RecordCursor.flatMapPipelined, like so:

RecordCursor<X> cursor = RecordCursor.flatMapPipelined(
    (continuation) -> { ...create outer cursor from continuation... },
    (outerValue, continuation) -> { ...create inner cursor from continuation... },
    continuation,  /* null if you don't have a continuation yet */
    10 /* pipeline size */);

When iterating, the RecordCursor will produce continuations that represent the combined continuation state of the outer and inner cursors. This construction method will take this combined continuation, decompose it into the two outer/inner continuations, and call the lambdas provided to reconstitute those cursors in the correct state.
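Concretely, if you wrap that construction in a helper taking the combined continuation, resuming in a new transaction is just a matter of calling the helper again. In this sketch, Outer, Inner, openOuter, and openInner are hypothetical placeholders for your own cursor types and factories:

```java
// Hypothetical helper: works both for the first run (combinedContinuation == null)
// and for every resumption with a previously saved combined continuation.
RecordCursor<Inner> buildPipeline(FDBRecordContext context, byte[] combinedContinuation) {
    return RecordCursor.flatMapPipelined(
            outerCont -> openOuter(context, outerCont),                        // outer cursor
            (outerValue, innerCont) -> openInner(context, outerValue, innerCont),  // inner cursor
            combinedContinuation,
            10);
}
```

After a transaction_too_old, you would open a fresh FDBRecordContext, pass the last combined continuation to buildPipeline, and both legs of the pipeline resume in their correct states.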

An example can be found here: https://github.com/FoundationDB/fdb-record-layer/blob/f034a282e40428c450e82cd72d902fd73b6e5cdb/fdb-record-layer-core/src/test/java/com/apple/foundationdb/record/RecordCursorTest.java#L514

Suppose the first cursor encounters a transaction restart internally,

That would imply the outer cursor was in a different transaction than the inner, which seems like a bad idea; however, I think it wouldn’t really matter. The overall state of the FlatMapPipelinedCursor is really just the combined continuations of the two cursors, and the fact that those continuations came from different transactions shouldn’t really matter (transactional consistency concerns aside).

(1) does the second cursor share the same transaction object defined in FoundationDB Runner?

It depends on how you constructed the outer/inner cursors. If each of them is using its own FDBDatabaseRunner, then they would each have their own transaction and their own retry loop. Again, I think this is probably a Bad Idea™, but I suppose one could invent a reason to want to do it. But if you are encapsulating the construction of the FlatMapPipelinedCursor itself, and thus the inner/outer cursors, within a runner, then provided you pass both of those cursors the same context, they are in the same transaction (which, IMHO, is what you should be doing).

(2) If so, how is the second cursor forced to restart, given that it is running fine without exception when the first cursor triggers the exception?

The continuation returned from the FlatMapPipelinedCursor is the combination of the continuations of the most recently returned values of the inner/outer cursors; thus it is always a valid continuation, whether or not either of the cursors goes on to throw an exception on the next fetch.

Regarding automatically restarting cursors on transaction_too_old:

I think the things discussed in this thread are an accurate description of the current state of cursors, but I’ve also thought about whether we should add something that allows for automatically executing cursors across (retriable) exception boundaries.

I could see it happening one of two ways:

  1. We add a new NoNextReason that indicates that a retriable exception was hit; if we hit transaction_too_old and friends, we return that instead of propagating the exception up, and each cursor terminates (just as if we had hit a limit). Then we can use the existing resumption logic to resume the cursor.
  2. We add an option to the AutoContinuingCursor to catch retriable exceptions. It would need to keep track of all intermediate continuations in this case (but I think it already does), and then if it hits an error, it can resume the inner cursor from the last continuation just as if it had hit a limit.

As is mentioned above, you’d definitely only want one auto-continuing cursor in your cursor chain to avoid having multiple cursors using different databases (e.g., you wouldn’t want the children of a FlatMapPipelinedCursor to be AutoContinuingCursors…), and I suppose the semantics of this make the most sense when the operations are read-only. (You might be able to make something like this work with read/write transactions, but it would be messy.)

I’m writing a large range scan, so I split the large transaction into many small transactions. For each small transaction, I apply a function to modify the records returned by that transaction. I’m wondering: is there any chance that a modified record is re-visited in later transactions?

One more question: what if I’m in the middle of a large range scan and some new records are added to the database? Will I visit these new records?

For all of the answers below, I am assuming that you are talking about a simple range scan over a single range of keys. The answers get more complex if we are talking about operations involving more complex cursor operations, such as joining different scans with mapped pipelines.

I’m wondering: is there any chance that a modified record is re-visited in later transactions?

If the modification changes the row key such that the record will appear logically later in the scan, then yes, you could see it again. If you are not modifying the row key, then no, you will not see it again.

what if I’m in the middle of a large range scan and some new records are added to the database? Will I visit these new records?

If new records are added to the database in a part of the key range that you have not yet reached in your scan, then yes, you will see them. Records added in ranges you have already visited will not be seen and, since transactions are serializable in FDB, you will not see records added within the range currently being scanned if they are added after your transaction has begun.

At its heart, a scan is just a range scan: it simply scans a range of keys in either an ascending or descending direction. A continuation is really just a record of the last key you read, so when you resume with a continuation, you are really just asking to start a new scan immediately after (for ascending scans) or before (for descending scans) the requested key. When you think about things in these terms, it gets easier to reason about which keys you may or may not see as you continue your scan.
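In raw key-selector terms (FDB Java bindings), resuming an ascending scan from a saved last key looks like this sketch, where lastKeyRead and endKey are placeholders for your saved state:

```java
// Start strictly after the last key already returned (ascending scan);
// a descending resume would instead end strictly before it.
AsyncIterable<KeyValue> resumed = tr.getRange(
        KeySelector.firstGreaterThan(lastKeyRead),
        KeySelector.firstGreaterOrEqual(endKey));
```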
