What does "Transaction Is Retryable" mean, in particular for handling the "Transaction Too Old" error?

I am using the FDB Java binding to develop a Java application. The application invokes getRange() and checks any FDBException raised. If “isRetryable” on the raised FDBException returns true, the application uses the same FDBTransaction object to invoke getRange() again. The application bounds the number of retries by a given time budget.

I just found that my application encounters “Transaction Too Old” (that is, “Transaction is too old to perform reads or be committed”, error code 1007), likely due to the heavy workload injected. According to this function in fdb_c.cpp:

fdb_bool_t fdb_error_predicate( int predicate_test, fdb_error_t code ) {
    return code == error_code_not_committed ||
           code == error_code_transaction_too_old || …
}

So the “transaction too old” error is retryable. With my application logic, my code keeps invoking getRange(), and every time the same “transaction too old” error is returned, until the code runs out of its retry time budget (say, 10 seconds). Clearly, my retry logic handles this particular error incorrectly.

I checked the entire Java binding package in 6.0.15 and 6.1.8 to find the right way to handle such retryable errors. However, nothing in the Java binding explicitly calls FDBException’s isRetryable method.

What I found is the run() and runAsync() methods defined in FDBDatabase.java. Whenever an FDBException is raised, the associated FDBTransaction object invokes its onError() method, which returns a new Java FDBTransaction object (wrapping the native C transaction object’s pointer, via the “transfer” method). Calls continue against the newly created Java transaction object until no runtime exception is raised. However, neither run() nor runAsync() invokes “isRetryable” at all.

If I follow the retry logic of run() or runAsync() and transfer the native C++ object pointer from the old transaction to the new one to handle “transaction too old”, the retry would never succeed, because it uses the same underlying native C transaction object. Once “transaction too old” happens on that native transaction object, it will always return the same error code.

My questions are the following:

(1) In FDBDatabase’s run() or runAsync() method, when the “transaction too old” error occurs, what is the termination logic that ends the method call, given that each retry will keep returning the same error and thus keep raising the same FDBException (a subclass of RuntimeException)?

(2) What is the recommended way to handle a retryable transaction, with some pseudocode to show the implementation logic?

(3) Why not just create a brand-new transaction object to handle, for example, the getRange() call, without transferring the native C pointer as onError(.) does? My view is that all that reusing the old C transaction object really saves is a getReadVersion(.) call. But getReadVersion() is typically very fast, on the order of ~1 ms or less. For error handling (the rare case), saving such a short-latency call is not worth it.

An error being retryable indicates that the entire transaction can be retried in response to the error. The standard way to do this is with the built-in retry loops (e.g. run and runAsync in Java) that will reset the transaction and start over if you encounter a retryable error.

You can also roll your own using Transaction.onError, which checks the retryability of an error, introduces a backoff delay, and resets the transaction before returning (or, if you get an error that isn’t retryable, it just rethrows the error). See https://apple.github.io/foundationdb/developer-guide.html?highlight=retry%20loop#transaction-retry-loops for an example of what a retry loop looks like implemented in terms of on_error in Python.
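To make the control flow concrete, here is a minimal sketch of a retry loop written in terms of an onError-style call. It is a toy model using only the JDK, not the binding's code: RetryLoopSketch, SketchTransaction, and SketchError are hypothetical stand-ins, and the model's predicate only knows about 1007 (transaction_too_old) and 1020 (not_committed).

```java
import java.util.function.Function;

/** Toy model of a retry loop built on an onError-style call (all names hypothetical). */
final class RetryLoopSketch {

    /** Stand-in for FDBException: carries an error code and a retryability predicate. */
    static final class SketchError extends RuntimeException {
        final int code;

        SketchError(int code) {
            super("simulated error " + code);
            this.code = code;
        }

        // Assumption: only these two codes count as retryable in this model.
        boolean isRetryable() {
            return code == 1007 || code == 1020;
        }
    }

    /** Stand-in for a transaction; counts how many times it has been reset. */
    static final class SketchTransaction {
        int resets = 0;

        // Models onError: rethrow a non-retryable error, otherwise reset and
        // hand back a usable transaction. The real call also waits out a
        // backoff delay before returning; the model skips the wait.
        SketchTransaction onError(RuntimeException e) {
            if (!(e instanceof SketchError) || !((SketchError) e).isRetryable()) {
                throw e;
            }
            resets++;
            return this;
        }
    }

    /** Runs the whole body again after every retryable failure. */
    static <T> T run(Function<SketchTransaction, T> body) {
        SketchTransaction tr = new SketchTransaction();
        while (true) {
            try {
                return body.apply(tr);
            } catch (RuntimeException e) {
                tr = tr.onError(e); // throws if the error is not retryable
            }
        }
    }
}
```

Note that the loop itself never asks whether the error is retryable. That check lives inside onError, which is why an explicit isRetryable call does not show up in a loop of this shape.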

It’s also possible to write your own retry loop without using onError at all by appropriately handling the various error cases, which I think is part of what the predicates were intended to help with. I think if you go this route, the standard retry and timeout options may not function as intended anymore. I would advise against this approach unless you have a particular use-case that can’t be achieved without it.

In Java, we have an extra step involved with the native object handoff. The point of this is to avoid cases where a transaction gets reset while some holder of the Java object is unaware. By excising the native object from the Java transaction, we prevent unintended usages by these other holders. For example, one may have an asynchronous task that reads a range from a transaction and then writes a result into it. If the transaction was reset while that asynchronous task were running, it could inadvertently read and write into the newly reset transaction based on reads done in an old transaction, which is generally not what you’d want.
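The handoff can be pictured with a small JDK-only model. This is a sketch under assumptions, not the binding's implementation: HandoffSketch, NativeTr, and JavaTr are hypothetical names, a write counter stands in for the native transaction's state, and the exception type thrown on stale use is chosen for illustration.

```java
/** Toy model of the native-object handoff between Java wrappers (names hypothetical). */
final class HandoffSketch {

    /** Stand-in for the native transaction; the counter stands in for its state. */
    static final class NativeTr {
        int writes = 0;

        void reset() {
            writes = 0;
        }
    }

    /** Stand-in for the Java wrapper that holds the native pointer. */
    static final class JavaTr {
        private NativeTr ptr;

        JavaTr(NativeTr ptr) {
            this.ptr = ptr;
        }

        void set() {
            pointer().writes++;
        }

        int pendingWrites() {
            return pointer().writes;
        }

        // Models onError: reset the native object, then move it to a new
        // wrapper and empty this one so stale holders cannot touch it.
        JavaTr onError() {
            NativeTr nativeTr = pointer();
            nativeTr.reset();
            ptr = null;
            return new JavaTr(nativeTr);
        }

        private NativeTr pointer() {
            if (ptr == null) {
                throw new IllegalStateException("native transaction was handed off");
            }
            return ptr;
        }
    }
}
```

In this model, an asynchronous task that still holds the old JavaTr fails loudly on its next call instead of silently writing into the reset transaction.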

To answer your question about why we don’t just create a new transaction – there is some state that we want to maintain for each transaction through the retries. In particular, this would be information about the current backoff delay (it grows exponentially) as well as state for the timeout and retry limit, as alluded to above. Using onError properly retains and advances this state. Also, every retry requires us to get a new read version, so this approach doesn’t help us avoid a getReadVersion call.
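As a rough illustration of the state that survives a reset, the sketch below models an exponentially growing backoff plus a retry limit. RetryStateSketch and its parameters are hypothetical, the numbers are illustrative rather than the client's actual defaults, and the model leaves out both the timeout and any randomization of the delay.

```java
/** Toy model of per-transaction retry state kept across resets (names hypothetical). */
final class RetryStateSketch {
    private long backoffMillis;
    private final long maxBackoffMillis;
    private final int retryLimit; // a negative value means "no limit" in this model
    private int retries = 0;

    RetryStateSketch(long initialBackoffMillis, long maxBackoffMillis, int retryLimit) {
        this.backoffMillis = initialBackoffMillis;
        this.maxBackoffMillis = maxBackoffMillis;
        this.retryLimit = retryLimit;
    }

    /**
     * Returns how long to wait before the next attempt and advances the state,
     * or rethrows the cause once the retry limit has been used up.
     */
    long nextDelayMillis(RuntimeException cause) {
        if (retryLimit >= 0 && retries >= retryLimit) {
            throw cause;
        }
        retries++;
        long delay = backoffMillis;
        // Double the delay for next time, up to the cap.
        backoffMillis = Math.min(backoffMillis * 2, maxBackoffMillis);
        return delay;
    }
}
```

A brand-new transaction per attempt would start from the initial delay and a zero retry count every time, which is the state loss described above.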

Thanks for the detailed explanation. Could you elaborate on the actions involved in “reset transaction” in the current implementation of FDBTransaction’s onError() method? What I see in the onError() implementation is a transfer of the underlying native transaction object pointer to the new Java transaction object. I did not see any reset-related action. So I assume that the real transaction object managed at the C++ level is not replaced with a new one.

But if a new C++ transaction object is not created, how can the “Transaction too old” error be fixed by invoking the same getRange() call on the same C++ transaction object via the Java wrapper?

The onError implementations in the bindings are just wrappers of the actual call that happens on the native transaction. In Java, you can see that call here: https://github.com/apple/foundationdb/blob/27c322d55efd99dbd14e669d23551455693eb49e/bindings/java/src/main/com/apple/foundationdb/FDBTransaction.java#L539

This native call is what resets the transaction back to a mostly original state.

Does “resetting the transaction back to a mostly original state” include getting a new read version for the reset transaction (in the C++ implementation)? Otherwise, the reset transaction would still be too old.

Resetting the transaction can be thought of as being very similar to creating a new one. The only differences are with things like current backoff state and transaction timeout/retry state.

This means that a reset transaction does not have a read version at all (just like a newly created transaction) and will get a new one once you call getReadVersion or try to read or commit with it.
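A tiny model of that behavior, assuming nothing about the real client beyond what is described above: ReadVersionSketch is a hypothetical class, and the LongSupplier stands in for asking the cluster for a read version.

```java
import java.util.function.LongSupplier;

/** Toy model: a read version is fetched lazily and dropped on reset (names hypothetical). */
final class ReadVersionSketch {
    private final LongSupplier cluster; // stand-in for the source of read versions
    private Long readVersion = null;    // null means "no read version yet"

    ReadVersionSketch(LongSupplier cluster) {
        this.cluster = cluster;
    }

    boolean hasReadVersion() {
        return readVersion != null;
    }

    // The first call fetches a version; later calls reuse it until reset() is called.
    long getReadVersion() {
        if (readVersion == null) {
            readVersion = cluster.getAsLong();
        }
        return readVersion;
    }

    // Models the reset: the old (possibly too old) version is forgotten.
    void reset() {
        readVersion = null;
    }
}
```

In the model, the attempt after a reset reads at a newer version, which is why retrying after “transaction too old” does not hit the same error for the same reason.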