Proper transaction retry loops

Howdy,

I’ve finally gotten around to playing with the new client buggify feature and its managed to turn up an issue I’m not immediately 100% sure on how to address.

In the Erlang bindings we have a loop [1] that looks like such:

do_transaction(?IS_TX = Tx, UserFun) ->
    try
        Ret = UserFun(Tx),
        case is_read_only(Tx) andalso not has_watches(Tx) of
            true -> ok;
            false -> wait(commit(Tx))
        end,
        Ret
    catch error:{erlfdb_error, Code} ->
        put(?ERLFDB_ERROR, Code),
        wait(on_error(Tx, Code)),
        do_transaction(Tx, UserFun)
    end.

This was based off the similar retry loop in the Python bindings [2]. The interesting issue that buggify has turned up is the wait(on_error(Tx, Code)) will occasionally raise 1007/1009 errors. And given how that loop works it basically just short circuits the retry logic since it’s not protected by that try/catch.

My question is whether this a synthetic error or if there’s really some way that on_error would return a future that’s capable of throwing either of those errors. And if those version errors are real, how would a client be expected to react to such a situation other than some weird recursive retry-retry loop?

[1] https://github.com/apache/couchdb-erlfdb/blob/01f79dbb3938ceac4638752953b84a396d78c0ce/src/erlfdb.erl#L653-L665
[2] https://github.com/apple/foundationdb/blob/5876d694cbfda19cc4229fe4ca85b33d87fb81ea/bindings/python/fdb/impl.py#L264-L270

I was just pointed at the docs by a coworker where the on_error mentions:

If the resulting future is itself an error, destroy the future and FDBTransaction and report the error in an appropriate way.

So I guess that means we should expect the possibility here. But that seems a bit awkward given the nature of 1007/1009 being retryable.

I believe the only reason you should see 1007 or 1009 thrown from on_error is if you have a retry limit set, in which case the last error reported will be thrown when the retry limit is reached.

Otherwise, I don’t believe it is supposed to throw these errors. If the above isn’t your problem, it sounds like something we should investigate.

Good call! That was the issue. I’ll have to add some more error handling for those but I can adjust the tests easily enough to really ratchet up the retry counts when running in buggify mode.