How to handle CLIENT_BUGGIFY returning an error when simulateTimeoutInFlightCommit succeeds?

Hello @andrew.noyes,
I have a small question about CLIENT_BUGGIFY. When we use the C API with CLIENT_BUGGIFY enabled, the C API returns a timed-out error after commit, but simulateTimeoutInFlightCommit had run in the ACTOR model.

Could you suggest how to deal with this kind of problem? The in-flight commit may succeed and change the key's value even though the client receives an error code.

Many thanks for your suggestion!

Reasoning about possible behaviors in this scenario is very tricky. The simplest way to handle this correctly (and what I highly recommend) is to not retry a transaction whose commit future failed with transaction_timed_out. This is also why transaction_timed_out is not retried by the default retry loop.
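As a rough illustration of that rule, here is a minimal sketch in plain C of the decision a caller might make on the error code from a commit future. It does not call the FoundationDB client library. `classify_commit_error` and the `COMMIT_*` names are hypothetical, and the numeric codes are the ones I recall from FoundationDB's error-code table, so please verify them against your client version.

```c
/* Hypothetical helper: decide what to do with the error code of a commit.
 * Numeric codes are assumptions taken from FoundationDB's error-code table;
 * check them against the client version you actually build with. */
enum {
    ERR_TRANSACTION_TOO_OLD   = 1007,
    ERR_NOT_COMMITTED         = 1020,
    ERR_COMMIT_UNKNOWN_RESULT = 1021,
    ERR_TRANSACTION_TIMED_OUT = 1031
};

typedef enum { COMMIT_OK, COMMIT_RETRY, COMMIT_GIVE_UP } commit_action;

commit_action classify_commit_error(int code) {
    switch (code) {
    case 0:
        return COMMIT_OK;
    case ERR_TRANSACTION_TIMED_OUT:
        /* The commit may still be in flight: do not retry. */
        return COMMIT_GIVE_UP;
    case ERR_TRANSACTION_TOO_OLD:
    case ERR_NOT_COMMITTED:
    case ERR_COMMIT_UNKNOWN_RESULT:
        /* Retried here on the assumption that the transaction is
         * idempotent; commit_unknown_result may mean it already committed. */
        return COMMIT_RETRY;
    default:
        /* Unknown codes: be conservative and surface the error. */
        return COMMIT_GIVE_UP;
    }
}
```

In real code you would more likely pass the error to `fdb_transaction_on_error` and let the client decide. The point of the sketch is only that a timed-out commit ends the loop instead of restarting it.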

There was also a bug in the implementation of simulateTimeoutInFlightCommit itself that was only recently fixed (https://github.com/apple/foundationdb/pull/8416), which means that your test can fail even if you are handling this correctly.

The basic problem is that if you get transaction_timed_out from a commit, then you have very little information about what has happened or what can happen. The commit attempt may have already succeeded, may succeed in the future, or may never succeed. The C API internally (before throwing commit_unknown_result) will guarantee that either the commit has already succeeded or will never succeed. Doing this requires getting a response from the cluster though, so it could take an indefinite amount of time. Setting a transaction timeout basically subverts this mechanism, and it kind of has to: the alternative is waiting indefinitely, and avoiding that is the whole point of setting a timeout.
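To make the difference between the two errors concrete, here is a small sketch that encodes which outcomes remain possible after each one. It is only a model of the reasoning above, not client code. The `OUTCOME_*` flags and both function names are hypothetical, and the numeric error codes are assumed from FoundationDB's error-code table.

```c
#include <stdbool.h>

/* Hypothetical flags for what may be true of the last commit attempt. */
enum {
    OUTCOME_ALREADY_COMMITTED = 1 << 0,
    OUTCOME_MAY_COMMIT_LATER  = 1 << 1,
    OUTCOME_NEVER_COMMITS     = 1 << 2
};

/* Error codes below are assumptions; verify against your client version. */
unsigned possible_outcomes(int code) {
    switch (code) {
    case 0:    /* success */
        return OUTCOME_ALREADY_COMMITTED;
    case 1020: /* not_committed */
        return OUTCOME_NEVER_COMMITS;
    case 1021: /* commit_unknown_result: settled, but unknown which way */
        return OUTCOME_ALREADY_COMMITTED | OUTCOME_NEVER_COMMITS;
    case 1031: /* transaction_timed_out: nothing is ruled out */
    default:   /* anything else: assume the worst */
        return OUTCOME_ALREADY_COMMITTED | OUTCOME_MAY_COMMIT_LATER |
               OUTCOME_NEVER_COMMITS;
    }
}

/* True when the commit attempt might still land after the error returns. */
bool may_still_be_in_flight(int code) {
    return (possible_outcomes(code) & OUTCOME_MAY_COMMIT_LATER) != 0;
}
```

The extra "may commit later" bit on transaction_timed_out is what makes a retry unsafe: a second attempt could race with the first one.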

If you want to retry a transaction whose commit failed with transaction_timed_out, the only safe way to do it is to somehow ensure that the last commit attempt is no longer in flight. All mechanisms for doing this require getting a response from the cluster though, so it would have been more efficient and safer to not set a transaction timeout at all and to just wait for the C API to figure this out for you.
