Returning conflicting read/write conflict ranges after transaction conflict

Debugging transaction conflicts is currently fairly difficult. The best way I’ve found so far (other than thinking really hard about what happened) is to (1) enable client trace logs on all operations, (2) enable transaction logging on all transactions, and (3) process the client logs and manually run the conflict algorithm over all of the transactions found (which requires having every transaction available to get it correct). This is problematic because, especially in real environments, one might not have access to all conflict ranges from all transactions (that can be a lot of data for reasonable workloads).
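To make step (3) concrete, here is a minimal sketch of that offline check, assuming the logs have already been parsed into per-transaction records. The `LoggedTxn` shape and the function names are hypothetical and not part of any FDB tooling. The check is also simplified: it only asks whether a write committed after the failed transaction’s read version overlaps one of its read ranges, whereas a full check would also bound the window by the failed transaction’s own commit version.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

KeyRange = Tuple[bytes, bytes]  # half-open interval [begin, end)


@dataclass
class LoggedTxn:
    """One transaction as reconstructed from client logs (hypothetical shape)."""
    name: str
    read_version: int
    commit_version: Optional[int] = None  # None if the transaction never committed
    read_ranges: List[KeyRange] = field(default_factory=list)
    write_ranges: List[KeyRange] = field(default_factory=list)


def overlaps(a: KeyRange, b: KeyRange) -> bool:
    """Return True if two half-open key ranges share at least one key."""
    return a[0] < b[1] and b[0] < a[1]


def find_conflicts(failed: LoggedTxn, others: List[LoggedTxn]):
    """Return (writer name, read range, write range) triples that could explain the failure."""
    hits = []
    for other in others:
        # Only writes committed after our read version can invalidate our reads.
        if other.commit_version is None or other.commit_version <= failed.read_version:
            continue
        for read_range in failed.read_ranges:
            for write_range in other.write_ranges:
                if overlaps(read_range, write_range):
                    hits.append((other.name, read_range, write_range))
    return hits


if __name__ == "__main__":
    reader = LoggedTxn("reader", read_version=100, read_ranges=[(b"a", b"c")])
    newer = LoggedTxn("newer", 95, commit_version=110, write_ranges=[(b"b", b"b\x00")])
    older = LoggedTxn("older", 80, commit_version=90, write_ranges=[(b"a", b"b")])
    print(find_conflicts(reader, [newer, older]))
```

Note that the result is only as complete as the list of transactions passed in, which is exactly the weakness described above: a single missing log entry can hide the real culprit.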

One proposed solution, which I’d like to hear people’s thoughts on, is to return conflict information from the resolver back to the client. I believe the heart of the conflict resolution happens here:

At this point, the resolver knows (1) the failing read conflict range from this transaction, (2) the read version, (3) the commit version at which the range was changed, and (4) an upper bound of the mutation range that caused the operation to fail. I think all of those would be useful in debugging transaction conflicts (except maybe the transaction read version, as the client should already know that).
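As a rough illustration of what such a payload might look like on the client side, here is a sketch of a record holding those four pieces of information. The `ConflictInfo` name and its field names are made up for this post, and I am reading "upper bound of the mutation range" as a key range that contains the offending mutation; nothing like this exists in the bindings today.

```python
from dataclasses import dataclass
from typing import Tuple

KeyRange = Tuple[bytes, bytes]  # half-open interval [begin, end)


@dataclass(frozen=True)
class ConflictInfo:
    """Hypothetical payload a resolver could return for a failed commit."""
    failed_read_range: KeyRange       # (1) the read conflict range that failed
    read_version: int                 # (2) the transaction's read version
    conflicting_commit_version: int   # (3) version at which the range changed
    mutation_range_bound: KeyRange    # (4) range known to contain the mutation

    def __post_init__(self):
        # A write can only conflict with a read if it committed after the read version.
        if self.conflicting_commit_version <= self.read_version:
            raise ValueError("conflicting write must be newer than the read version")

    def describe(self) -> str:
        """Return a one-line summary suitable for a log message."""
        begin, end = self.failed_read_range
        return (
            f"read of [{begin!r}, {end!r}) at version {self.read_version} "
            f"was invalidated by a write at version {self.conflicting_commit_version}"
        )


if __name__ == "__main__":
    info = ConflictInfo((b"a", b"c"), 100, 110, (b"a", b"d"))
    print(info.describe())
```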

Knowing the failing read conflict range would be a big help by itself. Then the user can use this information to debug what might have happened based on domain knowledge of their data model. For example, in the Record Layer, knowing which range failed might be useful in determining whether it was a single record in the read set that was updated that caused the failure, or if concurrent writes would have caused a uniqueness constraint to be violated, or if the store header was updated due to some meta-data upgrade.

Knowing the commit version and mutation range can then be useful for correlating what happened with other operations going on at the same time. This is especially true if all transactions are being logged, but they have their uses in other instances as well.

The next question would be how to expose this to the user. The quickest way would be to log it in the client trace logs, maybe only if the user has set the “debug ID” option. Then it is up to the user to actually enable that logging. This has the disadvantage that the information must then be fished out of the log, but it would be available somewhere and wouldn’t require any API changes at the FDB client level.

All else being equal, the “best” option would be to make the exception class in each binding beefier to include additional information. For transaction conflicts, you could then expose this information as methods on the exception (maybe lazily marshaling from the C client?). This doesn’t quite gel with how those classes are structured at the moment though.

You could also imagine methods on the transaction object that allowed the user to query the transaction for what happened (and then do…something when the transaction hasn’t committed yet or the transaction succeeded). A higher-level layer on top of the key-value store bindings could then package up the information from those methods with a transaction conflict exception so that the user could ask the exception for more details about what happened.
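Here is a sketch of how that layer-level packaging might look. Everything in it is hypothetical: `get_conflict_info()` is the imagined query method on the transaction, `ConflictError` is a stand-in for the binding’s own error type (in FDB a conflict surfaces as error code 1020, `not_committed`), and `FakeTransaction` exists only so the example runs without a cluster.

```python
class ConflictError(Exception):
    """Stand-in for the binding's error raised when a commit conflicts."""
    code = 1020  # not_committed


class DetailedConflictError(Exception):
    """Layer-level exception that carries the conflict details with it."""

    def __init__(self, cause, details):
        super().__init__(f"transaction conflict: {details}")
        self.cause = cause
        self.details = details


def commit_with_details(tr):
    """Commit `tr`; on conflict, re-raise with whatever details the transaction reports."""
    try:
        tr.commit()
    except ConflictError as err:
        # Hypothetical API: ask the transaction what the resolver told it.
        details = tr.get_conflict_info()
        raise DetailedConflictError(err, details) from err


class FakeTransaction:
    """Test double that always conflicts and reports a canned failing range."""

    def commit(self):
        raise ConflictError("commit rejected by resolver")

    def get_conflict_info(self):
        return {"failed_read_range": (b"a", b"c"), "conflicting_commit_version": 110}


if __name__ == "__main__":
    try:
        commit_with_details(FakeTransaction())
    except DetailedConflictError as err:
        print(err.details["failed_read_range"])
```

The appeal of doing this in a layer is that the key-value bindings keep their current exception classes, while the layer decides how much of the detail to surface to its own users.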