Transaction Cancelled Errors

We’ve been seeing a lot of Transaction Cancelled errors (Operation aborted because transaction was cancelled: 1025) lately. We have 6 Java clients, each committing transactions of ~500 KB in a loop; the transactions are identical in shape, but each one is unique. We do not set any timeout options on these transactions. We were running this as a performance test, and the write rate observed was ~25000 kHz.

Our cluster configuration:

  • 5 machines
  • triple replication
  • 1 SSD per machine

The process class configuration:

  • Machine 1: 2 storage + 1 proxy + 1 stateless
  • Machine 2: 2 storage + 1 proxy + 1 stateless
  • Machine 3: 2 storage + 1 proxy + 1 stateless + 1 log
  • Machine 4: 2 storage + 1 log + 1 stateless
  • Machine 5: 2 storage + 1 log + 1 stateless

Interestingly, our logs suggest that some of these transactions had already committed when the error was raised.

Any insight into this would be helpful!

An operation can throw transaction_cancelled for a few reasons:

  1. You call cancel on your transaction.
  2. The transaction is destroyed while an operation is outstanding. For example, if you start a read and don’t wait for the result, closing the transaction may cause that operation to throw this error.
  3. You reset your transaction (I don’t think this is possible in Java).
  4. You retry a transaction that has outstanding operations (using onError or the default retry loops).
  5. If you happen to be running an API version before 410, then I think commit could put the transaction into a state where any subsequently started operations may be cancelled.
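To make case 2 concrete, here is a toy model of it. It uses only java.util.concurrent.CompletableFuture and is not the FoundationDB bindings: ToyTransaction, ToyCancelledException, finishReads, and observe are hypothetical names, and the only behavior borrowed from the list above is that closing a transaction fails any operation that is still pending with error 1025.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical stand-in for the error an operation sees; not the real client exception.
class ToyCancelledException extends RuntimeException {
    final int code = 1025; // transaction_cancelled
    ToyCancelledException() {
        super("Operation aborted because transaction was cancelled");
    }
}

// Toy model (an assumption, not the real client): a transaction that owns its pending reads.
class ToyTransaction implements AutoCloseable {
    private final List<CompletableFuture<String>> outstanding = new ArrayList<>();

    // Starts a read that stays pending until finishReads() simulates the reply.
    CompletableFuture<String> get(String key) {
        CompletableFuture<String> f = new CompletableFuture<>();
        outstanding.add(f);
        return f;
    }

    // Simulates the database answering every pending read.
    void finishReads(String value) {
        for (CompletableFuture<String> f : outstanding) f.complete(value);
    }

    // Closing fails whatever is still pending, mirroring case 2.
    @Override
    public void close() {
        for (CompletableFuture<String> f : outstanding) {
            f.completeExceptionally(new ToyCancelledException());
        }
        outstanding.clear();
    }
}

class CancelOnCloseDemo {
    // Returns the error code observed by the read, or 0 if the read succeeded.
    static int observe(boolean waitBeforeClose) {
        CompletableFuture<String> read;
        try (ToyTransaction tr = new ToyTransaction()) {
            read = tr.get("some-key");
            if (waitBeforeClose) {
                tr.finishReads("value");
                read.join(); // result consumed while the transaction is still open
            }
        } // close() runs here
        try {
            read.join();
            return 0;
        } catch (CompletionException e) {
            return ((ToyCancelledException) e.getCause()).code;
        }
    }

    public static void main(String[] args) {
        System.out.println("read abandoned before close: " + observe(false));
        System.out.println("read awaited before close:   " + observe(true));
    }
}
```

In this model the abandoned read reports 1025 and the awaited read reports 0. If your application shows the same pattern, the usual remedy is to wait on every future you start before the transaction goes out of scope.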

If your transactions are committing successfully and some operation then throws this error, I suspect you are hitting #2. When you commit a transaction, though, the commit waits for outstanding reads to complete before it can succeed. So to trigger this case, you would need to be starting operations after the commit has started, and it is those operations that would see the error.
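The ordering described above can be sketched with another toy model. Again, this is plain CompletableFuture code and not the FoundationDB client: ToyCommitTransaction, answerReads, and LateOperationDemo are hypothetical names, and the model assumes that an operation started after commit has begun is always cancelled, which is a simplification of "may be cancelled".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model only; none of these names come from the FoundationDB bindings.
class ToyCommitTransaction {
    private final List<CompletableFuture<String>> reads = new ArrayList<>();
    private boolean commitStarted = false;

    // Reads started before commit stay pending; reads started after it fail at once.
    CompletableFuture<String> get(String key) {
        CompletableFuture<String> f = new CompletableFuture<>();
        if (commitStarted) {
            // Assumption: modeled as always cancelled; the real client may behave differently.
            f.completeExceptionally(new IllegalStateException("transaction_cancelled (1025)"));
        } else {
            reads.add(f);
        }
        return f;
    }

    // Simulates the database answering every read that was pending.
    void answerReads(String value) {
        for (CompletableFuture<String> f : reads) f.complete(value);
    }

    // The commit future finishes only after all earlier reads have finished.
    CompletableFuture<Void> commit() {
        commitStarted = true;
        return CompletableFuture.allOf(reads.toArray(new CompletableFuture<?>[0]));
    }
}

class LateOperationDemo {
    static String run() {
        ToyCommitTransaction tr = new ToyCommitTransaction();
        CompletableFuture<String> early = tr.get("a");
        CompletableFuture<Void> commit = tr.commit();
        CompletableFuture<String> late = tr.get("b"); // started after commit began

        boolean commitWaited = !commit.isDone(); // still pending: the early read is unanswered
        tr.answerReads("v");
        commitWaited = commitWaited && commit.isDone(); // finished once the read was answered

        return "commitWaited=" + commitWaited
                + " earlyRead=" + early.join()
                + " lateCancelled=" + late.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is the ordering: the early read completes normally and the commit waits for it, while only the read issued after commit() sees the cancellation. That is the pattern to look for in the client code.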

Is that something that is plausibly happening in your application?