Relax consistency guarantees

Is it possible to relax the consistency guarantees of transactions? From my current understanding, writes must be durable on all replicas before the transaction is committed, thus impacting latency. Snapshot reads are available but they only relax isolation guarantees.

As far as I can tell, your understanding is wrong. The only thing that needs to be durable before committing is the transaction log, which (if I remember correctly) is an append-only log file. Appending to an append-only log file is very fast.
So for failure tolerance, we need to keep not only at least one of the three storage servers alive, but also at least one of the three transaction log servers.

A commit needs to be persisted to disk on multiple tlog servers (the same number as the replication factor) before the transaction can be acknowledged. I am not sure whether any of the storage servers are required to be alive at the time of the commit.

To amortize the cost of transaction commits, proxies batch together multiple transactions that arrive within a short window.

If you want to improve latency at the expense of consistency, you can use CAUSAL_READ_RISKY (which makes getting a read version slightly cheaper), or avoid getting new read versions frequently by manually setting the read version for some of your transactions. You might read a stale value, but you still won’t be able to commit a transaction that read a stale value (unless you opt out of that too by messing with read conflict ranges/snapshot reads)
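For reference, here is roughly what both of those look like with the Python bindings (a minimal sketch assuming an fdb 6.x client; the key names are made up):

    # Minimal sketch with the Python bindings (fdb 6.x client assumed).
    import fdb

    fdb.api_version(620)
    db = fdb.open()

    # CAUSAL_READ_RISKY: getting a read version becomes slightly cheaper, at a
    # small risk of the version not being quite the latest after a fault.
    tr = db.create_transaction()
    tr.options.set_causal_read_risky()
    value = tr[b'some/key']                     # lazy future; blocks when used

    # Reusing a version you already know skips the get-read-version round trip.
    known_version = tr.get_read_version().wait()
    tr2 = db.create_transaction()
    tr2.set_read_version(known_version)
    possibly_stale = tr2[b'some/key']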

To acknowledge commits before they are durable on all transaction logs, you can configure a non-zero “tlog anti quorum”, but I don’t think anyone actually does this or recommends it, because it could allow a tlog to fall behind.

Thanks for pointing these out. Aside from disk latency, is there any way to decrease latency caused by the network at the expense of consistency? (Something similar to Cassandra’s tunable consistency.)

If you run a high enough load and don’t expect a high conflict rate, you could do the following (a rough code sketch follows below):

  • Keep a shared atomic in memory for maxVersion and initialize it with 0
  • After each commit, get the committed version and set maxVersion to the max of the committed version and your version (atomically)
  • When you start a transaction, check whether maxVersion > 0 - if it is, set the read version explicitly to maxVersion.
  • If any transaction fails with past_version, set maxVersion to 0 - this will cause the next starting transaction to fetch a version again (this can happen either because your load was very low or because there was a recovery).

If you do this, you will lose some external consistency guarantees - but all your transactions will still be serializable. Reading will then be very close to optimal (1 network roundtrip per read as soon as your local cache for storage server locations is warmed up and until you have failures or partitions are moved).
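A rough sketch of that bookkeeping with the Python bindings (the helper names are mine, and 1007 is the transaction_too_old error code, which older releases called past_version):

    import threading
    import fdb

    fdb.api_version(620)
    db = fdb.open()

    _lock = threading.Lock()
    _max_version = 0            # 0 means "no cached version, fetch a fresh one"

    def remember_committed_version(tr):
        # Call after a successful commit so the cached version keeps moving forward.
        global _max_version
        v = tr.get_committed_version()
        with _lock:
            if v > _max_version:
                _max_version = v

    def start_read_transaction():
        tr = db.create_transaction()
        with _lock:
            cached = _max_version
        if cached > 0:
            tr.set_read_version(cached)   # skip the get-read-version round trip
        return tr

    def handle_error(err):
        # 1007 = transaction_too_old (past_version): the cached version is no
        # longer readable, so fall back to fetching a fresh one next time.
        global _max_version
        if isinstance(err, fdb.FDBError) and err.code == 1007:
            with _lock:
                _max_version = 0

A real retry loop would call handle_error on failures and remember_committed_version after each successful commit.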

To speed up writing, you can batch transactions together. If you are willing to give up consistency, you can speed up your writes like this (although it will be a bit tricky to get right; a rough sketch follows the list):

  • Only start read-only transactions to read, never write anything in the same transaction from where you read.
  • Have a separate write transaction where you send your writes to.
  • Make sure that your write-only transactions are large (but not too large - there are limits). You can even share this transaction with several threads.
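Something like this hypothetical WriteBatcher is what I have in mind (Python bindings; the size threshold is arbitrary, and real code still needs to handle commit errors such as commit_unknown_result - blind writes just never raise not_committed):

    import threading
    import fdb

    fdb.api_version(620)
    db = fdb.open()

    class WriteBatcher:
        def __init__(self, max_bytes=500_000):        # stay well under the 10 MB limit
            self._lock = threading.Lock()
            self._max_bytes = max_bytes
            self._tr = db.create_transaction()
            self._bytes = 0

        def set(self, key, value):
            with self._lock:
                self._tr[key] = value                 # blind write: no read conflict range
                self._bytes += len(key) + len(value)
                if self._bytes >= self._max_bytes:
                    self._flush_locked()

        def flush(self):
            with self._lock:
                self._flush_locked()

        def _flush_locked(self):
            if self._bytes == 0:
                return
            self._tr.commit().wait()
            self._tr = db.create_transaction()
            self._bytes = 0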

These two strategies should give you the best possible read and write throughput. However, if you do both, be aware that you will lose conflict detection - blind writes never conflict.

I think this could also be done using the same transaction for reads and writes if you use exclusively snapshot reads, with the caveat that the transaction’s lifetime would be limited. Also, be aware that this is a relaxation of the isolation guarantees, which may or may not be desired.
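For example, with the Python bindings a snapshot read inside a read/write transaction looks like this (a minimal sketch; the keys and the function are made up):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    @fdb.transactional
    def mark_if_present(tr, probe_key, mark_key):
        # The snapshot read is consistent with the transaction's read version but
        # adds no read conflict range, so a concurrent change to probe_key will
        # not cause this commit to conflict.
        if tr.snapshot.get(probe_key).present():
            tr[mark_key] = b'seen'

    # Usage: mark_if_present(db, b'probe', b'mark')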

As for how big these transactions can get, I wanted to point out that while a transaction will allow you to commit up to 10MB, it’s actually best for latencies (both of the transaction being committed and other simultaneous commits) if you keep the size much lower, at least under 1MB. Version 6.2 offers some additional features to help manage transaction sizes.
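My assumption is that those 6.2 additions include a per-transaction size limit option and a way to query a transaction’s approximate size; something like this with the Python bindings (verify the exact names against your client’s API version):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    tr = db.create_transaction()
    tr.options.set_size_limit(1_000_000)        # reject the commit if it exceeds ~1 MB
    tr[b'some/key'] = b'some value'
    approx = tr.get_approximate_size().wait()   # rough byte count accumulated so far
    tr.commit().wait()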

Is this true for throughput as well? If I want to write as much data per second as possible, I would intuitively guess that large transactions will give me a higher throughput. #Commits/second might go down but #bytes/second should go up, no?

And yes, snapshot reads would also work. I am not sure whether the API allows for it, but could you reset the read version every second or so? In that case you could run it as long as you want (obviously you would get read committed as your isolation guarantee - which is pretty bad for many use cases).

To be honest, I’m not certain. Based on past experiences with large packets in commits, it wouldn’t surprise me if there was some inefficiency that ended up negating the expected gains. I think there are some improvements in this regard in 6.2, though, so it’s also possible that those experiences no longer apply. I’ve never actually compared the throughput of very large transactions with that of smaller ones, and I’d probably recommend testing different sizes if you were trying to increase throughput without regard for latency.