Relax consistency guarantees

Is it possible to relax the consistency guarantees of transactions? From my current understanding, writes must be durable on all replicas before the transaction is committed, thus impacting latency. Snapshot reads are available but they only relax isolation guarantees.

I think your understanding is incorrect. The only thing that needs to be durable before committing is the transaction log, which should (if I remember correctly) be an append-only log file. Appending to an append-only log file is very fast.
So for failure tolerance, we not only need to keep at least one of the three storage servers alive, but also need to keep at least one of the three transaction log servers alive.

A commit needs to be persisted to disk on multiple tlog servers (the same number as the replication factor) before the transaction can be acknowledged. I am not sure whether any of the storage servers need to be alive at the time of transaction commit.

To amortize the cost of transaction commits, proxies batch together multiple transactions that arrive within a short window.

If you want to improve latency at the expense of consistency, you can use CAUSAL_READ_RISKY (which makes getting a read version slightly cheaper), or avoid getting new read versions frequently by manually setting the read version for some of your transactions. You might read a stale value, but you still won’t be able to commit a transaction that read a stale value (unless you opt out of that too by messing with read conflict ranges/snapshot reads).
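
For concreteness, here is roughly what both of those look like with the Python bindings (just a sketch; the key is a placeholder, and in practice the cached version would come from an earlier commit rather than a fresh get_read_version call):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    # CAUSAL_READ_RISKY relaxes the causality guarantee of the get-read-version
    # request, which makes obtaining a read version slightly cheaper.
    tr = db.create_transaction()
    tr.options.set_causal_read_risky()
    print(tr[b'some/key'])

    # Reusing a version you already know skips the get-read-version round trip
    # entirely; reads may be stale, but a commit that read stale data will
    # still conflict and fail rather than break serializability.
    cached_version = tr.get_read_version().wait()
    tr2 = db.create_transaction()
    tr2.set_read_version(cached_version)
    print(tr2[b'some/key'])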

For acknowledging a commit before it is durable on all transaction logs, you can configure a non-zero “tlog anti quorum”, but I don’t think anyone actually does this or recommends doing it, because it could allow a tlog to fall behind.

Thanks for pointing these out. Aside from disk latency, is there any way to decrease latency caused by the network at the expense of consistency? (Something similar to Cassandra’s tunable consistency.)

If you run a high enough load and don’t expect a high conflict rate, you could do the following:

  • Keep a shared atomic maxVersion in memory and initialize it to 0.
  • After each commit, get the committed version and atomically set maxVersion to the maximum of the committed version and the current maxVersion.
  • When you start a transaction, check whether maxVersion > 0 - if it is, set the read version explicitly to maxVersion.
  • If any transaction fails with past_version, set maxVersion to 0 - this will cause the next transaction that starts to fetch a read version again (this can happen either because your load was very low or because there was a recovery).

If you do this, you will lose some external consistency guarantees - but all your transactions will still be serializable. Reading will then be very close to optimal (1 network roundtrip per read as soon as your local cache for storage server locations is warmed up and until you have failures or partitions are moved).
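
A minimal sketch of this in Python (the run_cached wrapper and error handling are my own illustrative choices, and the usual on_error retry loop is omitted for brevity):

    import threading
    import fdb

    fdb.api_version(620)
    db = fdb.open()

    # Highest committed version seen so far; 0 means "no cached version,
    # fetch a fresh read version from the cluster".
    _max_version = 0
    _lock = threading.Lock()

    def run_cached(fn):
        global _max_version
        tr = db.create_transaction()
        with _lock:
            cached = _max_version
        if cached > 0:
            tr.set_read_version(cached)   # skip the get-read-version round trip
        try:
            result = fn(tr)
            tr.commit().wait()
            committed = tr.get_committed_version()
            with _lock:
                if committed > _max_version:
                    _max_version = committed
            return result
        except fdb.FDBError as e:
            if e.code == 1007:            # past_version / transaction_too_old
                with _lock:
                    _max_version = 0      # next transaction fetches a fresh version
            raise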

To speed up writing, you can batch transactions together. If you are willing to give up consistency, you can speed up your writes like this (although it will be a bit tricky to get right):

  • Only use read-only transactions for reading; never write anything in the same transaction you read from.
  • Send your writes to a separate write-only transaction.
  • Make sure that your write-only transactions are large (but not too large - there are limits). You can even share this transaction across several threads.

These two strategies should give you the best possible read and write throughput. However, if you do both, be aware that you will lose conflict detection - blind writes never conflict.
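
Roughly, the write side could look like this in Python (WriteBatcher and the 500 KB threshold are my own illustrative choices, not part of the FDB API, and commit error handling is omitted):

    import threading
    import fdb

    fdb.api_version(620)
    db = fdb.open()

    class WriteBatcher:
        # Collects blind writes from several threads into one shared
        # transaction and commits once the batch reaches a size threshold.
        def __init__(self, db, max_batch_bytes=500_000):
            self._db = db
            self._max = max_batch_bytes
            self._lock = threading.Lock()
            self._tr = db.create_transaction()
            self._bytes = 0

        def set(self, key, value):
            with self._lock:
                self._tr.set(key, value)   # blind write: adds no read conflicts
                self._bytes += len(key) + len(value)
                if self._bytes >= self._max:
                    self._tr.commit().wait()
                    self._tr = self._db.create_transaction()
                    self._bytes = 0

    # Reads go through separate, read-only transactions:
    @fdb.transactional
    def read_value(tr, key):
        return tr[key]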

I think this could also be done using the same transaction for reads and writes if you use exclusively snapshot reads, with the caveat that the transaction’s lifetime would be limited. Also, be aware that this is a relaxation of the isolation guarantees, which may or may not be desired.

As for how big these transactions can get, I wanted to point out that while a transaction will allow you to commit up to 10MB, it’s actually best for latencies (both of the transaction being committed and other simultaneous commits) if you keep the size much lower, at least under 1MB. Version 6.2 offers some additional features to help manage transaction sizes.
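
For example, if I remember the option names correctly, 6.2 adds a per-transaction size limit and a way to check the accumulated size, something like:

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    tr = db.create_transaction()
    # Fail this transaction once it exceeds ~900 KB instead of the
    # default 10 MB hard limit.
    tr.options.set_size_limit(900_000)
    tr.set(b'example/key', b'example value')

    # Estimated commit size so far - useful for deciding when to split
    # work across another transaction.
    approx = tr.get_approximate_size().wait()
    tr.commit().wait()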

Is this true for throughput as well? If I want to write as much data per second as possible, I would intuitively guess that large transactions will give me a higher throughput. #Commits/second might go down but #bytes/second should go up, no?

And yes, snapshot reads would also work. I am not sure whether the API allows for that, but could you reset the read version every second or so? In that case you can run it as long as you want to (obviously you would get read committed as your isolation guarantee - which is pretty bad for many use cases).

To be honest, I’m not certain. Based on past experiences with large packets in commits, it wouldn’t surprise me if there was some inefficiency that ended up negating the expected gains. I think there are some improvements in this regard in 6.2, though, so it’s also possible that those experiences no longer apply. I’ve never actually compared the throughput of very large transactions with that of smaller ones, and I’d probably recommend testing different sizes if you were trying to increase throughput without regard for latency.

Wanted to add to this from our recent experience:

We have been careful not to let individual transactions grow too much in size (<1MB), but in places where we needed to delete a bunch of keys/key-ranges, we ended up fitting a lot of clear/clear-range operations into a single transaction (~500 per transaction), expecting to save transaction start/commit time.

This caused significant performance issues for us - the SS takes huge pauses when there are such transactions, showing alternating read spikes followed by write spikes with long pauses (seconds) in between. This causes the SS queues and TLog queues to build up significantly, and they take a long time to clear even after the load is removed.

Breaking these operations into many smaller transactions (each clearing ~10 keys/key-ranges) makes FDB perform much better - roughly as in the sketch below. There are no pauses and FDB absorbs the incoming load much more smoothly. We still need to come up with the optimal number of keys per transaction, but for now, keeping these transactions touching fewer rows has definitely helped.
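
For reference, the splitting looks roughly like this (a Python sketch; the batch size of 10 and the helper names are just what we are using for now):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    @fdb.transactional
    def clear_batch(tr, batch):
        # batch is a list of (begin_key, end_key) pairs to clear
        for begin, end in batch:
            tr.clear_range(begin, end)

    def clear_in_small_batches(db, ranges, ops_per_tx=10):
        # Issue the clears ~10 per transaction instead of packing
        # hundreds of them into a single commit.
        for i in range(0, len(ranges), ops_per_tx):
            clear_batch(db, ranges[i:i + ops_per_tx])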

I will provide more details and a test case for reproducing this sometime later (we are in the middle of some crunch time at present).

Interesting… We had a bit of a similar issue: if a clear is deleting too much data, FDB might choke (specifically, the sqlite storage engine will be unhappy). The behavior we’ve seen has been very similar to what you’re describing.

So I am not convinced that your problem here is a large transaction. I think it is that you’re deleting data too quickly (the fact that this can be problematic is something we have to fix eventually… Maybe Redwood will help).

Are you suggesting that individual large clear/clear-range operations could be choking FDB? We are quite certain that in our case individual clear/clear-range operations are not deleting more than a few tens of KB of data. It is just that when we pack too many of these into one transaction, FDB tends to choke up.

It is possible that doing those in separate transactions slows down the rate at which the transactions are issued and gives the SS more time to digest them.

Yes - at least this is our experience. Though to be fair, this became a problem when we were deleting a lot of data.

I see… In that case I would assume that you’re probably running into a different problem. Is it possible that you ran into the problem that the storage queue is horribly inefficient when it comes to memory utilization? Each mutation can generate up to 1KB of memory overhead (and I think with clear ranges you pretty much always hit this worst case). The SS needs to keep 5 seconds of mutations in memory, and we saw workloads where this became a bottleneck even before the disk ran into issues.

There was a previous thread on this: Large clear range performance. I haven’t heard of anyone hunting down whether there was a change in how background deletion was tuned between FDB3 and FDB6, or whether it is an EBS vs. local disk thing. (Though there seemed to be a vote or two in that thread that even FDB6 + EBS was fine? :shrug:)

I do not know the exact reason yet. From my observations, it is consistent that the same number of mutations (clear and clear-range combinations), when packed more heavily per transaction, leads to much worse SS behavior.

Can you try to debug this? It would be interesting to know what bottleneck you’re hitting there. Since the SS is spiking, I assume the TLogs are fine. It would be awesome if you could try the following:

  1. Replace all clear and clear-range mutations with set mutations. If you hit the same issue, that will tell us that you are simply running into memory-pressure issues on the SS.
  2. Replace your many small clear-range mutations with a few large ones. I am not sure how easy that will be in your workload, but generally I would expect this to increase performance significantly.
  3. Do you know what the CPU utilization looks like when the SS spikes? And what about disk?
  4. If CPU is very high during these periods, you can attach perf to the storage process during one of these spikes (perf record -p PID -g) and post the results here.

Thanks for these suggestions. I will try them out in a few days and post the findings. We are in the middle of a few things right now, so it may take a bit of time to get back to this.

Some other things that are maybe a little easier to get started with are to look at the various performance metrics and try to identify any that look bad. This can help focus the investigation going forward. You could look at:

  • CPU and Disk usage (both obtainable from ProcessMetrics if you have historical logs)
  • NetworkMetrics.S2Pri*, which is a sum of squared times between when a given priority is serviced by the run loop (this has been renamed in 6.2 to NetworkMetrics.SumOfSquaredPriorityBusy*). This can be used to narrow down the priority of any work that is monopolizing the CPU.
  • SlowTask events with long durations. These represent blocks of work that last a long time without yielding to other work. If you have a lot of these of sufficient length, they come with backtraces that could be used to identify the offending block.