Maximizing horizontal fdb "double CAS" write throughput, okay with losing last 10 mins of transactions

TLDR: What is the best way to horizontally scale a “double CAS” workload (where key collisions have very low probability), given that we’re okay with losing the last 10 minutes of transactions?

Longer:

This is related to my previous question Configure fdb to write to disk only once every 10 mins ; I’m writing this as a different question, as I’m attacking the problem from a different angle and providing more context.

All my transactions are of the form (k1, o1, n1, k2, o2, n2); each field is 32 bytes (so 32 * 6 = 192 bytes total). The transaction to be executed is:

```
// atomically, as a single transaction
if data[k1] == o1 && data[k2] == o2 {
  data[k1] = n1;
  data[k2] = n2;
}
```

All keys and values are 32 bytes. This is basically a “double CAS”: we check two places, compare each with its old value, and if both match, write the new values.

Hot keys are very rare but can occasionally happen. Looking at all transactions (hopefully > 1M) that happened in the last 2 s, fewer than 100 keys are involved in >= 2 transactions. Those conflicting transactions can individually commit or fail, and I’m okay with either outcome; some transactions failing is an acceptable tradeoff for higher write throughput.
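To make the semantics concrete, here is a toy in-memory model of the double CAS (plain Python, not FDB code): a transaction checks two keys against expected old values and installs the new values only if both match, so two transactions racing on a hot key resolve as one commit and one failure.

```python
# Toy in-memory model of the "double CAS" semantics (not FDB code).
def double_cas(data, k1, o1, n1, k2, o2, n2):
    """Apply one double-CAS transaction; return True iff it commits."""
    if data.get(k1) == o1 and data.get(k2) == o2:
        data[k1] = n1
        data[k2] = n2
        return True
    return False

data = {b"a": b"1", b"b": b"2", b"c": b"3"}

# Two transactions touching the hot key b"a": the first commits,
# the second fails its compare because b"a" has already moved on.
assert double_cas(data, b"a", b"1", b"10", b"b", b"2", b"20")
assert not double_cas(data, b"a", b"1", b"11", b"c", b"3", b"30")
assert data == {b"a": b"10", b"b": b"20", b"c": b"3"}
```

In FDB the same logic would run inside one transaction (reads of k1/k2, conditional writes), with the conflict detection done by the resolver rather than by the application.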

Similar to my previous problem, in the event of a crash, I’m okay with losing transactions, as long as we have the following guarantee: there is some time T0 where:

- all transactions before T0 go through
- all transactions after T0 are ignored
- T0 loses at most 10 minutes’ worth of data

My world is Markov-ian – given the present, I don’t care about the past.

Question:

  1. What is the optimal config for this (especially taking into account the “commit log” mentioned by @markus.pilman)?
  2. Is FoundationDB the right DB for this workload (millions of “double CAS” ops), or should I be looking at something else? [Other things I’m looking at are DynamoDB and ScyllaDB.]

Thanks!

[All machines are in the same datacenter; likely on the same rack; I might even be able to guarantee there are at most 3 physical switches between the “fdb client” and the “fdb server”]

You’ll need to run some benchmarks to get a firm intuition here. To give better advice, it would be helpful to know if you are optimizing strictly for throughput or if you have a latency ceiling.

I’m actually not too concerned about the TLog throughput. FDB’s Transaction System is excellent at batching small writes into larger, sequential disk operations. Even with ~192-byte transactions, FDB will wait to fill a buffer before flushing. Since your workload is uniform, these batches should be very stable—this is the ‘happy path’ for FDB. Note that FDB usually struggles more with large commits; since all commits need to be serialized, a massive batch increases tail latencies. Any transactional system has this problem to some degree, but FDB’s architecture makes it particularly visible.
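The batching behavior described above can be sketched as a toy group-commit buffer (my own illustration with hypothetical names, not FDB’s actual commit path): many small commits accumulate until the buffer fills or a deadline passes, then flush as one large sequential write.

```python
import time

class GroupCommitBuffer:
    """Toy group-commit buffer: accumulates small commits and flushes
    them as one large sequential write once the buffer fills or a
    deadline passes. (Illustrative sketch, not FDB's actual code.)"""

    def __init__(self, max_bytes=1 << 20, max_delay_s=0.002):
        self.max_bytes = max_bytes
        self.max_delay_s = max_delay_s
        self.pending = []
        self.pending_bytes = 0
        self.first_arrival = None
        self.flushes = []  # each entry models one sequential disk write

    def submit(self, commit: bytes):
        if self.first_arrival is None:
            self.first_arrival = time.monotonic()
        self.pending.append(commit)
        self.pending_bytes += len(commit)
        if (self.pending_bytes >= self.max_bytes
                or time.monotonic() - self.first_arrival >= self.max_delay_s):
            self.flush()

    def flush(self):
        if self.pending:
            self.flushes.append(b"".join(self.pending))
            self.pending, self.pending_bytes, self.first_arrival = [], 0, None

# Deadline set high so only the size trigger fires in this demo.
buf = GroupCommitBuffer(max_bytes=1920, max_delay_s=60.0)
for _ in range(25):
    buf.submit(b"x" * 192)   # one ~192-byte transaction
buf.flush()                  # drain the tail
# 25 small commits became 3 sequential writes: 10 + 10 + 5 commits.
assert len(buf.flushes) == 3
```

The point is that per-commit size barely matters for TLog disk throughput; the flush size does, which is why uniform tiny transactions are the happy path.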

My primary concern is memory pressure, not disk I/O. Each Storage Server (SS) maintains a versioned treap of the last ~5 seconds of mutations for read-versioning. With high-throughput writes, the memory overhead of this treap is non-trivial. For each mutation, the system creates a node in the treap with an overhead of hundreds of bytes per key (the value size matters less here). Your mental model should be that each storage server can only keep a fixed number of mutations in memory. If your throughput per 5 seconds exceeds that limit, FDB will throttle and your throughput will tank.
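A quick back-of-envelope using the numbers above (the 200-byte per-mutation overhead is my assumption; all I said is “hundreds of bytes”):

```python
# Back-of-envelope: memory held by the ~5 s in-memory mutation window.
tx_per_sec = 1_000_000 / 2      # ">1M transactions per 2 s" from the question
mutations_per_tx = 2            # each double CAS writes k1 and k2
window_s = 5                    # mutations retained for read versioning
overhead_per_mutation = 200     # ASSUMED bytes of treap-node overhead

in_flight_mutations = tx_per_sec * mutations_per_tx * window_s
window_bytes = in_flight_mutations * overhead_per_mutation
print(f"{in_flight_mutations:,.0f} mutations -> {window_bytes / 2**30:.1f} GiB")
# ~5,000,000 in-flight mutations, on the order of 1 GiB cluster-wide,
# before counting the 64-byte key/value payloads themselves.
```

That total gets divided across your storage servers, which is why the fix below is mostly “run more of them.”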

You can work around this by running more storage server processes and ensuring you have enough RAM. You likely want more than one storage server per disk, but this requires careful benchmarking. Keep an eye on StorageServer.StorageQueue in the Ratekeeper metrics; if this climbs, you need more storage servers. You might eventually need more TLogs as well, since they can hit CPU limits before they saturate their disks.
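As a concrete illustration, a foundationdb.conf fragment running several storage processes alongside a log process on one machine might look like the following (ports, paths, and process counts are placeholders; the right counts come out of your benchmarks):

```ini
## foundationdb.conf fragment (illustrative; ports and paths are placeholders)
[fdbserver]
datadir = /var/lib/foundationdb/data/$ID

# "transaction" is the TLog process class
[fdbserver.4500]
class = transaction

[fdbserver.4501]
class = storage

[fdbserver.4502]
class = storage

[fdbserver.4503]
class = storage
```

Each `[fdbserver.<port>]` section launches one process, so adding storage servers is just adding sections (and making sure the RAM is there to back them).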

Regarding your durability constraints: while the intuition that losing data buys performance is sound, it’s a dead end in FDB. The system simply isn’t built to allow this tradeoff. This isn’t just a design choice; it’s fundamental to FDB’s testing and simulation philosophy. The entire reliability model is based on “inject failures and verify the contract wasn’t broken.” Allowing for data loss breaks that contract, making the system untestable by FDB’s standards. Attempting to “hack” it is dangerous because FDB writes its own metadata as transactions. If you introduce data loss, you won’t just lose 10 minutes of transactions—you risk corrupting the system keyspace and losing the entire database.

Furthermore, you mentioned that if you lose data, it must be “prefix consistent” (if X is durable, all Y < X are durable). This is significantly harder to guarantee in a distributed system than on a single machine. On a single node, you just flush a log. In a distributed cluster, ensuring all nodes have reached the exact same point in the transaction stream after a partial failure requires complex global synchronization. Most distributed systems—FDB included—achieve this by being strictly durable; if you remove that durability, maintaining that consistent “cutoff point” across a cluster becomes an architectural nightmare.
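To make the “cutoff point” problem concrete, here is a minimal sketch (my own illustration, not FDB’s recovery algorithm): with several log replicas each durable to a different point, the only prefix-consistent restore point is the minimum durable version, and every replica ahead of it must discard a suffix it already wrote.

```python
# Sketch: picking a prefix-consistent cutoff across log replicas.
# Each replica has durably flushed the transaction stream up to some version.
durable_versions = {"logA": 1_042, "logB": 1_037, "logC": 1_040}

# The only version every replica can serve is the minimum: restoring to
# anything later would expose transactions some replica never persisted.
cutoff = min(durable_versions.values())
assert cutoff == 1_037

# Replicas ahead of the cutoff must throw away their extra suffix, even
# though it was durably written -- this is the global-coordination cost.
discarded = {n: v - cutoff for n, v in durable_versions.items()}
assert discarded == {"logA": 5, "logB": 0, "logC": 3}
```

On a single machine none of this arises: there is one log, and its flush point is the cutoff by definition.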

Before committing to a distributed system, I’d really evaluate why you need one at all. Distributed databases are expensive, operationally complex, and often slower than a single-machine system at this scale due to networking overhead and cache duplication across machines. A PostgreSQL instance with the default config is tuned for a toaster, not a modern server with high core counts and NVMe; a well-configured single-machine PostgreSQL can often beat that default by orders of magnitude.
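For example, if you went the single-node PostgreSQL route, your stated tolerance for losing recent transactions maps directly onto a handful of real settings (the values below are illustrative starting points for a large machine, not recommendations):

```ini
# postgresql.conf fragment (illustrative values; benchmark before adopting)
synchronous_commit = off   # commits return before the WAL is flushed;
                           # a crash loses at most ~3x wal_writer_delay
wal_writer_delay = 200ms   # background WAL flush cadence
checkpoint_timeout = 10min # less frequent checkpoints, cheaper writes
max_wal_size = 16GB        # avoid forced checkpoints under heavy write load
shared_buffers = 64GB      # size for the actual machine, not the toaster default
```

Note that `synchronous_commit = off` still preserves consistency on crash (you lose a prefix-consistent tail, never get corruption), which is exactly the guarantee you asked for.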

Since you don’t care about losing a few minutes of data, a simple backup/restore approach on a beefy single node would likely be cheaper and faster. You could also look into Aurora to offload some of that work. I strongly recommend spending time choosing the right system now, as a bad choice here will be very costly down the road.