Write-availability of FoundationDB

I found the following comparison of yugabyteDB and FoundationDB: Compare FoundationDB with YugabyteDB | YugabyteDB Docs

The article includes some claims about problems with write-availability in FoundationDB:

FoundationDB takes a very different approach in this regard. Instead of using a distributed consensus protocol for data replication, it follows a custom leaderless replication protocol that commits writes to ALL replicas (aka the transaction logs) before the client is acknowledged.

The following table states that all replicas must be available to commit a transaction:


The description in the article does not fit with the understanding I got from reading the FoundationDB documentation.
From there, my understanding was that

  1. FoundationDb is available as long as a majority of the coordinators and at least one of the replicas for each data item is available.
  2. When machines go down, there can be several seconds unavailability but otherwise the system stays available.

Is my understanding correct or should I expect longer downtimes in write-availability when a machine fails?

There are two aspects that you need to consider separately:

  • coordinators: this is where the state of the fdb cluster is maintained. This part of fdb uses Paxos to maintain consistent, distributed state of the cluster. It is imperative that n/2 +1 of these nodes are available for cluster to be operational. Also note, that typically the coordinator nodes are very small fraction of the large cluster (typically 3, 5, 9 etc). Also note that these roles do not automatically move over if a machine fails. One needs to manually exclude/include machines in this role by command/config.
  • TLog: this is being referred to as the replicas in the article. Fdb commits the transaction to all the TLog replicas, for the data being touched in the transaction, before considering it as committed. So, if you have n data replication factor, then you need at least n alive nodes in the cluster. However, note that these roles can be assigned to nodes on demand, if the machines holding these roles fail. So unless you have simultaneous failure of n+ nodes, and you are unfortunate that some data shard had all its replicas on these n failed nodes, you would not run into availability issues. E.g you can have 20 node cluster, with say, 10 tlogs roles, and data replication as 3. For each data shard(few mbs wide) some team made of 3 of the 10 TLog nodes will hold the replicas. There are many such teams. If a TLog node fails, then the replicas will be shifted to other teams, or new teams may be created dynamically, or new TLog nodes will be recruited dynamically to get the TLog counts back. In this example, you can go upto 17 node failure, gradually, without running into availability issues (assuming original coordinator count was <=5)

Thanks, that cleared things up for me. Just a quick follow-up question:

Does “without running into availability issues” mean that there is no availability loss at all or will there be a short period (e.g. a few seconds) of unavailability while the replicas are shifted to other teams?

There might be few seconds (worst case) to recruit node into roles, or to update shard to TLog mapping. But that will be transparent to users, as it will be taken care of ny automatic transaction retry by the fdb client. Or it could be that the transaction takes a few seconds longer to commit during the reassignment. But clients should not see any failures.

1 Like

The “Minimum number of nodes to allow fault tolerance” is… misleading at best. I worked through one round of corrections to that page with them before, but some pieces were still disputed. FoundationDB didn’t have great public documentation at the time when their article was written (and it could still be much better). They gave a very respectable attempt at consuming and digesting the information that was publicly available, but it would have been nice if they double checked it with us before posting, especially given that we’d already met. :man_shrugging:

RF=5 means that you can tolerate 4 failures in FDB, but only that you can tolerate 2 failures in Raft. It’d be far more fair to compare fault tolerance over Replication Factor:

F=1 F=2 F=3
FDB Raft FDB Raft FDB Raft
Replicas 2 3 3 5 4 7

Which is really just comparing F+1 (FDB) vs 2F+1 (Raft). Most folk run with F=2, so that’s triple or 3 replicas in FDB, and RF=5 or 5 replicas in Raft.

That table was also trying to show that some amount of extra nodes are required for FDB to be able to reconfigure in reaction to a failure, which is true. The minimum size for a 3 replica cluster in FDB is 5 nodes, so that if 2 of them fail, there are 3 left to reconfigure to. The minimum size of a Raft-based DB for 2 failures is also 5. This is why their table shows FDB: 5 for RF=3.

However, this is misleading once the size of the cluster increases. FDB requires 3 replicas * number of shards + 2 nodes to survive 2 failures. Raft requires 5 replicas * number of shards nodes to survive 2 failures. As the number of shards approach infinity, that means FDB requires 40% less nodes, whereas their table presents the opposite view.

The tradeoff for this cost savings is that there’s an unavailability window during the reconfiguration step. In chain replicated databases, this was shard unavailability during reconfiguring the chain. In FDB, this is write unavailability to all shards. It requires a failure of a key transaction subsystem component though, which is typically presented as 10% of the cluster, so a 10% chance of a node failure causing a few seconds of unavailability. Raft-based databases will have temporary shard unavailability if a leader fails, so on a large enough cluster, every node failure will cause a shard of data to be unavailable until a new raft leader is re-elected, but that’s only a small fraction of the shards being unavailable.

I personally view the tradeoff as FDB gives you 99.95% availability and you pay for 3 replicas per shard. Raft gives you 99.99% availability and you pay for 5 replicas per shard. Choose the one you wish depending on the financial value of availability for you.