Question about FoundationDB's SIGMOD '21 paper

Hi, after rereading FDB’s paper, I have a question: why does FDB use Active Disk Paxos for high availability rather than a regular Multi-Paxos or Raft consensus protocol?

I think this is related to how FDB is upgraded. By design, FDB has no external dependencies, so all configuration data is stored in FDB itself, i.e., in the coordinators. This rules out storing configuration data in an external service such as ZooKeeper.

To upgrade an FDB cluster, the design is to bounce all processes in the cluster, which simplifies protocol compatibility across versions. After the bounce/restart, all configuration data is read back from the disk state persisted on multiple machines, so Disk Paxos was chosen as the consensus protocol.
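
As a rough illustration of that recovery path (hypothetical file names, layout, and `version` field; not FDB's actual on-disk format), after a bounce a process could reconstruct the configuration by reading whatever each coordinator persisted locally and keeping the newest copy found on a majority:

```python
# Toy sketch of reading configuration back from coordinator disk state after a
# full-cluster bounce. The paths, the JSON layout, and the "version" field are
# invented for illustration; FDB's real recovery is more involved than this.

import json
from pathlib import Path


def read_config_after_bounce(state_files: list[Path]) -> dict:
    """Return the newest configuration readable from a majority of coordinators."""
    majority = len(state_files) // 2 + 1
    copies = []
    for path in state_files:
        try:
            copies.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            continue  # that coordinator's state is missing or unreadable
    if len(copies) < majority:
        raise RuntimeError("fewer than a majority of coordinators are readable")
    # Each persisted copy records the version at which it was written; newest wins.
    return max(copies, key=lambda c: c["version"])
```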

Thank you for your reply! Then another question: why did you choose to implement Active Disk Paxos rather than a normal consensus protocol like Multi-Paxos or Raft?

As far as I know, normal consensus protocols also require durability: they can flush configuration data to disk before a restart and read it back afterwards. So I think the difference between normal consensus protocols and Disk Paxos is small.

Are there any other reasons that encouraged you to choose Active Disk Paxos?

There was an idea that coordinators shouldn’t need to know about each other, nor should they be required to be part of the cluster that they’re coordinators for. Coordinators learn that they’re coordinators only because some process tries to talk to them as if they’re a coordinator. (A coordinator doesn’t need to have itself listed as a coordinator in its own cluster file to be a coordinator for that cluster.) Coordinators can also be shared across multiple (separate) clusters. If you try to design a consensus implementation where the actual processes participating in consensus don’t communicate directly, you’re basically implementing Disk Paxos. Active Disk Paxos extends Disk Paxos in ways that remove some of its bounds and limitations, notably the requirement for a fixed, known set of participating processes.
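
To make that contrast concrete, here’s a minimal sketch of the Disk Paxos pattern (names and structure invented, in Python; this is not FDB’s actual implementation): the coordinators act as purely passive per-proposer registers, and all of the protocol logic runs inside whichever process is trying to get a value decided, so the coordinators never have to talk to each other.

```python
# Toy sketch of one ballot of Disk Paxos (Gafni & Lamport), illustrating the
# point above: the "disks" (the coordinators) only store and return blocks,
# and the proposer does all of the work. Invented names; not FDB's code.

from dataclasses import dataclass


@dataclass
class Block:
    """One proposer's record on one disk."""
    mbal: int = 0       # highest ballot this proposer has started
    bal: int = 0        # ballot at which it last accepted a value
    val: object = None  # the value it accepted, if any


class Disk:
    """A passive coordinator: it only stores and returns blocks."""
    def __init__(self):
        self.blocks: dict[int, Block] = {}  # proposer id -> its block

    def write(self, pid: int, block: Block) -> None:
        self.blocks[pid] = block

    def read_others(self, pid: int) -> list[Block]:
        return [b for other, b in self.blocks.items() if other != pid]


def propose(pid: int, my_value, disks: list[Disk], ballot: int):
    """Run one ballot as proposer `pid`; return the decided value, or None if
    a higher ballot was observed (the caller would retry with a larger ballot)."""
    majority = len(disks) // 2 + 1
    block = Block(mbal=ballot)

    for phase in (1, 2):
        seen, reached = [], 0
        for disk in disks:
            # Write our block, then read everyone else's blocks from that disk.
            disk.write(pid, block)
            others = disk.read_others(pid)
            if any(b.mbal > ballot for b in others):
                return None  # someone started a higher ballot; abort
            seen.extend(others)
            reached += 1
        if reached < majority:
            return None  # couldn't reach a majority of disks

        if phase == 1:
            # Adopt the value accepted at the highest ballot, if any;
            # otherwise we're free to propose our own value.
            accepted = [b for b in seen if b.val is not None]
            chosen = max(accepted, key=lambda b: b.bal).val if accepted else my_value
            block = Block(mbal=ballot, bal=ballot, val=chosen)

    return block.val


if __name__ == "__main__":
    coordinators = [Disk() for _ in range(3)]
    print(propose(pid=1, my_value="new-configuration", disks=coordinators, ballot=1))
```

(In this toy version every disk is reachable, so `reached` always equals `len(disks)`; the majority check only matters once you model unreachable coordinators.)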

But honestly, I’ve really struggled to follow the paper, and never managed to match it up against the actual Paxos implementation in FDB. “Active Disk Paxos” is the label we went with because, when I talked to the person who wrote the implementation many years later, he said “oh, yeah, that sounds about right”, but it’s entirely possible that if someone invested the effort to match the implementation against the paper, there’d be notable deviations from what the paper outlines.

In the decade since all these decisions were made, I don’t think I’ve seen anyone deploy a shared set of coordinators, or a deployment that, for operational reasons, couldn’t have coordinators know about each other. I can sort of see how, a decade-plus ago when bare-metal deployments were very common, being able to run one set of very widely geographically distributed/replicated coordinators and reuse them for all your clusters sounded like a good feature, but the reality of deployments (and protocol-version compatibility limitations) hasn’t worked out that way. So if this were all rewritten today, I don’t think there’s a reason to strongly prefer Disk Paxos over a leadered or leaderless Paxos. But there’s also no reason to rewrite the Paxos implementation that currently exists, because it works just fine.


Got it!

Your reply completely answers my questions. Thanks a lot! 🙂