How should I choose coordination servers?

Aside from the recommended numbers in the configuration documentation, should I have 1 per rack, datacenter or machine?

Should I make every single process a coordinator?

Is there any way to have coordinators managed automatically?

What are some strategies used in production for managing very large production clusters? (ie. Apple)

2 Likes

The way we (Snowflake) chose the coordinators is by running coordinators auto on fdbcli after the cluster has been set up. But you can definitely chose them manually.

To understand how to chose coordinators (and how many to chose), you need to understand the following:

  1. Your coordinators run a paxos-like algorithm to store a small amount of state that is needed to bring up your database. Think of them as FDBs own Zookeeper (AFAIK the earliest versions of FDB used Zookeeper for this).
  2. If you lose a majority of your coordinators you lose your database (including all your data - it would be possible to restore the data but I don’t think we have any tools that do that for you).
  3. Many coordinators will make you generally saver but it will be slow.
  4. More than one coordinators on the same machine is typically pointless.

So typically you want to have 5 coordinators and you want to have them as far away from each other as possible (in the sense of your network topology).

So what we typically do is we have at least one coordinator per data center (or availability zone in AWS). If we have fewer than 5 data centers we make sure that all coordinators run on different physical machines. Additionally we chose to store the coordinator-data on a network disk (EBS) which would allow us to restore the data even if we would lose a majority of all physical machines. As the coordinated state is small, network attached storage is not problematic for this task.

So as you can see we have 15 copies of this data (5 coordinators, 3 copies per coordinator). But so far we never lost a majority of our coordinators :crossed_fingers:

2 Likes

How do you achieve this?

Are you reserving a few processes per datacenter with class=coordinator with a custom data dir pointing to EBS?

Or is there a way to specifically set the folder for coordination data files (different than the storage, queues, tlog, …)

Hi Markus, could you tell what kind of operations will be get slower with larger number of coordinators (and would this slowness be a concern)?

We actually store all our data on EBS - what I meant was storing coordinator data on EBS doesn’t come with a performance penalty (storing tlog and storage data on EBS does). I think using EBS only for coordinated data might be a valid use-case (although I am not sure it would make sense - as the rest of your data could still only survive two machine failures at most…).

Coordinators are responsible for three things and all of these would become significantly slower if you have a large number of coordinators:

  1. They elect the cluster controller (cluster controller is the leader of a cluster). This election process is done in an iterative way: election takes a short amount of time but if it fails, this election time is increased. Having more coordinators means that the probability of the election failing increases. Therefore you would need more iterations. I would imagine that with 100 coordinators election would be very very slow.
  2. They store a global state. During recovery this state is read, locked, and rewritten to all coordinators. If the system has more than one master running, one of them will eventually fail. But as this protocol involves several round-trips to all coordinators, it would slow down the process. Therefore, your recovery times will probably go up significantly if you have many coordinators.
  3. Clients connect through coordinators. IIRC they currently ask all coordinators for the coordinated state (which includes a list of proxies to which the client will talk to) and the client will chose whatever state it receives from a majority. This will be optimized in the future, but nonetheless in the worst case a client will always need to talk to all coordinators. Therefore having many coordinators will make connection establishment slower.

Apart from tese arguments I don’t really see any benefits of having more than 5 coordinators. If you have triple replication, you will lose data if you lose more than two machines. But with 5 coordinators you can afford to lose two machines as well. So it simply doesn’t feel very useful to be super careful about the coordinators if your storages and tlogs don’t provide more fault-tolerance.

2 Likes