Optimal configuration for more than 3 DCs

What is the optimal configuration for a cluster spanning more than 3 datacenters? Currently, three_datacenter mode is available and will maintain availability during an entire DC failure, however, this does not address the issue of having more than 3 DCs.

Can three_datacenter mode be used with more than three DCs? If not, what would a cluster that spans multiple DCs look like and how would latencies and availability work?

I think most of this was answered in:

And I’d be happy to expand on any part of that if you’d like.

Thanks for this! A few questions though:

  • In theory, if three_datacenter were to work with more than 3 DCs, would it still achieve single DC fault tolerance and reasonable latency? Any other unforeseen side effects?
  • Where is three_datacenter implemented in the source?
  • The previous answer mentions future support for multi-region failure tolerance. Will this support an arbitrary number of datacenters?

It would give you what it promises: Your data will be stored in three different datacenters, and committed across two of them. With the “I haven’t tested this so I might be wrong” disclaimer, running with 5 DC’s would end up with each DC having a random 3/5 of your total data. You’d prefer to read it from the local DC if if it’s there, but a random 2/5 of your data would be remote. You’d still have one DC as the place where writes happen, and commits would go across it and another random DC. You’d still only be able to survive one DC failure, and a loss of two DCs simultaneously would potentially render FDB unavailable.

three_datacenter turns into the TLog and Storage policies in fdbclient/ManagementAPI.actor.cpp. storagePolicy reads “I require two different failure domains within each of 3 datacenters.”. tLogPolicy reads “I require two different failure domains within each of 2 datacenters”. This adds up to 6 storage processes will hold data, and we require a minimum of 4 TLogs to fulfill this policy.

That’s the goal. There’s a lot of places in the code that are hardcoded to two datacenters. Extending that to three will largely be the work required to extend it to more than three. I have no timeframe to offer for when this work will be done though. Support generic cross-DC replication plans is the issue to follow for updates on this.

I may be understanding this wrong, but could you please clarify this? I thought writes are spread evenly across the cluster even in three_datacenter.

tLogPolicy says each mutation is synchronously committed to two different datacenters. If you lose one DC, we can still recover copies of all mutations that haven’t yet been durably applied by storage servers from the one DC that’s still available.

With three DCs, data is spread evenly across the three clusters with three_datacenter. The actual commit traffic will be in the DC where the rest of the {proxy, resolver, master} transaction subsystem exists, and one of the other two datacenters. The decision of which other datacenter will be random and will be fixed when recruitment is happening as a part of recovery.

With five DCs, team building will build storage teams that store 2 copies of data in each of 3 DCs, but should use all of the DCs available. Thus, one will end up with with a roughly even distribution of 3/5 of the data in each datacenter. The choice of which one other DC will be used to satisfy the requirement for mutations to be committed on transaction logs synchronously across two DCs is still made at recovery time, so there will still be a fixed choice of one other DC of the four available options, that would receive all of the commit traffic.

There is, in theory, I suppose nothing stopping us from making a five_datacenter mode that says storage policy = 5 dcid x 2 zoneid and log policy = 3 dcid x 2 zoneid, which would then give you a configuration for all data to be in 5 DCs and tolerate the loss of 2 DCs. If 1 DC failure tolerance vs 2 DC failure tolerance is a large enough deal to someone, I can open an issue to more concretely discuss adding this. However, you still inherit all of the issues that three_datacenter has that leads to multi-region being the greatly preferred way to run a highly available FDB cluster, and it thus would be immediately deprecated as soon as multi-region supports more than two DCs.

Apparently, we’ve only mentioned the downsides collectively as part of Evan’s multi-region talk, so you can hear an overview of three_datacenter mode starting at 3m45s, and the downsides starting at 7m09s.

Which is:

  • High storage overhead
  • High latencies
  • High WAN bandwidth usage

… and I could have sworn there was something about data distribution is disabled whenever a DC is failed, because it thinks every team is unhealthy and has no way to resolve that.

All this has been very helpful. Thanks! If multi-region is the way to go, how would things work with more than 2 regions? It seems like multi-region is designed around having two regions in a “master/slave” configuration, if you will, thus adding more regions would require very different semantics (I may be wrong, however). And is there any timeline for such?

The last time there were conversations about it, the idea was to let the user configure regions as having synchronous or asynchronous replication. Synchronously replicated regions would get transaction logs in them that directly receive commits. Asynchronously replicated regions would get log routers that efficiently pull mutations across the WAN once they’re committed. To handle region failures there would probably need to be some sort of “synchronous to 3 of any 5 of these regions” or “synchronous to these regions’s if they’re available” logic, or something. Details obviously TBD, so what’s presented to the user might change drastically.

As it is today, satellites are synchronous DCs, and the secondary is an asynchronous one. So committing to 5 synchronous regions would look much like committing to the primary and 2 satellites in the code base today. Asynchronously replicating to 5 asynchronous regions would work like how replicating to the secondary works today. So the base code and patterns are there, it just needs to be generalized, which is a lot more work than it sounds.

There is no current timeline for this work, nor is anyone actively working on it right now.

1 Like