Optimal configuration for more than 3 DCs

What is the optimal configuration for a cluster spanning more than 3 datacenters? Currently, three_datacenter mode is available and will maintain availability during the failure of an entire DC; however, this does not address the case of having more than 3 DCs.

Can three_datacenter mode be used with more than three DCs? If not, what would a cluster that spans multiple DCs look like and how would latencies and availability work?

I think most of this was answered in:

And I’d be happy to expand on any part of that if you’d like.

Thanks for this! A few questions though:

  • In theory, if three_datacenter were to work with more than 3 DCs, would it still achieve single DC fault tolerance and reasonable latency? Any other unforeseen side effects?
  • Where is three_datacenter implemented in the source?
  • The previous answer mentions future support for multi-region failure tolerance. Will this support an arbitrary number of datacenters?

It would give you what it promises: your data will be stored in three different datacenters, and committed across two of them. With the “I haven’t tested this so I might be wrong” disclaimer, running with 5 DCs would end up with each DC holding a random 3/5 of your total data. You’d prefer to read it from the local DC if it’s there, but a random 2/5 of your data would be remote. You’d still have one DC as the place where writes happen, and commits would go across it and another random DC. You’d still only be able to survive one DC failure, and a loss of two DCs simultaneously would potentially render FDB unavailable.

three_datacenter turns into the TLog and Storage policies in fdbclient/ManagementAPI.actor.cpp. storagePolicy reads “I require two different failure domains within each of 3 datacenters.” tLogPolicy reads “I require two different failure domains within each of 2 datacenters.” This adds up to 6 storage processes holding data, with a minimum of 4 TLogs required to fulfill the policy.
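For reference, this is roughly what the policy construction looks like in fdbclient/ManagementAPI.actor.cpp (paraphrased from memory, so check the source for the exact form):

```cpp
// Roughly the three_datacenter case of configure-mode parsing
// (paraphrased, not copied verbatim from the source).
storagePolicy = Reference<IReplicationPolicy>(new PolicyAcross(3, "dcid",
    Reference<IReplicationPolicy>(new PolicyAcross(2, "zoneid",
        Reference<IReplicationPolicy>(new PolicyOne())))));   // 3 DCs x 2 zones = 6 storage replicas
tLogPolicy = Reference<IReplicationPolicy>(new PolicyAcross(2, "dcid",
    Reference<IReplicationPolicy>(new PolicyAcross(2, "zoneid",
        Reference<IReplicationPolicy>(new PolicyOne())))));   // 2 DCs x 2 zones = 4 tLogs
```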

That’s the goal. There are a lot of places in the code that are hardcoded to two datacenters; extending that to three will largely be the work required to extend it to more than three. I have no timeframe to offer for when this work will be done, though. “Support generic cross-DC replication plans” is the issue to follow for updates on this.

I may be understanding this wrong, but could you please clarify this? I thought writes are spread evenly across the cluster even in three_datacenter.

tLogPolicy says each mutation is synchronously committed to two different datacenters. If you lose one DC, we can still recover copies of all mutations that haven’t yet been durably applied by storage servers from the one DC that’s still available.

With three DCs, data is spread evenly across the three datacenters with three_datacenter. The actual commit traffic will be in the DC where the rest of the {proxy, resolver, master} transaction subsystem exists, plus one of the other two datacenters. The choice of that other datacenter is random and is fixed when recruitment happens as part of recovery.

With five DCs, team building will build storage teams that store 2 copies of data in each of 3 DCs, but should use all of the DCs available. Thus, one will end up with a roughly even distribution of 3/5 of the data in each datacenter. The choice of which one other DC satisfies the requirement that mutations be committed synchronously on transaction logs across two DCs is still made at recovery time, so there will still be a fixed choice of one other DC, out of the four available options, that receives all of the commit traffic.
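As a quick sanity check on that 3/5 figure (my arithmetic, assuming team building spreads teams uniformly over the possible DC triples), a given DC appears in

$$\frac{\binom{4}{2}}{\binom{5}{3}} = \frac{6}{10} = \frac{3}{5}$$

of all 3-of-5 DC combinations, so it holds roughly 3/5 of the data, and the remaining 2/5 of the keyspace has no local copy there.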


There is, in theory, I suppose, nothing stopping us from making a five_datacenter mode that says storage policy = 5 dcid x 2 zoneid and log policy = 3 dcid x 2 zoneid, which would then give you a configuration with all data in 5 DCs that tolerates the loss of 2 DCs. If 1 DC failure tolerance vs 2 DC failure tolerance is a large enough deal to someone, I can open an issue to more concretely discuss adding this. However, you’d still inherit all of the issues that three_datacenter has that make multi-region the greatly preferred way to run a highly available FDB cluster, and it would thus be immediately deprecated as soon as multi-region supports more than two DCs.
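To make that concrete, a hypothetical five_datacenter mode would just be the same PolicyAcross pattern with wider fan-out (this does not exist in the codebase; it’s only a sketch of what such a mode would say):

```cpp
// Hypothetical five_datacenter mode -- not implemented anywhere, shown only to
// illustrate what "5 dcid x 2 zoneid" and "3 dcid x 2 zoneid" would look like.
storagePolicy = Reference<IReplicationPolicy>(new PolicyAcross(5, "dcid",
    Reference<IReplicationPolicy>(new PolicyAcross(2, "zoneid",
        Reference<IReplicationPolicy>(new PolicyOne())))));   // 5 DCs x 2 zones = 10 storage replicas
tLogPolicy = Reference<IReplicationPolicy>(new PolicyAcross(3, "dcid",
    Reference<IReplicationPolicy>(new PolicyAcross(2, "zoneid",
        Reference<IReplicationPolicy>(new PolicyOne())))));   // 3 DCs x 2 zones = 6 tLogs; survives losing 2 DCs
```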

Apparently, we’ve only mentioned the downsides collectively as part of Evan’s multi-region talk, so you can hear an overview of three_datacenter mode starting at 3m45s, and the downsides starting at 7m09s.

Which are:

  • High storage overhead
  • High latencies
  • High WAN bandwidth usage

… and I could have sworn there was something about data distribution being disabled whenever a DC has failed, because it thinks every team is unhealthy and has no way to resolve that.

All this has been very helpful. Thanks! If multi-region is the way to go, how would things work with more than 2 regions? It seems like multi-region is designed around having two regions in a “master/slave” configuration, if you will, so adding more regions would require very different semantics (I may be wrong, however). And is there any timeline for such?

The last time there were conversations about it, the idea was to let the user configure regions as having synchronous or asynchronous replication. Synchronously replicated regions would get transaction logs in them that directly receive commits. Asynchronously replicated regions would get log routers that efficiently pull mutations across the WAN once they’re committed. To handle region failures there would probably need to be some sort of “synchronous to 3 of any 5 of these regions” or “synchronous to these regions if they’re available” logic, or something. Details are obviously TBD, so what’s presented to the user might change drastically.

As it is today, satellites are synchronous DCs, and the secondary is an asynchronous one. So committing to 5 synchronous regions would look much like committing to the primary and 2 satellites in the code base today, and replicating to 5 asynchronous regions would work the way replicating to the secondary works today. So the base code and patterns are there; it just needs to be generalized, which is a lot more work than it sounds.
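For context, this is roughly what today’s two-region configuration looks like (field names as in the multi-region configuration docs; the DC ids here are made up):

```json
"regions": [
  {
    "datacenters": [
      { "id": "dc1", "priority": 1 },
      { "id": "dc2", "priority": 1, "satellite": 1 }
    ],
    "satellite_redundancy_mode": "one_satellite_double"
  },
  {
    "datacenters": [
      { "id": "dc3", "priority": 0 }
    ]
  }
]
```

Here dc2 is a synchronous satellite (mutations are committed there before the commit is acknowledged) and dc3 is the asynchronous secondary that pulls mutations via log routers; generalizing would mean allowing more entries of each kind.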

There is no current timeline for this work, nor is anyone actively working on it right now.
