Two datacenters with double redundancy in each?

TrajansRow · February 14, 2020, 7:22pm

I’m trying to build a configuration that can provide double redundancy across two data centers. My goal is to support the failure of an entire data center, plus one machine in the remaining data center, without affecting availability.

If I configure a single region cluster (using double redundancy) with two datacenters in that region, will FoundationDB automatically keep two copies of data in each data center?

markus.pilman · February 14, 2020, 7:31pm

We are also trying to find a good solution to run on two data centers and survive the failure of one. Sadly I am not aware of a good way of doing that. You can use fdbdr to get a hot standby, but then you still risk some data loss if you lose a data center.

The fundamental problem with two data centers is that there really is no good way to distribute your coordinators across two data centers. The coordinators run a majority vote so you always need to have a majority of coordinators up in order to survive a data center failure.
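The majority constraint can be checked by brute force. The following is a minimal sketch that only counts coordinators; it does not model FoundationDB's actual coordination protocol, and the function names are hypothetical.

```python
def has_quorum(alive, total):
    """True if strictly more than half of the coordinators are reachable."""
    return alive > total // 2


def splits_surviving_either_dc_loss(total):
    """Return every (dc_a, dc_b) split that keeps quorum whichever DC fails."""
    survivors = []
    for dc_a in range(total + 1):
        dc_b = total - dc_a
        # Losing DC A leaves dc_b coordinators alive, and vice versa.
        if has_quorum(dc_b, total) and has_quorum(dc_a, total):
            survivors.append((dc_a, dc_b))
    return survivors


for total in (3, 5, 7):
    # Each line prints an empty list: no two-way split survives both failures.
    print(total, splits_surviving_either_dc_loss(total))
```

Whichever data center holds the majority becomes a single point of failure for the whole cluster, which is why a third location for coordinators comes up later in the thread.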

TrajansRow · February 14, 2020, 9:36pm

That’s a good point, Markus!

My most important requirement is having two copies of the data in each of the two data centers. To keep write/update capability through a DC failure in a two-datacenter setup, I could maintain a small satellite in AWS so that a majority of coordinators remains available to the survivor.

Another question comes to mind, though: if a datacenter (one that happens to hold a minority of coordinators) loses its outbound network link but is otherwise functioning internally, can I still read the data in that data center, even if it cannot be changed?

markus.pilman · February 14, 2020, 11:33pm

If you can run a satellite in AWS (for some users this might not be possible for security compliance reasons), that would probably work well, as the amount of traffic that goes to the coordinators is tiny.

As for reading from a data center that has lost its outbound link: no. In such a scenario the cluster would not be able to recover, and clients wouldn’t be able to open a connection.

alexmiller · February 15, 2020, 12:12am

Then it sounds like you’ll be fine. I’d suggest running 9 coordinators (3 in each DC and 3 in AWS) so that you can lose 1 DC and 1 machine and still have quorum. If you make a region config where you treat each of your datacenters as a “region”, you’ll be able to achieve your goals with multi-region. You should set one of the datacenters as your preferred primary, and then set up the other datacenter to be both a satellite and the preferred secondary.
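The 3 + 3 + 3 suggestion can be sanity-checked with simple arithmetic. This is only a counting sketch; the site names (`dc1`, `dc2`, `aws`) and the function name are hypothetical.

```python
def survives(layout, lost_site, extra_machines_lost=1):
    """Check quorum after losing one whole site plus some extra machines."""
    total = sum(layout.values())
    alive = total - layout[lost_site] - extra_machines_lost
    return alive > total // 2  # strict majority of all coordinators


nine = {"dc1": 3, "dc2": 3, "aws": 3}
for site in nine:
    # 9 - 3 - 1 = 5 coordinators remain, and 5 is a majority of 9.
    print(site, survives(nine, site))

# With only one coordinator in AWS the margin disappears:
# 7 - 3 - 1 = 3, which is not a majority of 7.
seven = {"dc1": 3, "dc2": 3, "aws": 1}
print("dc1 with a single AWS coordinator:", survives(seven, "dc1"))
```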

alloc · February 17, 2020, 3:02pm

It’s possibly worth underscoring that in such a configuration you also need to configure the logs to have a presence in both regions, which is what setting the secondary data center as a satellite does. Otherwise FDB will never fail over from the primary to the secondary, as it can’t guarantee that there isn’t data present in only one region.
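A rough sketch of what such a region configuration might look like, generated from Python so it can be written to a file. The key names follow my understanding of FoundationDB's region configuration documentation; the datacenter ids `dc1` and `dc2`, the chosen satellite redundancy mode, and the log count are assumptions, and the exact placement of `satellite_logs` may differ between versions, so check it against the documentation for your release.

```python
import json


def two_dc_regions(primary_dc, secondary_dc):
    """Build a two-region layout where the secondary DC is also a satellite."""
    return {
        "regions": [
            {
                # Preferred primary region: its own DC plus the other DC
                # acting as a satellite, so logs exist in both locations.
                "datacenters": [
                    {"id": primary_dc, "priority": 1},
                    {"id": secondary_dc, "priority": 0, "satellite": 1},
                ],
                "satellite_redundancy_mode": "one_satellite_double",  # assumed choice
                "satellite_logs": 2,  # assumed value
            },
            {
                # Preferred secondary region.
                "datacenters": [{"id": secondary_dc, "priority": 0}],
            },
        ]
    }


config = two_dc_regions("dc1", "dc2")
print(json.dumps(config, indent=2))
```

The ids have to match the datacenter ids that the fdbserver processes are started with. As far as I know, a file like this is loaded with fileconfigure in fdbcli, and the cluster also needs usable_regions=2 for the second region to hold data.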

TrajansRow · February 20, 2020, 9:38pm

Thanks Alec, that’s also good to keep in mind!

There is an alternative design that I’m considering: a single-data-center cluster that uses fdbdr to replicate data to a standby cluster in a second data center. Each cluster could have triple or double redundancy.

There are two questions I have about this: 1.) Can I use the standby as a read-only database, with the understanding that it might be behind the primary? 2.) What is the procedure for switching the standby to full read/write if the network link between the data centers is unavailable (I believe ‘fdbdr switch’ requires both clusters to be reachable)?

alexmiller · February 20, 2020, 11:31pm

1.) Yes, but your clients need to set the read_lock_aware transaction option.
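A minimal sketch of setting that option with the Python bindings. The helper names, the cluster file argument, and the API version 620 are assumptions; match the version to your installed client.

```python
def get_from_locked_standby(tr, key):
    """Read one key from a DR destination, which stays locked while DR runs."""
    # Allow this transaction to read even though the database is locked.
    tr.options.set_read_lock_aware()
    return tr.get(key).wait()


def read_standby(cluster_file, key):
    """Open the standby cluster and read `key` (requires the fdb package)."""
    import fdb  # imported lazily so the helper above has no hard dependency

    fdb.api_version(620)  # assumption: adjust to your client version
    db = fdb.open(cluster_file)
    # fdb.transactional supplies a transaction as `tr` and handles retries.
    return fdb.transactional(get_from_locked_standby)(db, key)
```

The option has to be set on every transaction that reads from the standby, and the values returned may lag behind the primary.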

2.) Use fdbdr abort on the secondary-now-primary.

If your primary comes back online, you’ll need to use fdbbackup cleanup to stop it from continuing to save the mutation stream. (Though it’s a command on fdbbackup, it applies to DR also.)
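The two steps could be scripted roughly as below. This sketch only builds the command lines and does not run them. The cluster file paths are hypothetical, and the flags shown (-s, -d, --dstonly, -C) reflect my recollection of the tools' options rather than anything stated in this thread, so verify them against fdbdr --help and fdbbackup --help before use.

```python
def promote_standby_command(old_primary_cluster_file, standby_cluster_file):
    """argv for aborting DR on the standby while the old primary is unreachable."""
    return [
        "fdbdr", "abort",
        "--dstonly",  # assumed flag: make no changes on the source cluster
        "-s", old_primary_cluster_file,
        "-d", standby_cluster_file,
    ]


def cleanup_old_primary_command(old_primary_cluster_file):
    """argv to run once the old primary is back, to stop it saving mutations."""
    return ["fdbbackup", "cleanup", "-C", old_primary_cluster_file]


# Hypothetical paths, for illustration only.
steps = [
    promote_standby_command("/etc/foundationdb/dc1.cluster", "/etc/foundationdb/dc2.cluster"),
    cleanup_old_primary_command("/etc/foundationdb/dc1.cluster"),
]
for argv in steps:
    print(" ".join(argv))
```

Each list can be passed to subprocess.run once the flags have been confirmed for your version.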

TrajansRow · February 21, 2020, 3:44pm

This is some excellent clarification, Alex! I assume the abort operation is required because there will be a replication process running on each cluster, pointed at the other? I keep seeing various DR configurations discussed here in the forums, on blogs, etc., but haven’t seen any comprehensive explanation of these configurations in the official documentation. Is there an extensive DR writeup on foundationdb.org that I’m missing?

I’ve also seen mentions of bi-directional replication with a two-cluster setup. Is that also a possibility, where you have two read+write clusters that replicate data to each other?

alexmiller · February 21, 2020, 11:57pm

The abort operation is required to configure the secondary to stop waiting for and applying mutations from the primary, and to unlock it for use as the new primary.

For DR and backup documentation, I only know of the “Backup, Restore, and Replication for Disaster Recovery” page.

I think what you’ve seen are discussions of setting up DR agents in both directions, so that you can start a DR in either direction. Presumably once you convert a secondary into a primary, you’ll want to convert the former primary into a DR secondary, and having dr_agents already configured to run makes that much easier. You cannot set up DRs in both directions to try to build a multi-master FDB.
