Two datacenters with double redundancy in each?

I’m trying to build a configuration that can provide double redundancy across two data centers. My goal is to support the failure of an entire data center, plus one machine in the remaining data center, without affecting availability.

If I configure a single region cluster (using double redundancy) with two datacenters in that region, will FoundationDB automatically keep two copies of data in each data center?

We are also trying to find a good solution to run on two data centers and survive the failure of one. Sadly I am not aware of a good way of doing that. You can use fdbdr to get a hot standby, but then you still risk some data loss if you lose a data center.

The fundamental problem with two data centers is that there really is no good way to distribute your coordinators across two data centers. The coordinators run a majority vote so you always need to have a majority of coordinators up in order to survive a data center failure.
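
To make the majority argument concrete, here is a small Python sketch. It is plain arithmetic, not FoundationDB code, and the function name is made up for illustration. It tries every split of up to 9 coordinators across two data centers and checks whether a strict majority survives the loss of either one.

```python
def survives_any_dc_loss(counts):
    """Return True if a strict majority of coordinators remains
    after losing any single data center listed in `counts`."""
    total = sum(counts)
    quorum = total // 2 + 1  # strict majority of all coordinators
    return all(total - lost >= quorum for lost in counts)

# Every way to place 1..9 coordinators across exactly two data centers.
two_dc_ok = [
    (a, n - a)
    for n in range(1, 10)
    for a in range(0, n + 1)
    if survives_any_dc_loss((a, n - a))
]
print(two_dc_ok)  # prints []
```

No split works, because whichever data center holds at least half of the coordinators takes the quorum down with it.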

That’s a good point, Markus!

My most important requirement is having two copies of the data in each of the two data centers. To keep write/update capability after a DC failure in a two-datacenter scenario, I could potentially maintain a small satellite in AWS so that a majority of coordinators remains available to the surviving data center.

Another question comes to mind, though: if a datacenter (which happens to have a minority of coordinators) loses its outbound network link but is otherwise functioning internally, can I still read the data in that data center, even if it cannot be changed?

If you can do that (for some users this might not be possible for security compliance reasons), that would probably work well, as the amount of traffic that goes to the coordinators is tiny.

No. In such a scenario the cluster would not be able to recover and clients wouldn’t be able to open a connection.

Then it sounds like you’ll be fine. I’d suggest running 9 coordinators (3 in each DC and 3 in AWS) so that you can lose 1 DC and 1 machine and still have quorum. If you make a region config where you treat each of your datacenters as a “region”, you’ll be able to achieve your goals with multi-region. You should set one of the datacenters as your preferred primary, and then set up the other datacenter to be both a satellite and the preferred secondary.
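
As a rough check of the 9-coordinator suggestion, the sketch below (plain Python; the site names and the function are illustrative, and it assumes the extra failed machine hosts exactly one coordinator) tests whether a strict majority remains after losing a whole site plus one more coordinator.

```python
def survives_site_plus_one(layout):
    """Return True if a strict majority of coordinators remains after
    losing any one whole site plus one further coordinator elsewhere."""
    total = sum(layout.values())
    quorum = total // 2 + 1  # strict majority of all coordinators
    return all(total - n - 1 >= quorum for n in layout.values())

# 3 coordinators per site: 9 total, quorum 5, worst case leaves 9 - 3 - 1 = 5.
print(survives_site_plus_one({"dc1": 3, "dc2": 3, "aws": 3}))  # True

# A smaller 2/2/1 layout: 5 total, quorum 3, worst case leaves 5 - 2 - 1 = 2.
print(survives_site_plus_one({"dc1": 2, "dc2": 2, "aws": 1}))  # False
```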

It’s worth underscoring that in such a configuration you also need the logs to have a presence in both regions, which is what setting the secondary data center as a satellite does. Otherwise FDB will never fail over from the primary to the secondary, because it can’t guarantee that some data doesn’t exist in only one region.

Thanks Alex, that’s also good to keep in mind!

There is an alternative design that I’m considering, which is to create a cluster in a single data center and use fdbdr to replicate its data to a standby cluster in a second data center. Each cluster could use triple or double redundancy.

There are two questions I have about this: 1.) Can I use the standby as a read-only database, with the understanding that it might be behind the primary? 2.) What is the procedure for switching the standby to full read/write if the network link between the data centers is unavailable (I believe ‘fdbdr switch’ requires both clusters to be reachable)?

Yes, but your clients need to set the read_lock_aware transaction option.
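
A minimal sketch of what that looks like with the Python bindings, assuming `tr` is a FoundationDB transaction opened against the standby cluster (the function name is illustrative):

```python
def read_from_locked_standby(tr, key):
    """Read `key` from a locked DR destination inside transaction `tr`.

    Assumes `tr` is a transaction from the FoundationDB Python bindings.
    """
    # Let this transaction read even though the database is locked.
    tr.options.set_read_lock_aware()
    # The value may lag behind the primary.
    return tr[key]
```

With the real bindings you would typically decorate this with `@fdb.transactional` and pass it a database opened from the standby's cluster file.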

Use fdbdr abort on the secondary-now-primary.

If your primary comes back online, you’ll need to use fdbbackup cleanup to stop it from continuing to save the mutation stream. (Though it’s a command on fdbbackup, it applies to DR also.)

This is some excellent clarification, Alex! I assume the abort operation is required because there will be a replication process running on each cluster, pointed at the other? I keep seeing various DR configurations discussed here in the forums, on blogs, etc., but I haven’t seen any comprehensive explanation of these configurations in the official documentation. Is there an extensive DR writeup that I’m missing?

I’ve also seen mentions of bi-directional replication with a two-cluster setup. Is that also a possibility, where you have two read+write clusters that replicate data to each other?

The abort operation is required to make the secondary stop waiting for and applying mutations from the primary, and to unlock it for use as the new primary.

The only DR and backup documentation I know of is “Backup, Restore, and Replication for Disaster Recovery”.

I think what you’ve seen is discussion of setting up DR agents in both directions, so that you can start a DR in either direction. Presumably once you convert a secondary into a primary, you’ll want to convert the former primary into a DR secondary, and having dr_agents already configured to run makes that much easier. You cannot run DRs in both directions at once to try to build a multi-master FDB.