I’m trying to create a two-datacenter configuration, with three fdb nodes in each datacenter.
My requirements:
1. No more than two datacenters exist.
2. Active/passive: under normal conditions the first datacenter serves the data and the second keeps a replica.
3. When the entire second datacenter fails, the first datacenter should continue working without any downtime. Some performance penalty is acceptable.
4. When the entire first datacenter fails, it should be possible to activate the second datacenter to serve the data. Some downtime, a small data loss and a manual reconfiguration are acceptable.
5. The roles of the two datacenters can be switched for maintenance without any data loss. A small downtime and a manual reconfiguration are acceptable.
My first approach was to build a DR cluster. This solution satisfies all 5 requirements, but there are two problems:
1. The DR solution has a performance penalty, because all mutations also have to be written to the system keyspace, which doubles the write volume.
2. The DR solution appears to be obsoleted by the multi-region configuration.
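For reference, the DR approach is driven with the fdbdr tool. A minimal sketch of what such a setup looks like, assuming hypothetical cluster file paths for the two datacenters (the flags are from memory, so check them against the fdbdr usage output):

```
# Continuously replicate the active cluster (dc1) into the passive one (dc2)
fdbdr start -s /etc/foundationdb/dc1.cluster -d /etc/foundationdb/dc2.cluster

# Watch the replication state and lag
fdbdr status -s /etc/foundationdb/dc1.cluster -d /etc/foundationdb/dc2.cluster

# Planned role switch for maintenance (requirement 5): waits for the copy to catch up,
# then locks the old primary and makes dc2 the writable cluster, with no data loss
fdbdr switch -s /etc/foundationdb/dc1.cluster -d /etc/foundationdb/dc2.cluster

# Unplanned failover when dc1 is lost (requirement 4): detach the copy on the
# destination side only and accept a small data loss
fdbdr abort --dstonly -d /etc/foundationdb/dc2.cluster
```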
Then I tried the suggested multi-region configuration with two regions, each containing a single datacenter. I used six coordinator processes, three in each datacenter. But this configuration didn’t satisfy requirements 3 and 4: when either datacenter failed, three coordinator processes were not enough for the cluster to keep working. It seems the multi-region configuration only becomes useful with three or more datacenters, which contradicts requirement 1.
Any asymmetric configuration (e.g. 4 + 3 coordinators) does not survive the failure of the datacenter holding the most coordinators.
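For concreteness, a multi-region setup of that shape would look roughly like the sketch below; the datacenter ids and addresses are placeholders, not my exact commands:

```
# regions.json: two regions with a single datacenter each, dc1 preferred as primary
cat > /tmp/regions.json <<'EOF'
{
  "regions": [
    { "datacenters": [ { "id": "dc1", "priority": 1 } ] },
    { "datacenters": [ { "id": "dc2", "priority": 0 } ] }
  ]
}
EOF

# Apply the region configuration and keep a full replica in both regions
fdbcli --exec 'fileconfigure /tmp/regions.json'
fdbcli --exec 'configure usable_regions=2'

# Six coordinators, three per datacenter
fdbcli --exec 'coordinators 10.0.1.1:4500 10.0.1.2:4500 10.0.1.3:4500 10.0.2.1:4500 10.0.2.2:4500 10.0.2.3:4500'

# Every fdbserver process also has its datacenter id set
# (locality_dcid in foundationdb.conf, if I remember the option name correctly)
```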
This is because if an entire region fails, it is still possible to recover to the other region if you are willing to accept a small amount of data loss. However, if you have lost a majority of coordinators, this becomes much more difficult.
But I can’t find any step-by-step information on how to recover an fdb cluster when a majority of the coordinators is not available. Is it even feasible?
This is theoretically feasible, but not implemented. It’s been discussed as being on the roadmap before, and I’d suggest @markus.pilman as perhaps a good person to talk to about when that would be implemented.
Assuming you cannot hide a coordinator in some third region somewhere, then for now the “correct” way to do this would be to use two clusters and DR, precisely as you outlined above. I think I’d say more “deprecated” than “obsoleted”, as I haven’t heard of any plan to remove it yet, but @mengxu is welcome to correct me if I’m wrong. The double-write penalty is just something you’d have to live with until the better solution becomes available.
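For example, if a single coordinator-only process can live in some third location, a 2 + 2 + 1 layout keeps a coordinator majority through the loss of either main datacenter. A hypothetical sketch, with made-up addresses:

```
# Two coordinators per main datacenter plus one tie-breaker elsewhere:
# losing either main DC still leaves 3 of the 5 coordinators available
fdbcli --exec 'coordinators 10.0.1.1:4500 10.0.1.2:4500 10.0.2.1:4500 10.0.2.2:4500 10.0.3.1:4500'
```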
how to recover an fdb cluster when a majority of the coordinators is not available. Is it even feasible?
This is theoretically feasible, but not implemented.
I’ve managed to recover my fdb cluster when the majority of coordinators was lost.
Initial state: two datacenters, a primary and a remote, in two regions. Four coordinators: three in the primary and one in the remote.
Then the entire primary datacenter failed.
Steps to recover (a rough shell sketch follows the list):
Stop foundationdb in the secondary datacenter.
Modify the cluster file to have three coordinators in the second datacenter.
Copy the coordination-* files from the original coordinator in the second datacenter to the two new ones.
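A rough shell sketch of those steps, purely as an illustration; the service name, paths, cluster id and addresses are assumptions about a typical Linux install and will differ elsewhere:

```
# On the dc2 hosts: stop foundationdb in the secondary datacenter
sudo service foundationdb stop

# Rewrite the cluster file to list three dc2 coordinators instead of the lost dc1 ones
echo 'mydb:Ab1cD2eF@10.0.2.1:4500,10.0.2.2:4500,10.0.2.3:4500' | sudo tee /etc/foundationdb/fdb.cluster

# Copy the coordinated state from the surviving dc2 coordinator to the two new ones
scp /var/lib/foundationdb/data/4500/coordination-* 10.0.2.2:/var/lib/foundationdb/data/4500/
scp /var/lib/foundationdb/data/4500/coordination-* 10.0.2.3:/var/lib/foundationdb/data/4500/

# Then start foundationdb again on the dc2 hosts
sudo service foundationdb start
```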
That will work, but it’s possible that it won’t recover; if it does recover, you may lose an unbounded amount of data, and you theoretically open yourself up to database corruption.
A recovery (or more than one) could have happened, and written the new coordinated state to only the three coordinators in the primary. Your coordinated state in the secondary is thus stale, and doesn’t know that it is stale. When you copy it to more coordinators to get back to having a quorum, you’re restoring a stale coordinated state. It’s possible that it points only to transaction log instances that no longer exist, and thus recovery will block forever. It’s possible that it points to a subset of the older transaction log instances that do exist, and then you’ll lose all data written in the newer generations of transaction logs (but it will still be a consistent snapshot).
It’s also possible that the primary half of the database could come back online, unaware of your manual coordinator changes, and then you’d have two FDB clusters both trying to use the same transaction logs, which will probably result in very weird behavior.
So it will work, but there are a lot of caveats, which is why #2022 exists: to provide a safe(r) way of doing such an operation.
Yes, this scenario is not safe. For safety, I’d add a step:
6. Prevent the primary datacenter from starting up
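For example, something along these lines on every dc1 host before touching the coordinators in dc2 (assuming a systemd-managed install; the service name may differ):

```
# Make sure the old primary cannot come back with its stale coordinated state
sudo systemctl disable --now foundationdb
```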
But sometimes splitting a cluster into two independent parts is a desired goal, for example when I want to create a full copy of the data from a working cluster for testing.
Earlier I was using a DR cluster for cloning, but it seems this is also possible with the multi-region configuration.