Working Highly Available Solutions with Two Datacenters

Hello!

I’m trying to create a two-datacenter configuration, with 3 fdb nodes in each datacenter.

My requirements:

  1. No more than two datacenters exist.
  2. Active/Passive. Under normal conditions the first datacenter serves the data and the second keeps a replica.
  3. When the whole second datacenter fails, the first datacenter should continue working without any downtime. Some performance penalty is acceptable.
  4. When the whole first datacenter fails, it should be possible to activate the second datacenter to serve the data. Some downtime, a small data loss and a manual reconfiguration are acceptable.
  5. It should be possible to switch the roles of the two datacenters for maintenance without any data loss. Small downtime and a manual reconfiguration are acceptable.

My first approach was to build a DR cluster. This solution satisfies all 5 requirements, but there are 2 problems:

  1. This solution is declared obsolete in “Design and Implementation of a Performant Restore System in FDB”.
  2. The DR solution has a performance penalty because all mutations also need to be written to the system keyspace, which doubles the write volume.

I tried to use the suggested multi-region configuration with two regions, each having a single datacenter. I used six coordinator processes: three in each datacenter. But this configuration didn’t satisfy requirements 3 and 4: when either datacenter failed, the three remaining coordinator processes were not enough to continue working. It seems the multi-region configuration becomes useful only with three or more datacenters, which contradicts requirement 1.

Any asymmetric configuration (4 + 3 coordinators) does not survive when the datacenter with the most coordinators fails.
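The coordinator arithmetic behind both observations can be sketched in a few lines. This is only an illustration, assuming that the cluster needs a strict majority of all configured coordinators to stay reachable; the datacenter names and the helper functions are hypothetical.

```python
def majority(total: int) -> int:
    """Smallest number of coordinators that is a strict majority of `total`."""
    return total // 2 + 1


def survives_loss(per_dc: dict, failed_dc: str) -> bool:
    """True if the coordinators outside `failed_dc` still form a majority."""
    total = sum(per_dc.values())
    remaining = total - per_dc[failed_dc]
    return remaining >= majority(total)


def report(per_dc: dict) -> dict:
    """For each datacenter, tell whether the cluster survives losing it."""
    return {dc: survives_loss(per_dc, dc) for dc in per_dc}


# Layouts discussed in the post (datacenter names are placeholders).
symmetric = report({"dc1": 3, "dc2": 3})    # 3 + 3 coordinators
asymmetric = report({"dc1": 4, "dc2": 3})   # 4 + 3 coordinators

if __name__ == "__main__":
    print("3 + 3:", symmetric)
    print("4 + 3:", asymmetric)
```

With 3 + 3, losing either datacenter leaves 3 of 6 coordinators, which is short of the 4 needed. With 4 + 3, only the loss of the smaller side is survivable. No split across exactly two datacenters survives the loss of either one.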

There is a passage in the documentation (https://apple.github.io/foundationdb/configuration.html#choosing-coordinators):

This is because if an entire region fails, it is still possible to recover to the other region if you are willing to accept a small amount of data loss. However, if you have lost a majority of coordinators, this becomes much more difficult.

But I can’t find any step-by-step information on how to recover an fdb cluster when a majority of coordinators are not available. Is it feasible at all?

This is theoretically feasible, but not implemented. It has been discussed as being on the roadmap before, and I’d suggest @markus.pilman as perhaps a good person to ask about when that would be implemented.

Assuming you cannot hide a coordinator in any third region somewhere, then for now, the “correct” way to do this would be to use two clusters and DR precisely as you outlined above. I think I’d say more “deprecated” than “obsoleted”, as I hadn’t heard of a removal of it planned yet, but @mengxu is welcome to correct me if I’m wrong. The double write penalty is just something you’d have to live with until the better solution becomes available.

You are right. I’m unaware of any plan to remove DR anytime soon (like in at least a year).
[cc. @ajbeamon @Evan ]

Thanks for the replies, @mengxu @alexmiller.

I also found the discussion “Two datacenters with double redundancy in each?”, but the recommendation there is to use a separate (third) datacenter for coordinators.

I’ll use and recommend the DR solution as the high-availability solution for exactly two datacenters.
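For anyone landing on this thread later, here is a rough sketch of how the DR operations map to requirements 4 and 5. It only builds `fdbdr` command lines and does not run them. The cluster file names are hypothetical, `dr_agent` processes are assumed to be running against both clusters, and the `--dstonly` flag should be checked against `fdbdr --help` for your version before relying on it.

```python
import shlex

# Hypothetical cluster file paths; replace with your own.
PRIMARY = "primary.cluster"
SECONDARY = "secondary.cluster"


def fdbdr(action, source, destination, *extra):
    """Build an fdbdr command line as an argument list (nothing is executed)."""
    return ["fdbdr", action, "-s", source, "-d", destination, *extra]


def start_replication():
    # Normal operation: the primary replicates to the secondary.
    return fdbdr("start", PRIMARY, SECONDARY)


def check_status():
    return fdbdr("status", PRIMARY, SECONDARY)


def planned_switch():
    # Requirement 5: swap roles for maintenance while both clusters are up.
    return fdbdr("switch", PRIMARY, SECONDARY)


def emergency_failover():
    # Requirement 4: the primary is gone, so only the destination is touched.
    # Assumption: --dstonly is available in your fdbdr build.
    return fdbdr("abort", PRIMARY, SECONDARY, "--dstonly")


if __name__ == "__main__":
    for build in (start_replication, check_status, planned_switch, emergency_failover):
        print(shlex.join(build()))
```

After an emergency failover, clients still have to be pointed at the secondary’s cluster file, and any mutations that had not yet been replicated are lost, which matches the “small data loss” accepted in requirement 4.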