I’m trying to create a two-datacenter configuration, with 3 fdb nodes in each datacenter. The requirements are:
1. No more than two datacenters exist.
2. Active/passive: under normal conditions the first datacenter serves the data and the second keeps a replica.
3. When the whole second datacenter fails, the first datacenter should continue working without any downtime. Some performance penalty is acceptable.
4. When the whole first datacenter fails, it should be possible to activate the second datacenter to serve the data. Some downtime, a small data loss and manual reconfiguration are acceptable.
5. It should be possible to switch the roles of the two datacenters for maintenance without any data loss. A small downtime and manual reconfiguration are acceptable.
My first approach was to build a DR cluster. This solution satisfies all 5 requirements, but there are 2 problems:
- This solution is declared obsolete in "Design and Implementation of a Performant Restore System in FDB".
- The DR solution has a performance penalty, because all mutations must also be written to the system keyspace, which doubles the write volume.
I then tried the suggested multi-region configuration with two regions, each containing a single datacenter. I used six coordinator processes, three in each datacenter. But this configuration didn’t satisfy requirements 3 and 4: when either datacenter failed, the three remaining coordinator processes were not enough to continue working. It seems a multi-region configuration only becomes useful with three or more datacenters, which contradicts requirement 1.
An asymmetric configuration (4 + 3 coordinators) does not help either: it does not survive the failure of the datacenter that holds the majority of the coordinators.
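To make the coordinator arithmetic concrete, here is a minimal sketch. It assumes only that the cluster needs a strict majority of the configured coordinators to be reachable; the function name and the `dc1`/`dc2` labels are made up for illustration and are not part of any FoundationDB API.

```python
def quorum_survives_loss_of(coordinators_per_dc):
    """For each datacenter, report whether the coordinators left after
    losing that whole datacenter still form a strict majority."""
    total = sum(coordinators_per_dc.values())
    quorum = total // 2 + 1  # strict majority of all configured coordinators
    return {
        dc: (total - count) >= quorum
        for dc, count in coordinators_per_dc.items()
    }


# The two layouts I tried: symmetric 3 + 3 and asymmetric 4 + 3.
for layout in ({"dc1": 3, "dc2": 3}, {"dc1": 4, "dc2": 3}):
    print(layout, quorum_survives_loss_of(layout))
```

With 3 + 3 the quorum is 4 of 6, so losing either datacenter leaves only 3 coordinators. With 4 + 3 the quorum is 4 of 7, so only the loss of the smaller side is survivable. With just two datacenters, no split of coordinators keeps a majority after losing either one.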
The documentation (https://apple.github.io/foundationdb/configuration.html#choosing-coordinators) says:
This is because if an entire region fails, it is still possible to recover to the other region if you are willing to accept a small amount of data loss. However, if you have lost a majority of coordinators, this becomes much more difficult.
But I can’t find any step-by-step information on how to recover an fdb cluster when a majority of the coordinators are unavailable. Is it feasible at all?