DR Explanations/Help

I’m trying to understand DR in FoundationDB.

I’m looking to understand how to configure things for the following scenario.

Colo facility (Primary):
This is where our main fdb cluster runs and where all active systems operate

AWS (DR site):
This is where I want to run a DR cluster

Getting DR up and running and syncing to the DR site is not an issue. What I’m failing to understand is how to fail over from one site to the other if the entire primary facility goes offline, e.g. a worst-case scenario where a hurricane knocks the facility offline for an extended period.

The documentation states that to ‘switch’, both clusters need to be online and accessible. In testing, this proves to be true. I tested this with a simple single → single DR setup and then stopped the primary. Attempts to use fdbdr switch and fdbdr abort both just hang, and the DR cluster remains locked and unusable.

So my questions are really:

  1. Is there a “blessed” method for manually disabling the DR process and unlocking the DR cluster when only the DR cluster is accessible?
  2. If we are able to do this, what happens when the inaccessible cluster becomes available again? Will it just start pushing data to the now-live DR site, resulting in corruption?

Maybe there is a blog post or tutorial I’ve been unable to find which covers DR better than the documentation.


Answering myself.

For this scenario, it appears that passing the --cleanup and --dstonly options to the fdbdr abort command will stop the DR process on the DR destination cluster and unlock the database.
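
For anyone scripting this, here is a minimal sketch of how that invocation could be assembled and run with a timeout, given that fdbdr hung in my earlier tests. The cluster file paths are hypothetical, and the -s/-d flags for the source and destination cluster files are my understanding of the fdbdr usage, so check them against fdbdr --help for your version before relying on this.

```python
import subprocess


def build_dst_only_abort(source_cluster_file, dest_cluster_file, fdbdr_bin="fdbdr"):
    """Build the argv for a destination-only DR abort.

    Assumption: -s and -d select the source and destination cluster files.
    --cleanup and --dstonly are the options described in the post above.
    """
    return [
        fdbdr_bin, "abort",
        "-s", source_cluster_file,
        "-d", dest_cluster_file,
        "--cleanup",
        "--dstonly",
    ]


def run_with_timeout(argv, timeout_s=60):
    """Run the command; return its exit code, or None if it timed out."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return done.returncode


if __name__ == "__main__":
    # Hypothetical paths; substitute your own cluster files.
    argv = build_dst_only_abort(
        "/etc/foundationdb/primary.cluster",
        "/etc/foundationdb/dr.cluster",
    )
    # Print rather than execute, so the command can be reviewed first.
    print(" ".join(argv))
```

Once the printed command looks right, passing the same argv to run_with_timeout on a host that can reach the DR cluster would execute it; a return value of None means it did not finish within the timeout.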

When the source cluster comes back online, the DR process is marked as aborted. I’m guessing this is initiated by the running dr_agent.

Even if the dr_agent is stopped, the destination appears to eventually change “Running DRs - 1 as secondary” to “Running DRs - 0” and the database is unlocked.

Hopefully I’m understanding things correctly and won’t shoot myself in the foot at some point because of it.