Fdbdr destination cluster doesn't recognise DR is running?

Very weird one here, looking for any insight into how it might have happened.

We’ve got 2 FDB clusters in our development environment, one in each of 2 regions. The primary is normally set to replicate to the secondary. I manually rolled out a change I was testing (which had worked fine in a fresh test cluster) to our primary dev cluster, and the result was the total destruction of the primary: I mistakenly triggered a format of the volumes holding the FDB data directory. No worries, this is why we have a dev environment and DR.

  • On the secondary cluster, I ran fdbdr abort --dstonly --source <primary_cluster_file> --destination <secondary_cluster_file>
  • I (re)created a new DB on the original primary cluster, then on the secondary I ran fdbdr start --source <secondary_cluster_file> --destination <primary_cluster_file>
  • I verified that the DR was running, and that the sum of key/value storage on the new DB was increasing. fdbdr status ... was reporting that the DR was not a complete copy (as expected).
  • After some time, fdbdr status ... was reporting that DR was a complete copy, and that the destination was ~1.5s behind. Checking the key/value sizes on both clusters gave expected and very close amounts too (our application was running from the secondary region at this point, so it was constantly having data written to it to sync across).
  • I ran fdbdr switch --source <secondary_cluster_file> --destination <primary_cluster_file>
  • I got a fatal error from fdbdr. (The full sequence, end to end, is shown just after this list.)
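
For clarity, here’s that same sequence as it would be run end to end. The cluster file paths are placeholders for our real ones, and I’m assuming fdbdr status accepts the same --source/--destination flags as the other subcommands:

    # On the secondary (still-healthy) cluster: detach the destination side of the old DR pair
    fdbdr abort --dstonly --source /etc/foundationdb/primary.cluster --destination /etc/foundationdb/secondary.cluster

    # After recreating the database on the primary, start DR in the reverse direction
    fdbdr start --source /etc/foundationdb/secondary.cluster --destination /etc/foundationdb/primary.cluster

    # Poll until the destination reports a complete copy
    fdbdr status --source /etc/foundationdb/secondary.cluster --destination /etc/foundationdb/primary.cluster

    # Attempt the switchover -- this is the step that failed
    fdbdr switch --source /etc/foundationdb/secondary.cluster --destination /etc/foundationdb/primary.cluster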

Re-running the fdbdr command with trace logging enabled, it would seem the actual error is:

{  "Severity": "20", "Time": "1725793370.194168", "DateTime": "2024-09-08T11:02:50Z", "Type": "DBA_TagNotPresentInStatus", "ID": "0000000000000000", "Tag": "default", "Context": "dr_backup_dest", "ThreadID": "114316097773912472", "Machine": "10.129.244.16:49211", "LogGroup": "default", "ClientDescription": "primary-7.1.49-114316097773912472" }

I’ve since verified that error: the new DB appears to have no knowledge that it is a destination for DR replication (checked with fdbcli --exec 'status details', and also by combing through the output of status json). But fdbdr status ... still says that the destination is a complete copy of the DB ~1.5s behind the source, and the destination does seem to be receiving the new keys written on the source.
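
For the record, this is roughly how I’m checking each side. The .cluster.layers path is from memory, so verify it against your own status json output; the cluster file path is a placeholder:

    # Does this cluster think it is involved in DR? Run against each cluster file in turn.
    fdbcli -C /etc/foundationdb/primary.cluster --exec 'status json' \
      | jq '.cluster.layers | {dr_backup, dr_backup_dest}'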

At this point I’m assuming (but haven’t yet tried) that I should be able to fdbdr abort ... on the secondary cluster, switch our application back to the primary region (there might be a small amount of data loss, but I don’t care overmuch because it’s dev), and then wipe the secondary and re-set up the DR sync in the intended direction.
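
In other words, something like the following. I haven’t run this yet, and the paths are placeholders:

    # Tear down the current (secondary -> primary) DR relationship
    fdbdr abort --source /etc/foundationdb/secondary.cluster --destination /etc/foundationdb/primary.cluster

    # ...repoint the application at the primary region, then wipe and recreate the secondary...

    # Re-establish DR in the intended (primary -> secondary) direction
    fdbdr start --source /etc/foundationdb/primary.cluster --destination /etc/foundationdb/secondary.cluster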

But I am kinda confused as to how it got into this state in the first place. I unfortunately wiped the terminal history for that tab (to make it easier to follow output from something else), so I can’t go back and confirm whether the new DB ever recognised that it was a DR destination.

More information from subsequent testing. I can replicate this reliably.

  • status json on the secondary cluster shows both a dr_backup block (the current DR process) and a dr_backup_dest block (the old, aborted process).
  • status json on the primary cluster shows only dr_backup, whose mutation_stream_id matches that of dr_backup_dest on the secondary (see the comparison sketch after this list). There is no dr_backup_dest block.
  • The primary cluster does have a database lock applied, presumably as a result of the new fdbdr process being started.
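
Quick-and-dirty comparison of the mutation stream IDs reported by each side. I’m not certain of the exact JSON path for mutation_stream_id, so this just greps for it; cluster file paths are placeholders:

    for cf in /etc/foundationdb/primary.cluster /etc/foundationdb/secondary.cluster; do
      echo "== $cf =="
      # Pull every mutation_stream_id mentioned anywhere in the layer status
      fdbcli -C "$cf" --exec 'status json' | grep -o '"mutation_stream_id" *: *"[^"]*"'
    done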

I am confused as to how the primary cluster still has any record of being a source for the old DR process, given that all of its volumes were formatted and the instances were terminated and recreated. All data was lost.

I have a theory. The fdb.cluster file contains a line which looks like seed:<ID>@IP1:port,IP2:port.... I’ve done a quick test, and because we statically assign an ENI to each coordinator instance, the IP:port values are identical between the old and new clusters. The <ID> has also ended up identical, even though I completely wiped the state data and recreated the DB.
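
To illustrate, both the old and the recreated primary end up with a cluster file shaped like the line below (all values invented, but the shape matches what our provisioning produces):

    seed:AbCdEf1234567890@10.129.244.10:4500,10.129.244.11:4500,10.129.244.12:4500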

Is the DR state in some way tied to that cluster ID? And is there some bug related to stopping and then recreating a DR setup with the same (but inverted) source and destination IDs?

I could potentially completely destroy all resources for the primary cluster and recreate them. That would result in a different set of static coordinator IPs being created, and presumably then a different cluster ID.

OK, no, that didn’t work. I got a completely new set of coordinator IPs and somehow the DB still knew it was originally a source for a now-aborted DR process, before I’d restarted.

Both the new primary and the old secondary clusters are already running dr_agent, pointed at each other, so I guess the dr_agent process is picking that information up from the destination cluster (which hasn’t changed coordinator IPs)? But how is it associating the old, aborted process with itself?
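
For context, the agents on each side are started roughly like this (same source/destination flag style as fdbdr; check dr_agent --help for the exact spelling on your version, and the paths are placeholders):

    # Agents serving the original (primary -> secondary) direction
    dr_agent --source /etc/foundationdb/primary.cluster --destination /etc/foundationdb/secondary.cluster

    # Agents serving the reversed (secondary -> primary) direction set up after the wipe
    dr_agent --source /etc/foundationdb/secondary.cluster --destination /etc/foundationdb/primary.cluster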

More weirdness.

fdbcli --exec 'status details' and fdbcli --exec 'status json' (plus some searching through the resulting doc) both show the destination cluster fluctuating on whether it’s a DR replica or not: sometimes it reports that it is, sometimes not. These are ephemeral instances configured by script, so all of them should have identical configuration and setup, but it feels like the answer depends on which agent most recently updated the magic keys in the DB, and that some agents are correctly connected while others are not…
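
A crude way to watch the flapping (again, the .cluster.layers path is from memory and the cluster file path is a placeholder):

    # Poll the destination every few seconds and log whether it currently reports itself as a DR destination
    while true; do
      ts=$(date -u +%H:%M:%S)
      state=$(fdbcli -C /etc/foundationdb/primary.cluster --exec 'status json' \
        | jq -r 'if .cluster.layers.dr_backup_dest then "dr_backup_dest present" else "dr_backup_dest missing" end')
      echo "$ts $state"
      sleep 5
    done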