Very weird one here, looking for any insight into how it might have happened.
We’ve got 2 FDB clusters in our development environment, one in each of 2 regions. The primary is normally set to replicate to the secondary. I manually rolled out a change I was testing (it had worked in a fresh test cluster) to our primary dev cluster, and the result was to utterly destroy the primary cluster: I mistakenly triggered a format of the volumes on which the FDB data directory was held. No worries, this is why we have a dev env and DR.
- On the secondary cluster, I ran `fdbdr abort --dstonly --source <primary_cluster_file> --destination <secondary_cluster_file>`.
- I (re)created a new DB on the original primary cluster, then on the secondary I ran `fdbdr start --source <secondary_cluster_file> --destination <primary_cluster_file>`.
- I verified that the DR was running, and that the sum of key/value storage on the new DB was increasing (a rough sketch of how I checked this is below the list). `fdbdr status ...` was reporting that the DR was not a complete copy (as expected).
- After some time, `fdbdr status ...` was reporting that the DR was a complete copy, and that the destination was ~1.5s behind. Checking the key/value sizes on both clusters gave expected and very close amounts too (our application was running from the secondary region at this point, so it was constantly having data written to it to sync across).
- I ran `fdbdr switch --source <secondary_cluster_file> --destination <primary_cluster_file>`.
- I got a fatal error from `fdbdr` …
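For context, this is roughly how I was watching the copy progress; the `cluster.data.total_kv_size_bytes` path in `status json` is from memory, so treat the exact field name as an assumption.

```
# DR status as reported from the secondary (current source) side
fdbdr status --source <secondary_cluster_file> --destination <primary_cluster_file>

# Watch the sum of key/value sizes grow on the rebuilt primary.
# Assumes status json exposes cluster.data.total_kv_size_bytes (field name from memory).
watch -n 10 "fdbcli -C <primary_cluster_file> --exec 'status json' | jq '.cluster.data.total_kv_size_bytes'"
```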
Running with trace logging enabled on the `fdbdr` command, it would seem the actual error is:
{ "Severity": "20", "Time": "1725793370.194168", "DateTime": "2024-09-08T11:02:50Z", "Type": "DBA_TagNotPresentInStatus", "ID": "0000000000000000", "Tag": "default", "Context": "dr_backup_dest", "ThreadID": "114316097773912472", "Machine": "10.129.244.16:49211", "LogGroup": "default", "ClientDescription": "primary-7.1.49-114316097773912472" }
I’ve since verified this. The new DB seems to have no knowledge that it is a destination for DR replication (`fdbcli --exec 'status details'`, and also combing through the output of `status json`). But `fdbdr status ...` still says that the destination is a complete copy of the DB ~1.5s behind the source, and the destination does seem to be getting new keys from the source written to it.
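To be concrete about “combing through `status json`”: what I was looking for was something like the below, though the `cluster.layers.dr_backup_dest` path is my guess based on the trace event’s `Context` field rather than anything I’ve confirmed in the docs.

```
# On the rebuilt primary (the supposed DR destination): does the layer status
# contain a dr_backup_dest section with a "default" tag? Field path is an
# assumption inferred from the DBA_TagNotPresentInStatus event above.
fdbcli -C <primary_cluster_file> --exec 'status json' | jq '.cluster.layers | keys, .dr_backup_dest.tags'
```

On the new primary that section appears to be missing entirely, which matches the “no knowledge it is a DR destination” observation.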
At this point I’m assuming (but haven’t yet tried) that I should be able to run `fdbdr abort ...` on the secondary cluster, switch our application back to the primary region (there might be a small amount of data loss, but I don’t care overmuch because it’s dev), and then wipe the secondary and re-set up the DR sync in the intended direction.
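Concretely, the plan would look something like the sketch below; the `clearrange` is just one assumed way to wipe the secondary’s keyspace before re-pointing DR at it, so treat this as a sketch rather than a tested runbook.

```
# 1. Abort the current secondary -> primary DR (both clusters are reachable now,
#    so no --dstonly this time)
fdbdr abort --source <secondary_cluster_file> --destination <primary_cluster_file>

# 2. Switch application traffic back to the primary region (outside FDB)

# 3. Wipe the secondary's keyspace so it can become a clean DR destination
#    (assumes nothing else on that cluster needs to survive)
fdbcli -C <secondary_cluster_file> --exec 'writemode on; clearrange "" \xff'

# 4. Re-establish DR in the originally intended direction (primary -> secondary)
fdbdr start --source <primary_cluster_file> --destination <secondary_cluster_file>
```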
But I am kinda confused as to how it got into this state in the first place. I unfortunately wiped my terminal history for that tab (to make following output from something else easier), so I can’t look back and confirm whether the new DB ever recognised that it was a DR destination.