Just ran into a pretty interesting incident with our FoundationDB DR cluster, and I want to dig deep into how fdb_dr
works internally after this. For us, the DR cluster is just as critical as the primary.
Reproduce
- Setup: 2 FDB clusters, 1 primary and 1 DR cluster.
- Run 3 agents with the correct source connection string (primary) and destination connection string (DR).
- Update 1 agent to use a wrong source connection string (primary). I won't detail how this can happen in reality, but in our case it came from a coordinator change → stale connection string.
Actual behavior
Our DR cluster completely fell behind! Even though dr_backup.instances_running reported 2, the DR cluster just couldn't keep up with the primary.
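This is also why our alerting missed it: we were watching instances_running alone. A minimal sketch of a healthier check, parsing the dr_backup layer out of `status json` — note that the JSON snippet and the `seconds_behind` field name here are illustrative assumptions (only instances_running is the field from our actual status output; check your own `fdbcli --exec 'status json'` for the exact lag field under the DR tag):

```python
import json

# Trimmed, illustrative status-json snippet. Field names other than
# instances_running are assumptions for the sketch, not verified output.
STATUS = json.loads("""
{
  "cluster": {
    "layers": {
      "dr_backup": {
        "instances_running": 2,
        "tags": {
          "default": {
            "running_backup": true,
            "seconds_behind": 87000
          }
        }
      }
    }
  }
}
""")

def dr_healthy(status, max_lag_seconds=300):
    """instances_running alone is not enough: also check per-tag lag."""
    layer = status["cluster"]["layers"]["dr_backup"]
    if layer.get("instances_running", 0) < 1:
        return False
    for tag in layer.get("tags", {}).values():
        if tag.get("seconds_behind", 0) > max_lag_seconds:
            return False
    return True

print(dr_healthy(STATUS))  # → False: ~24h of lag despite 2 instances running
```

In our incident this check would have fired immediately, while the instances_running-only check stayed green.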
Expected behavior
The DR cluster should have continued syncing normally. Maybe at a slower pace, sure, since one agent was out of commission, but it should still have been running. The instances_running metric reading 2 kinda reinforced that expectation.
This feels like a bug to me. I'm really keen to hear what others think and, more importantly, to get a deeper dive into how fdb_dr agents cooperate behind the scenes. Any insights on that collaboration would be hugely appreciated!