[Bugs] DR with multiple dr_agent

Just ran into a pretty interesting incident with our FoundationDB DR cluster, and I want to dig deep into how fdb_dr works internally after this. For us, the DR cluster is just as critical as the primary.

Reproduce

  • Setup: 2 FDB clusters, 1 primary and 1 DR cluster.

  • Run 3 agents, each with the correct source connection string (primary) and destination connection string (DR).

  • Update 1 agent to use a wrong source connection string (primary). I won’t detail how this can happen in practice, but in our case it came from a coordinator change → stale connection string.
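Roughly, the steps above look like this (a sketch, not our exact commands: the file paths are placeholders, and the `fdbdr`/`dr_agent` binaries are the stock FoundationDB ones; check the flags against your version):

```shell
# Start DR from the primary (source) into the DR cluster (destination).
fdbdr start -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster

# Run two DR agents with the correct cluster files.
dr_agent -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster &
dr_agent -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster &

# The third agent gets a stale copy of the primary's cluster file,
# e.g. one written before a coordinator change.
dr_agent -s /etc/foundationdb/primary-stale.cluster -d /etc/foundationdb/dr.cluster &
```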

Actual behavior

Our DR cluster completely fell behind! Even though dr_backup.instances_running reported 2, the DR cluster just couldn’t keep up with the primary.
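For reference, this is roughly how we were watching it (a sketch; the `jq` path into `status json` is from memory and the exact layer field names may differ by FDB version):

```shell
# High-level DR view from fdbdr itself.
fdbdr status -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster

# instances_running comes from the layer status in `status json`
# on the destination cluster (field path assumed, verify on your version).
fdbcli -C /etc/foundationdb/dr.cluster --exec 'status json' \
  | jq '.cluster.layers.dr_backup.instances_running'
```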

Expected behavior

The DR cluster should have continued syncing normally. Maybe at a slower pace, sure, since one agent was out of commission, but it should still have been running. The instances_running metric being 2 kinda reinforced that expectation.

This feels like a bug to me. I’m really keen to hear what others think and, more importantly, get a deeper dive into how fdb_dr agents cooperate behind the scenes. Any insights on that collaboration would be hugely appreciated!