[Bugs] DR with multiple dr_agent

Just ran into a pretty interesting incident with our FoundationDB DR cluster, and I want to dig deep into how fdb_dr works internally after this. For us, the DR cluster is just as critical as the primary.

Reproduce

  • Setup: 2 FDB clusters, 1 primary and 1 DR cluster.

  • Run 3 agents, each with the correct source connection string (primary) and destination connection string (DR).

  • Update 1 agent to use a wrong source connection string (primary). I won’t detail how this can happen in practice, but in our case it came from a coordinator change → stale connection string.
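Roughly, the steps above look like this (a sketch, not our exact commands: the file paths are placeholders, and the `fdbdr`/`dr_agent` binaries are the stock FoundationDB ones; check the flags against your version):

```shell
# Start DR from the primary (source) into the DR cluster (destination).
fdbdr start -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster

# Run two DR agents with the correct cluster files.
dr_agent -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster &
dr_agent -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster &

# The third agent gets a stale copy of the primary's cluster file,
# e.g. one written before a coordinator change.
dr_agent -s /etc/foundationdb/primary-stale.cluster -d /etc/foundationdb/dr.cluster &
```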

Actual behavior

Our DR cluster completely fell behind! Even though dr_backup.instances_running reported 2, the DR cluster just couldn’t keep up with the primary.
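For reference, this is roughly how we were watching it (a sketch; the `jq` path into `status json` is from memory and the exact layer field names may differ by FDB version):

```shell
# High-level DR view from fdbdr itself.
fdbdr status -s /etc/foundationdb/primary.cluster -d /etc/foundationdb/dr.cluster

# instances_running comes from the layer status in `status json`
# on the destination cluster (field path assumed, verify on your version).
fdbcli -C /etc/foundationdb/dr.cluster --exec 'status json' \
  | jq '.cluster.layers.dr_backup.instances_running'
```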

Expected behavior

The DR cluster should have continued syncing normally. Maybe at a slower pace, sure, since one agent was out of commission, but it should still have been running. The instances_running metric being 2 kinda reinforced that expectation.

This feels like a bug to me. I’m really keen to hear what others think and, more importantly, get a deeper dive into how fdb_dr agents cooperate behind the scenes. Any insights on that collaboration would be hugely appreciated!