Continuously increasing DR lag after migrating from FDB 6.2 to 7.3

Hi folks,
We’ve been running into issues with FoundationDB DR after migrating from 6.2 to 7.3. On 7.3 the DR lag continuously increases and never recovers on its own; aborting the DR and restarting it doesn’t help. With the same workload on 6.2, the lag stays mostly below 200 ms.

Some additional details (I can provide more if needed):

  • Workloads:

    • Source clusters:
      • FDB 7.3:
        • Reads: 11 MiB/s - 3.6k ops/s
        • Writes: 600 KB/s - 3.9k ops/s
      • FDB 6.2:
        • Reads: 34 MiB/s - 3.1k ops/s
        • Writes: 600 KB/s - 3.7k ops/s
    • Destination clusters:
      • FDB 7.3:
        • Reads: 3.5 MiB/s - 1.5k ops/s
        • Writes: 1.2 MiB/s - 5.2k ops/s
      • FDB 6.2:
        • Reads: 2.6 MiB/s - 1.2k ops/s
        • Writes: 1.4 MiB/s - 6k ops/s
  • Clusters are the same size:

    • 36 storage servers on each
    • 4 log servers
    • 3 proxies (on 7.3: 1 GRV proxy and 3 commit proxies)
    • Increasing the number of log servers and commit proxies didn’t help
  • System data kept growing continuously as the lag increased, reaching 15 GB (surpassing the source cluster’s data size) before we stopped writes. On 6.2, the backlog stays below a few MB.

  • Commit proxy latency was around 150 ms on 7.3, compared to a few milliseconds on 6.2.

  • We have 3 DR agents running

We also noticed that FDB 7.x introduced knobs controlling the amount of log data copied per DR task (introduced by this PR). We tried tuning knob_copy_log_block_size and knob_copy_log_blocks_per_task to better match the 6.2 behavior, but it didn’t noticeably improve the situation. We set the block size to 100000 (the 6.2 value) and tried blocks-per-task values like 1, 100, and 1000. What we did notice, however:

After the destination cluster had been catching up for some time (it never reached the source cluster’s size), in-flight data movement stopped and the bytes read on the source cluster jumped to 3 GB/s.
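For reference, this is roughly how we passed the knob overrides to the DR agents. This is a sketch: the cluster file paths are placeholders, and the exact flag spellings may vary by version (check `dr_agent --help`); `--knob_<name>` is the generic knob-override convention for FDB binaries.

```shell
# Sketch: restart each DR agent with the knob overrides we experimented with.
# Cluster file paths are placeholders for our environment.
dr_agent \
  -s /etc/foundationdb/source.cluster \
  -d /etc/foundationdb/destination.cluster \
  --knob_copy_log_block_size 100000 \
  --knob_copy_log_blocks_per_task 100
```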

Has anyone encountered similar behavior with DR in 7.x, or seen regressions compared to 6.2?
Any guidance on what to look at next, or on how to set up the knobs, would be appreciated.
Thanks

Sharing some more details for Hussein because we can’t publish more than one image.

Here is the graph showing the sum of cluster.machines.<machine_id>.network.megabits_sent.hz (as described in the Monitored Metrics — FoundationDB 7.4.5 documentation).

Blue line is source, purple is destination.

The surge seems to happen when the DR switches from loading the snapshot to applying the latest transactions.

We’re seeing 30 GiB/s transferred from a source cluster holding around 500 GiB of data, so most of that transfer must be wasted.
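For anyone wanting to reproduce the graph: this is a minimal sketch of how we aggregate that metric from the output of `fdbcli --exec "status json"`. The metric path follows the Monitored Metrics documentation; the synthetic status document at the bottom is just illustrative.

```python
def total_megabits_sent(status: dict) -> float:
    """Sum cluster.machines.<machine_id>.network.megabits_sent.hz
    across all machines in a parsed `status json` document."""
    machines = status.get("cluster", {}).get("machines", {})
    return sum(
        m.get("network", {}).get("megabits_sent", {}).get("hz", 0.0)
        for m in machines.values()
    )

# In practice the dict comes from:
#   status = json.loads(subprocess.check_output(
#       ["fdbcli", "--exec", "status json"]))
# Here we use a synthetic document for illustration:
status = {
    "cluster": {
        "machines": {
            "m1": {"network": {"megabits_sent": {"hz": 120.5}}},
            "m2": {"network": {"megabits_sent": {"hz": 79.5}}},
        }
    }
}
print(total_megabits_sent(status))  # 200.0
```

Plotting that sum per cluster over time is what produced the blue (source) and purple (destination) lines above.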