Continuously increasing DR lag after migrating from FDB 6.2 to 7.3

Hi folks,
We’ve been running into issues with FoundationDB DR after migrating from 6.2 to 7.3. The DR lag in 7.3 continuously increases and never recovers on its own, and aborting and restarting the DR doesn’t help. With the same workload on 6.2, the lag stays mostly below 200ms.

Some additional details (I can provide more if needed):

  • Workloads:

    • Source clusters:
      • FDB 7.3:
        • Reads: 11 MiB/s - 3.6k ops/s
        • Writes: 600 KB/s - 3.9k ops/s
      • FDB 6.2:
        • Reads: 34 MiB/s - 3.1k ops/s
        • Writes: 600 KB/s - 3.7k ops/s
    • Destination clusters:
      • FDB 7.3:
        • Reads: 3.5 MiB/s - 1.5k ops/s
        • Writes: 1.2 MiB/s - 5.2k ops/s
      • FDB 6.2:
        • Reads: 2.6 MiB/s - 1.2k ops/s
        • Writes: 1.4 MiB/s - 6k ops/s
  • Clusters are the same size:

    • 36 storage servers on each
    • 4 log servers
    • 3 proxies (1 GRV proxy and 3 commit proxies on 7.3)
    • Increasing the number of log servers and commit proxies didn’t help
  • System data kept growing as the lag increased, reaching 15GB (more than the source cluster’s data size) before we stopped writes. In 6.2, system data remains below a few MB.

  • Commit proxy latency was around 150ms on 7.3, compared to a few milliseconds on 6.2.

  • We have 3 DR agents running
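
For anyone wanting to track the system-data growth described in the list above, here is a minimal Python sketch that compares system to total key-value bytes from a parsed `fdbcli --exec 'status json'` document. It assumes the `cluster.data.system_kv_size_bytes` and `cluster.data.total_kv_size_bytes` fields are present in your version's status output; the function name and the sample numbers are made up for illustration.

```python
import json


def system_data_ratio(status: dict) -> float:
    """Return system keyspace bytes divided by total key-value bytes.

    Assumption: the status document exposes
    cluster.data.system_kv_size_bytes and cluster.data.total_kv_size_bytes.
    A KeyError is raised if either field is missing.
    """
    data = status["cluster"]["data"]
    total = data["total_kv_size_bytes"]
    if total == 0:
        return 0.0
    return data["system_kv_size_bytes"] / total


if __name__ == "__main__":
    # Stand-in for the output of: fdbcli --exec 'status json'
    # The byte counts below are invented, not measurements.
    sample = json.loads(
        '{"cluster": {"data": {"system_kv_size_bytes": 15000000000,'
        ' "total_kv_size_bytes": 25000000000}}}'
    )
    print(f"system/total ratio: {system_data_ratio(sample):.2f}")
```

Polling this on both clusters while DR is running would show whether the mutation log backlog is still growing.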

We also noticed that FDB 7.x introduced knobs related to increasing the amount of logs to copy for DR (introduced by this PR). We tried tuning knob_copy_log_block_size and knob_copy_log_blocks_per_task to better match the 6.2 behavior, but it didn’t noticeably improve the situation. We tried setting the block size to 100000, which was the block size in 6.2, and the number of blocks per task to values like 100, 1, and 1000. What we’ve noticed, however:

After some time of catching up from the destination cluster (never reached the source cluster size), data moving inflight stopped and the amount of bytes read on the source cluster jumped to 3GB/s.
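
To make the knob combinations mentioned above easier to compare, here is a rough Python sketch of how much mutation log one copy task would cover. It rests on two assumptions that are not confirmed in this thread: that the block size is measured in versions, and that versions advance at roughly 1,000,000 per second. The function name is hypothetical.

```python
# Assumption: FDB versions advance at roughly one million per second.
VERSIONS_PER_SECOND = 1_000_000


def seconds_per_task(block_size: int, blocks_per_task: int) -> float:
    """Approximate span of mutation log (in seconds) covered by one task.

    Assumption: block_size is expressed in versions.
    """
    return block_size * blocks_per_task / VERSIONS_PER_SECOND


if __name__ == "__main__":
    block_size = 100_000  # the 6.2 block size quoted in the post
    for blocks_per_task in (1, 100, 1000):
        span = seconds_per_task(block_size, blocks_per_task)
        print(f"{blocks_per_task:>5} blocks/task -> {span:g} s of log per task")
```

Under those assumptions the three settings tried span from a fraction of a second to well over a minute of log per task, which may help when reasoning about task granularity.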

Has anyone encountered similar behavior with DR in 7.x, or seen regressions compared to 6.2?
Any guidance on what to look at next or how to set up the knobs would be appreciated.
Thanks

Sharing some more details for Hussein because we can’t publish more than one image.

Here is the graph showing the sum of cluster.machines.<machine_id>.network.megabits_sent.hz (as described in Monitored Metrics — FoundationDB 7.4.5 documentation).

Blue line is source, purple is destination.
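
For anyone wanting to reproduce the graphed value, a minimal Python sketch of the sum over cluster.machines.<machine_id>.network.megabits_sent.hz follows. It works on an already parsed status JSON document; the function name and the sample machine IDs and rates are made up for illustration.

```python
def total_megabits_sent(status: dict) -> float:
    """Sum network.megabits_sent.hz over every machine in a status document.

    Machines that lack the network section are counted as 0.
    """
    machines = status.get("cluster", {}).get("machines", {})
    total = 0.0
    for machine in machines.values():
        sent = machine.get("network", {}).get("megabits_sent", {})
        total += sent.get("hz", 0.0)
    return total


if __name__ == "__main__":
    # Invented sample with two machines; real input would come from
    # the status JSON of the source or destination cluster.
    sample = {
        "cluster": {
            "machines": {
                "machine-a": {"network": {"megabits_sent": {"hz": 120.5}}},
                "machine-b": {"network": {"megabits_sent": {"hz": 79.5}}},
            }
        }
    }
    print(f"total megabits sent per second: {total_megabits_sent(sample)}")
```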

The surge seems to happen when the DR switches from loading the snapshot to applying the latest transactions.

We’re seeing 30GiB/s transfer from a source cluster of around 500GiB data, so most of that transfer must be wasted.
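
To put those two figures side by side, a quick back-of-the-envelope calculation in Python (the function name is hypothetical; the inputs are the numbers quoted above):

```python
def seconds_to_send_dataset(dataset_gib: float, rate_gib_per_s: float) -> float:
    """Time needed to transfer the whole dataset once at the given rate."""
    return dataset_gib / rate_gib_per_s


if __name__ == "__main__":
    # 500 GiB of data at 30 GiB/s, as quoted in the post.
    print(f"{seconds_to_send_dataset(500, 30):.1f} s per full copy")
```

At that rate the equivalent of the entire source data set would go over the wire in well under a minute, which is why a sustained surge at that level looks like redundant transfer.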

Did you ever get anywhere with this? We’re running FDB DR on 7.3. We only started using FDB at 7.0 and only with any significant data volumes more recently, so we don’t have ‘prior art’ to compare it with, but we’re having serious issues with our DR replication currently.

In our case it seems like the mutation log is being replicated from the source to the target cluster in line with writes to the source, but the application of that mutation log from the system keyspace to the user keyspace on the target cluster seems abysmally slow and none of the relevant knobs seem to affect it much.

We’re running 7.3.37 currently.


Hello, we basically have the same experience so far…

Tweaking the knobs did work at first on our side, but whenever a few nodes die or a large batch job runs, the DR can lag for hours or days, which is equivalent to not having a DR cluster and relying on an fdbrestore process from a backup on S3. Backup, Restore, and Replication for Disaster Recovery — FoundationDB ON documentation

We’re planning on moving to disk snapshots instead of maintaining replica clusters with DR syncing them with primaries. Disk snapshot backup and Restore — FoundationDB ON documentation