`fdbdr` failover duration longer than expected

We are using fdbdr to fail over between two clusters in order to change the hardware configuration and reduce the cluster size from 15 nodes to 9 nodes with a different FDB class topology. The original cluster and new cluster are not anywhere close to saturated by the workload.

Our application code is configured to know about a “primary” and “secondary” cluster for a given logical cluster name. If the application code sees the “database is locked” error, it retries on the secondary. If it detects both are locked, it logs an error message stating the client thinks both clusters are locked.
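The retry policy described above can be sketched roughly as follows. `DatabaseLockedError` and the `txn_fn` callback are illustrative stand-ins, not the real FDB client API (where the lock surfaces as error code 1038, `database_locked`):

```python
class DatabaseLockedError(Exception):
    """Stand-in for FDB's "database is locked" error (code 1038)."""

def run_transaction(clusters, txn_fn, log=print):
    """Try the transaction on the primary; on a lock error, retry it
    on the secondary. If both are locked, log and re-raise."""
    primary, secondary = clusters
    try:
        return txn_fn(primary)
    except DatabaseLockedError:
        try:
            return txn_fn(secondary)
        except DatabaseLockedError:
            log("client believes both clusters are locked")
            raise
```

During a normal failover only one cluster is locked at a time, so the "both are locked" branch should be rare; the two-minute window described below is exactly when it fired.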

We configure the secondary cluster as the DR target shortly before initiating the failover, but we wait until the “storage server pending mutations” metrics of the two clusters match before switching.
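The wait-until-caught-up step could be sketched like this; the two getter callables are placeholders for however “storage server pending mutations” is actually scraped per cluster (status json, trace logs, or an external metrics system):

```python
import time

def wait_for_matching_pending_mutations(get_primary, get_secondary,
                                        tolerance=1000, poll_s=5.0,
                                        max_polls=120):
    """Poll a pending-mutations-style metric on both clusters until the
    two readings agree within `tolerance`, then return True.
    Returns False if they never converge within `max_polls` polls."""
    for _ in range(max_polls):
        a, b = get_primary(), get_secondary()
        if abs(a - b) <= tolerance:
            return True
        time.sleep(poll_s)
    return False
```

Note this only shows that the storage tier's backlogs look similar; as the discussion below suggests, it is not necessarily the same thing as the DR mutation lag between the clusters.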

There was a 2-minute window of downtime, between the primary cluster going through a recovery and the secondary cluster going through one, during which we were unable to start or commit any transactions and received many of our “both are locked” errors.

The write throughput of this cluster is relatively low at 2-3 MB/s, but the read traffic is >200 MB/s.

This is the “storage server pending mutations” graph over roughly 30 minutes. You can see we waited until both clusters stabilized before doing the failover, which happened at the beginning of the dip; the dip lasted 2 minutes.

This is a graph showing the two recoveries per cluster that happened over that two-minute period. The left bars are the original primary and the right bars are the secondary cluster.

Is this the expected behavior? It seems like if we wait for the “storage server pending mutations” to match between the two clusters, the failover itself shouldn’t take nearly this long.

Do you happen to know, or have a status from before the switch that shows, what the mutation lag was between the two clusters? A large lag could translate into a long downtime during the switchover. Two minutes of lag between otherwise healthy clusters with a modest workload sounds long, but at the very least knowing the answer might help suggest where to look next.

I have all the trace logs, and I can graph the values for a given field over time for each cluster. For example, here is a diff of TLogMetric.KnownCommittedVersion between the two clusters.

The formula there is b-a where a is the primary and b is the secondary. The switch happened at 9:51:30-9:52 roughly.
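For reference, the per-timestamp diff was computed along these lines (a toy sketch; real trace files are XML/JSON event streams that would first need to be parsed and aligned by approximate timestamp, and `TLogMetrics` is the assumed event name here):

```python
def metric_diff(primary, secondary, field="KnownCommittedVersion"):
    """primary/secondary map an (aligned) timestamp to the parsed
    trace-event fields for that cluster. Returns {timestamp: b - a},
    i.e. secondary minus primary, for timestamps present in both."""
    return {
        t: int(secondary[t][field]) - int(primary[t][field])
        for t in primary.keys() & secondary.keys()
    }
```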

I can also graph different metrics for each cluster or other event types if that helps.

I’m not actually sure which metrics from the trace logs would indicate the DR lag, as I’ve only ever monitored it in status. Possibly it is something logged by the DR agents themselves; do you have logs for those?