We are using fdbdr to fail over between two clusters in order to change the hardware configuration, reducing the cluster from 15 nodes to 9 nodes with a different FDB class topology. Neither the original cluster nor the new cluster is anywhere close to being saturated by the workload.
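For context, the failover sequence we drive is roughly the following. This is only a sketch (a thin Python wrapper around the fdbdr CLI with placeholder cluster-file paths), not our actual tooling; the subcommands and `-s`/`-d` flags follow the standard backup/DR documentation.

```python
import subprocess

# Placeholder cluster files; substitute your own paths.
PRIMARY = "/etc/foundationdb/primary.cluster"
SECONDARY = "/etc/foundationdb/secondary.cluster"

def fdbdr(subcommand):
    """Run an fdbdr subcommand against the source/destination pair."""
    cmd = ["fdbdr", subcommand, "-s", PRIMARY, "-d", SECONDARY]
    subprocess.run(cmd, check=True)

# Start continuous DR from the original (source) cluster to the new
# (destination) cluster.
fdbdr("start")

# ... wait for the destination to catch up (see the convergence check below) ...

# Switch: locks the source, applies the remaining mutations to the
# destination, and reverses the DR direction so the new cluster is writable.
fdbdr("switch")
```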
Our application code is configured to know about a “primary” and “secondary” cluster for a given logical cluster name. If the application code sees the “database is locked” error, it retries on the secondary. If it detects both are locked, it logs an error message stating the client thinks both clusters are locked.
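For reference, the fallback logic is roughly the sketch below, written against the Python bindings. The cluster-file paths and the run/write_value helpers are made up for illustration; the key assumption is that “database is locked” surfaces as FDBError 1038 (database_locked), which the bindings’ retry loop does not retry, so it propagates to the caller.

```python
import fdb

fdb.api_version(630)

# Placeholder cluster files for the "primary" and "secondary" of a logical cluster.
primary_db = fdb.open("/etc/foundationdb/primary.cluster")
secondary_db = fdb.open("/etc/foundationdb/secondary.cluster")

DATABASE_LOCKED = 1038  # FDB error code for "Database is locked"

def run(txn_func):
    """Try the primary first; on database_locked, fall back to the secondary."""
    try:
        return txn_func(primary_db)
    except fdb.FDBError as e:
        if e.code != DATABASE_LOCKED:
            raise
    try:
        return txn_func(secondary_db)
    except fdb.FDBError as e:
        if e.code == DATABASE_LOCKED:
            # Both sides report locked -- the error we log and alert on.
            raise RuntimeError("client thinks both clusters are locked") from e
        raise

@fdb.transactional
def write_value(tr, key=b"hello", value=b"world"):
    tr[key] = value

run(write_value)
```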
We configure the secondary cluster as the DR target only shortly before initiating the failover, but we wait until the “storage server pending mutations” metric matches between the two clusters before switching.
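Roughly, that wait looks like the polling sketch below (not our actual check): it reads each cluster’s machine-readable status from the \xff\xff/status/json special key and sums a per-storage-server input/durable byte difference as a stand-in for the “storage server pending mutations” metric. The exact JSON field paths and the 1 MB tolerance are illustrative assumptions; adjust them to whatever your dashboards actually read.

```python
import json
import time
import fdb

fdb.api_version(630)

primary_db = fdb.open("/etc/foundationdb/primary.cluster")
secondary_db = fdb.open("/etc/foundationdb/secondary.cluster")

def storage_backlog_bytes(db):
    """Approximate total not-yet-durable mutation bytes across storage servers."""
    # Machine-readable cluster status, same data as `status json` in fdbcli.
    status = json.loads(bytes(db[b"\xff\xff/status/json"]))
    total = 0
    for proc in status["cluster"]["processes"].values():
        for role in proc.get("roles", []):
            if role.get("role") == "storage":
                # Assumed field names: bytes received vs. bytes made durable.
                total += (role.get("input_bytes", {}).get("counter", 0)
                          - role.get("durable_bytes", {}).get("counter", 0))
    return total

# Poll until the two clusters' backlogs agree to within ~1 MB before switching.
while abs(storage_backlog_bytes(primary_db) - storage_backlog_bytes(secondary_db)) > 1_000_000:
    time.sleep(5)
```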
There was a 2-minute period of downtime, between the primary cluster going through a recovery and the secondary cluster going through a recovery, during which we were unable to start or commit any transactions and received many of our “both are locked” errors.
The write throughput of this cluster is relatively low at 2-3 MB/s, but the read traffic is >200 MB/s.
This is the “storage server pending mutations” graph over roughly 30 minutes. You can see that we waited until both clusters had stabilized before doing the failover, which happened at the beginning of the dip that lasted 2 minutes.
This graph shows the two recoveries per cluster that happened over that two-minute period. The left bars are the original primary and the right bars are the secondary cluster.
Is this the expected behavior? It seems like if we wait for the “storage server pending mutations” to match between the two clusters, the failover itself shouldn’t take nearly this long.