Yesterday we ran a failover and fallback test of FDB. The failover went fairly smoothly, but the fallback not so much.
We had a setup with 2 FDB regions, each with a main DC (which has the whole set of roles/processes: storage, log, coordinators & stateless) and a satellite.
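For context, a minimal sketch of what such a two-region configuration looks like (the DC ids, priorities and satellite redundancy mode below are made up, not our actual values); the resulting JSON is what fdbcli’s fileconfigure command takes:

```python
import json

# Sketch of a two-region FDB configuration: each region has a main DC
# (full set of roles) plus a satellite that only hosts transaction logs.
# The DC ids, priorities and satellite redundancy mode are illustrative.
regions = {
    "regions": [
        {
            "datacenters": [
                {"id": "dc1", "priority": 1},                      # main, highest priority
                {"id": "dc1-sat", "priority": 1, "satellite": 1},  # satellite of dc1
            ],
            "satellite_redundancy_mode": "one_satellite_double",
        },
        {
            "datacenters": [
                {"id": "dc2", "priority": 0},                      # main, lower priority
                {"id": "dc2-sat", "priority": 1, "satellite": 1},  # satellite of dc2
            ],
            "satellite_redundancy_mode": "one_satellite_double",
        },
    ]
}

# Applied with: fdbcli --exec "fileconfigure regions.json"
with open("regions.json", "w") as f:
    json.dump(regions, f, indent=2)
```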
In the region with the highest priority we disconnected the main completely, causing FDB to automatically fail over to the second region.
Around 3:31PM PST we re-enabled the main in the high-priority region. For roughly 10 minutes the cluster was stable, with all stateless roles still in the lower-priority region (i.e. cluster controller, master, data distributor, GRV proxies, commit proxies, …), and then around 3:45PM the roles switched back to the main region, the data distributor realized there was 60GB of data to move, and the DB became unavailable for ~45 minutes.
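For reference, the data-movement backlog the data distributor was chewing through can be watched from status json; here is a rough sketch using the Python bindings (the field names are my assumption about the status schema, and the API version should match the installed client):

```python
import json
import fdb

fdb.api_version(710)  # should match the installed client version
db = fdb.open()       # default cluster file

# The special key \xff\xff/status/json returns the same document as
# `status json` in fdbcli; pull the data-distribution backlog out of it.
status = json.loads(db[b'\xff\xff/status/json'])
cluster = status["cluster"]

moving = cluster.get("data", {}).get("moving_data", {})
print("in-flight bytes:", moving.get("in_flight_bytes"))
print("in-queue bytes:", moving.get("in_queue_bytes"))
print("database available:", cluster.get("database_available"))
```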
When I look at the logs it’s not quite clear what really triggered the database to become unavailable; filtering for severity >= 20, the unavailability seems somewhat aligned with events like
{"date":1747867494.564181,"ID":"0000000000000000","Error":"transaction_timed_out","Roles":"CC","ErrorDescription":"Operation aborted because the transaction timed out","ErrorCode":"1031","Severity":"20","Machine":"10.129.2.168:4500","DateTime":"2025-05-21T22:44:54Z","Type":"LayerStatusError"}
I don’t have a final opinion on why the DB was unavailable for 45 minutes, but looking at the severity > 10 logs I have the feeling that the storage servers in the main high-priority region were missing some critical data, which prevented the DB from becoming available in that region.
The corollary question is: should FDB fall back to the primary region even if that region is not ready?
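Until we have an answer, one possible mitigation, sketched here under the assumption that region selection is driven by the configured priorities and that the status JSON exposes the fields below, would be to gate any fallback on the repaired region actually being caught up before raising its priority back via fileconfigure:

```python
import json
import fdb

fdb.api_version(710)  # should match the installed client version
db = fdb.open()

# Hypothetical gate before raising the repaired region's priority back
# with `fileconfigure`: only fall back once the data state is healthy and
# the datacenter lag has drained. Field names are assumptions about the
# status JSON schema and the threshold is arbitrary.
status = json.loads(db[b'\xff\xff/status/json'])
cluster = status["cluster"]

lag_seconds = cluster.get("datacenter_lag", {}).get("seconds")
healthy = cluster.get("data", {}).get("state", {}).get("healthy")

if healthy and lag_seconds is not None and lag_seconds < 5:
    print("Primary region looks caught up; OK to raise its priority.")
else:
    print(f"Hold fallback: healthy={healthy}, datacenter lag={lag_seconds}s")
```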