Yesterday we ran a failover and fallback test of FDB. The failover went fairly smoothly, but the fallback not so much.
We had a setup with 2 FDB regions, each with a main DC (which has the whole set of roles/processes: storage, log, coordinators & stateless) and a satellite.
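For context, a minimal sketch of what such a two-region configuration looks like (the DC ids, priorities and satellite redundancy mode below are made up, not our actual values); the resulting JSON is what fdbcli’s fileconfigure command takes:

```python
import json

# Sketch of a two-region FDB configuration: each region has a main DC
# (full set of roles) plus a satellite that only hosts transaction logs.
# The DC ids, priorities and satellite redundancy mode are illustrative.
regions = {
    "regions": [
        {
            "datacenters": [
                {"id": "dc1", "priority": 1},                      # main, highest priority
                {"id": "dc1-sat", "priority": 1, "satellite": 1},  # satellite of dc1
            ],
            "satellite_redundancy_mode": "one_satellite_double",
        },
        {
            "datacenters": [
                {"id": "dc2", "priority": 0},                      # main, lower priority
                {"id": "dc2-sat", "priority": 1, "satellite": 1},  # satellite of dc2
            ],
            "satellite_redundancy_mode": "one_satellite_double",
        },
    ]
}

# Applied with: fdbcli --exec "fileconfigure regions.json"
with open("regions.json", "w") as f:
    json.dump(regions, f, indent=2)
```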
In the region with the highest priority we disconnected the main completely, causing FDB to automatically fail over to the second region.
Around 3:31PM PST we re-enabled the main in the high-priority region. For roughly 10 minutes the cluster was stable, with all stateless roles still in the lower-priority region (i.e. cluster controller, master, data distributor, GRV proxies, commit proxies, …), and then around 3:45PM the roles switched back to the main region, the data distributor realized there was 60GB of data to move, and the DB became unavailable for ~45 minutes.
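For reference, the data-movement backlog the data distributor was chewing through can be watched from status json; here is a rough sketch using the Python bindings (the field names are my assumption about the status schema, and the API version should match the installed client):

```python
import json
import fdb

fdb.api_version(710)  # should match the installed client version
db = fdb.open()       # default cluster file

# The special key \xff\xff/status/json returns the same document as
# `status json` in fdbcli; pull the data-distribution backlog out of it.
status = json.loads(db[b'\xff\xff/status/json'])
cluster = status["cluster"]

moving = cluster.get("data", {}).get("moving_data", {})
print("in-flight bytes:", moving.get("in_flight_bytes"))
print("in-queue bytes:", moving.get("in_queue_bytes"))
print("database available:", cluster.get("database_available"))
```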
When I look at the logs it’s not quite clear what really triggered the database to become unavailable; filtering for severity >= 20, the unavailability seems somewhat aligned with events like
{"date":1747867494.564181,"ID":"0000000000000000","Error":"transaction_timed_out","Roles":"CC","ErrorDescription":"Operation aborted because the transaction timed out","ErrorCode":"1031","Severity":"20","Machine":"10.129.2.168:4500","DateTime":"2025-05-21T22:44:54Z","Type":"LayerStatusError"}
I don’t have a final opinion on why the DB was unavailable for 45 minutes, but looking at the severity > 10 logs I have the feeling that the storage servers in the main high-priority region were missing some critical data, which prevented the DB from becoming available in that region.
The corollary question is: should FDB fall back to the primary region even if that region is not ready?
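Until we have an answer, one possible mitigation, sketched here under the assumption that region selection is driven by the configured priorities and that the status JSON exposes the fields below, would be to gate any fallback on the repaired region actually being caught up before raising its priority back via fileconfigure:

```python
import json
import fdb

fdb.api_version(710)  # should match the installed client version
db = fdb.open()

# Hypothetical gate before raising the repaired region's priority back
# with `fileconfigure`: only fall back once the data state is healthy and
# the datacenter lag has drained. Field names are assumptions about the
# status JSON schema and the threshold is arbitrary.
status = json.loads(db[b'\xff\xff/status/json'])
cluster = status["cluster"]

lag_seconds = cluster.get("datacenter_lag", {}).get("seconds")
healthy = cluster.get("data", {}).get("state", {}).get("healthy")

if healthy and lag_seconds is not None and lag_seconds < 5:
    print("Primary region looks caught up; OK to raise its priority.")
else:
    print(f"Hold fallback: healthy={healthy}, datacenter lag={lag_seconds}s")
```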