FoundationDB cluster became unavailable after shutting down one AZ

Hello, I have multiple FDB clusters, and some of them have encountered this issue. Any help would be appreciated :slight_smile:

The cluster runs across 3 availability zones (AZs) with 5 machines in each AZ, totaling 15 machines. It uses the three_data_hall replication mode with 5 coordinators, and the version is 6.2.
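For context, the setup looks roughly like this (the data_hall values and the conf snippet are an illustrative sketch, not my exact files): each fdbserver process advertises the AZ it runs in as its data hall,

```
# foundationdb.conf on a machine in az-a; az-b and az-c machines use their own names
[fdbserver]
locality_data_hall = az-a
```

and the cluster was switched to the three_data_hall redundancy mode once via fdbcli:

```
fdbcli --exec "configure three_data_hall"
```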

Recently, due to AZ maintenance, the network in each AZ was disconnected sequentially for 1-2 days. When az-a was disconnected, everything seemed to be okay. However, after az-b was disconnected, the state of some clusters changed to "unavailable". When I ran fdbcli --exec "status", it responded after a long delay with "Initializing new transaction servers and recovering transaction logs", and this message kept appearing until az-b was connected again.
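This is roughly how I polled the recovery state during the outage (the jq paths are from memory and may be slightly off):

```
# plain "status" was very slow, so I dumped the machine-readable form instead
fdbcli --exec "status json" > status.json

# which recovery phase the cluster is stuck in
jq '.cluster.recovery_state | {name, description}' status.json
```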

According to the FDB documentation, the cluster controller is supposed to detect such node failures and reassign the necessary roles to other processes. (Architecture — FoundationDB 7.1) My expectation was that, since the cluster controller is located in az-c, it would detect the node failures in az-b, reassign the roles that were on az-b processes (the master, ratekeeper, and data_distributor were actually on az-b) to az-a and az-c, and then the master would trigger a recovery.
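For what it's worth, here is roughly how I checked which roles were running where, using the same status.json dump as above (field names are from memory, so treat this as a sketch):

```
# print each process address together with the roles it currently holds
jq -r '.cluster.processes[] | .address + " " + ([.roles[]?.role] | join(","))' status.json
```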

Some clusters seem to have behaved as expected: when checking the status json from fdbcli on the working clusters, I noticed that no roles were assigned to the processes in az-b. However, in the clusters that became "unavailable", the status json still showed roles assigned to the unreachable processes in az-b, and several machines, including the az-b nodes, were marked as "degraded". Also, inspecting the cluster controller's log, I found a number of "ConnectionClosed" and "PeerDestroy" messages.
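In case it helps, this is roughly how I pulled those events out of the cluster controller's trace logs (the log path is the default package location on my machines; adjust as needed):

```
# count ConnectionClosed / PeerDestroy events across the trace files
grep -hoE 'Type="(ConnectionClosed|PeerDestroy)"' /var/log/foundationdb/trace.*.xml \
  | sort | uniq -c
```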

Eventually, when the network in az-b became reachable again, the issue was resolved. All cluster statuses changed to “healthy,” and no machines were marked as “degraded” anymore.

However, I'm curious why some of the clusters stayed unavailable for the maintenance period. As far as I know, three_data_hall mode can tolerate a single zone failure.
Here are two status.json files from one of those clusters, taken while it was unavailable and after it became healthy again. I'm also including the cluster controller's log.
Thanks in advance.
(Please note that the IP range 10.130.x.x corresponds to az-a, 10.134.x.x to az-b, and 10.138.x.x to az-c nodes.)