What should a data hall failure look like in three_data_hall mode (AWS)?

:wave: We’ve been running FoundationDB 7.1.X in production in AWS in three_data_hall mode, treating each AWS availability zone as a data hall (locality-data-hall=<aws az> in the config passed to fdbmonitor).
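For context, each fdbserver process picks up its data hall from its fdbmonitor stanza. A minimal sketch of that layout is below; the port, paths and AZ value are illustrative rather than our exact config, and the redundancy mode itself is set separately through fdbcli with configure three_data_hall.

```ini
# /etc/foundationdb/foundationdb.conf (excerpt, illustrative values)
[fdbserver]
command = /usr/sbin/fdbserver
cluster_file = /etc/foundationdb/fdb.cluster
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb

[fdbserver.4500]
class = storage
# the AWS availability zone this node lives in
locality_data_hall = eu-west-1a
```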

Our topology is:

| Class       | Nodes per AZ | Processes per node | Total processes per cluster |
| ----------- | ------------ | ------------------ | --------------------------- |
| Coordinator | 3            | 1                  | 9                           |
| Stateless   | 1            | 2                  | 6                           |
| Log         | 3            | 1                  | 9                           |
| Storage     | 2            | 1                  | 6                           |

On the whole we’re satisfied with the performance and manageability of the cluster. We practice chaos engineering, and when we kill a single node we never see anything more than a minor self-repair, so all good there.

We’ve expanded our chaos engineering to terminating entire availability zones. We do this by isolating the subnets our instances use (rather than terminating the instances), which lets us keep them offline for longer than a straight terminate would.
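For anyone wanting to reproduce this: one way to isolate a subnet like that is to swap its network ACL for one with no allow rules. A simplified boto3 sketch of the idea (the subnet and VPC IDs are placeholders, and this is not necessarily our exact tooling):

```python
import boto3

ec2 = boto3.client("ec2")

SUBNET_ID = "subnet-0123456789abcdef0"  # placeholder: the subnet for the AZ under test
VPC_ID = "vpc-0123456789abcdef0"        # placeholder: the cluster's VPC

# A network ACL with no allow rules denies all traffic in and out, which
# effectively takes every instance in the subnet off the network.
deny_all_acl = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

# Find the subnet's current NACL association so it can be swapped (and later restored).
acls = ec2.describe_network_acls(
    Filters=[{"Name": "association.subnet-id", "Values": [SUBNET_ID]}]
)["NetworkAcls"]
association_id = next(
    assoc["NetworkAclAssociationId"]
    for acl in acls
    for assoc in acl["Associations"]
    if assoc["SubnetId"] == SUBNET_ID
)

# Swap in the deny-all ACL; keep the returned association id so the original
# ACL can be put back when the experiment ends.
new_association = ec2.replace_network_acl_association(
    AssociationId=association_id, NetworkAclId=deny_all_acl
)["NewAssociationId"]
print("subnet isolated; undo using association", new_association)
```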

When we terminate an AZ the cluster goes through a recovery and continues to operate, accepting commits and responding with little degradation from the clients’ perspective.
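A minimal commit probe like the one below, run from a client during the experiment, is enough to see this (the key name is purely illustrative):

```python
import time
import fdb

fdb.api_version(710)
db = fdb.open()  # default cluster file

@fdb.transactional
def heartbeat(tr):
    # "chaos/heartbeat" is just an illustrative key, not anything FDB defines
    tr[b"chaos/heartbeat"] = str(time.time()).encode()

while True:
    start = time.monotonic()
    heartbeat(db)  # retries internally until the commit succeeds
    print(f"commit ok in {time.monotonic() - start:.3f}s")
    time.sleep(1)
```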

However, I’m not overly confident in my interpretation of the data presented in status details or status json.

After terminating the availability zone we observe the following:

  • .cluster.data.moving_data.in_flight_bytes and in_queue_bytes jump immediately as the cluster looks to redistribute data between the remaining nodes (presumably to restore the 3rd copy), but they never decrease (how could they? there is no longer a storage node that could satisfy the 3rd-replica placement constraint). See the sketch after this list for how we pull these fields.
  • .cluster.logs has six entries, presumably one for each time the log processes were re-recruited after a topology change; only the oldest entry has log_interfaces showing healthy: false.
  • The log-class nodes each show multiple log roles assigned to them, reflecting the different .cluster.logs entries.
  • The log-class nodes also now each have a storage role present, but those roles’ KV storage values never increase.
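For reference, the fields above can be pulled with something along these lines; a minimal sketch that shells out to fdbcli, not our actual tooling:

```python
import json
import subprocess

# Same JSON as `status json` inside fdbcli; assumes fdbcli is on PATH and the
# default cluster file points at the cluster.
raw = subprocess.run(
    ["fdbcli", "--exec", "status json"],
    capture_output=True, check=True, text=True,
).stdout
status = json.loads(raw)

moving = status["cluster"]["data"]["moving_data"]
print("in_flight_bytes:", moving["in_flight_bytes"])
print("in_queue_bytes:", moving["in_queue_bytes"])

# One entry per log generation still being tracked; only the newest (current)
# generation should have all of its interfaces reachable.
for generation in status["cluster"]["logs"]:
    interfaces = generation.get("log_interfaces", [])
    all_healthy = all(i.get("healthy", False) for i in interfaces)
    print(
        "epoch", generation.get("epoch"),
        "current:", generation.get("current"),
        "all interfaces healthy:", all_healthy,
    )
```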

It feels like FDB is trying to fix something it can’t. The good news is that once the isolated AZ comes back, the cluster performs a recovery and everything goes back to normal, with some minor redistribution of data.

We do see some “interesting” log messages, such as:
{ "Status": "Critical", "evt": { "name": "BestTeamStuck" }, "TeamCollectionId": "1", "Severity": "30", "Time": "1694511836.066214", "Roles": "CP,DD,MS,RK", "StuckCount": "566", "DateTime": "2023-09-12T09:43:56Z", "Machine": "10.1.245.254:4500", "Type": "BestTeamStuck", "SuppressedEventCount": "15", "AnyDestOverloaded": "0", "ThreadID": "9660007800948023369", "LogGroup": "main-dev", "machine": { "port": "4500", "ip": "10.1.245.254" }, "ID": "ca961d9503542afc", "NumOfTeamCollections": "1", "DestOverloadedCount": "0" }

And…

{ "Status": "Critical", "evt": { "name": "BuildTeamsNotEnoughUniqueMachines" }, "Severity": "30", "Time": "1694511835.592940", "Roles": "DD,MS,RK", "DateTime": "2023-09-12T09:43:55Z", "Machine": "10.1.246.254:4501", "Type": "BuildTeamsNotEnoughUniqueMachines", "ThreadID": "4135398715541380937", "LogGroup": "main-dev", "machine": { "port": "4501", "ip": "10.1.246.254" }, "Primary": "1", "Replication": "3", "ID": "f28381eec404177b", "UniqueMachines": "0" }

I guess the short question is: is this all expected? Obviously losing a third of the cluster is a critical event that needs attention, but I’m unsure how to determine whether the cluster is stable in this condition.

I hadn’t expected the log generations to keep building up every time there was a topology change; is that expected because the original logs still hold data queued for storage nodes that are unavailable?

Is there a realistic maximum amount of time the cluster can sit like this?