We’ve been running FoundationDB 7.1.X in production on AWS in `three_data_hall` mode, treating each AWS availability zone as a data hall (`locality-data-hall=<aws az>` in the config passed to fdbmonitor).
Our topology is:
| Class | Nodes per AZ | Processes per Node | Total Processes per Cluster |
|---|---|---|---|
| Coordinator | 3 | 1 | 9 |
| Stateless | 1 | 2 | 6 |
| Log | 3 | 1 | 9 |
| Storage | 2 | 1 | 6 |
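For reference, each node’s foundationdb.conf follows the stock layout, with the process class and the data hall locality set per node. A trimmed sketch of one storage node in a single AZ (the AZ name, port, and paths here are illustrative; `locality-data-hall` is the key mentioned above):

```ini
## Illustrative fragment only - AZ name, port and paths are placeholders.
[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# Each process advertises its AWS availability zone as its data hall.
locality-data-hall = us-east-1a

[fdbserver.4500]
class = storage
```

The redundancy mode itself is set once via fdbcli (`configure three_data_hall`).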
On the whole we’re satisfied with the performance and manageability of the cluster. We practice chaos engineering, and when killing a single node we never see anything more than a minor self-repair - all good.
We’ve expanded our chaos engineering to terminating entire availability zones. We do this by isolating the subnets our instances use (rather than terminating the instances), which lets us keep the instances offline for longer than a straight terminate would.
When we take an AZ offline the cluster enters recovery and continues to operate, accepting commits and responding with little degradation from the clients’ perspective.
However, I’m not overly confident in my interpretation of the data presented in `status details` or `status json`.
After terminating the availability zone we observe the following:
- `.cluster.data.moving_data.in_flight_bytes` and `in_queue_bytes` jump immediately, looking to redistribute the data between the remaining nodes (presumably to restore the 3rd copy) - however they never decrease (how could they - there isn’t a storage node that could satisfy the 3rd-copy placement constraint).
- `.cluster.logs` has six entries, presumably one for each iteration of re-determining the log processes after a topology change - only the oldest entry has `log_interfaces`, which show as `healthy: false`.
- The nodes of the `log` class show multiple `log` roles assigned to each of them, reflecting the different `.cluster.logs` entries.
- The `log` class nodes also now have a `storage` role present on each, but that role’s KV storage values never increase.
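For completeness, this is roughly how we watch those fields during the test rather than eyeballing `status json` - a sketch using the Python bindings; the `\xff\xff/status/json` special key returns the same document as `status json`, and the field paths are the ones referenced above:

```python
import json
import fdb

fdb.api_version(710)
db = fdb.open()  # default cluster file

# The machine-readable status is also exposed as a special key in the \xff\xff keyspace.
status = json.loads(bytes(db[b'\xff\xff/status/json']))

moving = status['cluster']['data']['moving_data']
print('in_flight_bytes:', moving['in_flight_bytes'])
print('in_queue_bytes:', moving['in_queue_bytes'])

# One entry per log system generation, as described above.
for n, generation in enumerate(status['cluster'].get('logs', [])):
    interfaces = generation.get('log_interfaces', [])
    unhealthy = [i for i in interfaces if not i.get('healthy', False)]
    print(f'log generation {n}: {len(interfaces)} interfaces, '
          f'{len(unhealthy)} unhealthy')
```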
It feels like FDB is trying to fix something it can’t. The good news is that once the isolated AZ comes back, the cluster performs a recovery and everything goes back to normal, with some minor redistribution of data.
We do see some “interesting” log messages, such as:
{ "Status": "Critical", "evt": { "name": "BestTeamStuck" }, "TeamCollectionId": "1", "Severity": "30", "Time": "1694511836.066214", "Roles": "CP,DD,MS,RK", "StuckCount": "566", "DateTime": "2023-09-12T09:43:56Z", "Machine": "10.1.245.254:4500", "Type": "BestTeamStuck", "SuppressedEventCount": "15", "AnyDestOverloaded": "0", "ThreadID": "9660007800948023369", "LogGroup": "main-dev", "machine": { "port": "4500", "ip": "10.1.245.254" }, "ID": "ca961d9503542afc", "NumOfTeamCollections": "1", "DestOverloadedCount": "0" }
And…
{ "Status": "Critical", "evt": { "name": "BuildTeamsNotEnoughUniqueMachines" }, "Severity": "30", "Time": "1694511835.592940", "Roles": "DD,MS,RK", "DateTime": "2023-09-12T09:43:55Z", "Machine": "10.1.246.254:4501", "Type": "BuildTeamsNotEnoughUniqueMachines", "ThreadID": "4135398715541380937", "LogGroup": "main-dev", "machine": { "port": "4501", "ip": "10.1.246.254" }, "Primary": "1", "Replication": "3", "ID": "f28381eec404177b", "UniqueMachines": "0" }
I guess the short question is: is this all expected? Obviously losing a third of the cluster is a critical event that needs attention, but I’m unsure how to determine whether the cluster is stable in this condition.
I hadn’t expected the generations of logs to keep building every time there was a topology change - though is this expected because the original logs still have data queued for storage nodes that are unavailable?
Is there a realistic maximum amount of time the cluster can sit like this?