We’ve been running FoundationDB 7.1.X in production on AWS in `three_data_hall` mode, treating each AWS availability zone as a data hall (`locality-data-hall=<aws az>` in the config passed to fdbmonitor).
Our topology is:
| Class | Nodes per AZ | Processes per Node | Total Processes per Cluster |
|---|---|---|---|
| Coordinator | 3 | 1 | 9 |
| Stateless | 1 | 2 | 6 |
| Log | 3 | 1 | 9 |
| Storage | 2 | 1 | 6 |
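For reference, each node runs fdbmonitor with a config along these lines. This is a trimmed-down sketch rather than our actual file, and the option spellings, ports, and paths are illustrative (the redundancy mode itself is set separately via fdbcli’s `configure three_data_hall`):

```ini
# /etc/foundationdb/foundationdb.conf on a log-class node in one AZ (illustrative)
[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# Every process on this node advertises the AZ it runs in as its data hall
locality-data-hall = us-east-1a

# One section per process; the section suffix is the port the process listens on
[fdbserver.4500]
class = log
```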
On the whole we’re satisfied with the performance and manageability of the cluster. We practice chaos engineering, and when we kill a single node we never see anything more than a minor self-repair - all good.
We’ve since expanded our chaos engineering to terminate entire availability zones. We do this by isolating the subnets our instances use (rather than terminating the instances), which lets us keep them offline longer than a straight terminate would.
When we terminate an AZ, the cluster enters recovery and continues to operate, accepting commits and responding with little degradation from the clients’ perspective.
However, I’m not overly confident in my interpretation of the data presented in `status details` or `status json` (the jq sketch after the list below shows how I’m pulling these fields out).
After terminating the availability zone we observe the following:
- The `.cluster.data.moving_data.in_flight_bytes` and `in_queue_bytes` values jump immediately, as the cluster looks to redistribute data between the remaining nodes (presumably to restore the 3rd copy) - however they never decrease (how could they - there isn’t a storage node left that could satisfy the 3rd data hall constraint).
- `.cluster.logs` has six entries, presumably one for each iteration of re-determining the log processes after each topology change - only the oldest entry has `log_interfaces` which show as `healthy: false`.
- The nodes of the `log` class show multiple `log` roles assigned to each of them, reflecting the different `.cluster.logs` entries.
- The `log` class nodes also now have a `storage` role present on each, but those roles’ KV storage values never increase.
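For what it’s worth, this is roughly how I’m pulling those fields out of `status json` (a sketch, assuming `jq` is on the box and fdbcli can see the cluster file; the paths are the ones quoted above):

```bash
# Snapshot the machine-readable status once so the queries below see the same data.
fdbcli --exec 'status json' > status.json

# Data movement counters that never drain while the AZ is down:
jq '.cluster.data.moving_data | {in_flight_bytes, in_queue_bytes}' status.json

# One entry per log generation, with the health of each log interface:
jq '[.cluster.logs[] | {log_interfaces: [.log_interfaces[]? | {id, healthy}]}]' status.json
```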
It feels like FDB is trying to fix something it can’t. The good news is that once the isolated AZ comes back, the cluster performs a recovery and everything goes back to normal - with some minor redistribution of data.
We do see some “interesting” log messages, such as:
{ "Status": "Critical", "evt": { "name": "BestTeamStuck" }, "TeamCollectionId": "1", "Severity": "30", "Time": "1694511836.066214", "Roles": "CP,DD,MS,RK", "StuckCount": "566", "DateTime": "2023-09-12T09:43:56Z", "Machine": "10.1.245.254:4500", "Type": "BestTeamStuck", "SuppressedEventCount": "15", "AnyDestOverloaded": "0", "ThreadID": "9660007800948023369", "LogGroup": "main-dev", "machine": { "port": "4500", "ip": "10.1.245.254" }, "ID": "ca961d9503542afc", "NumOfTeamCollections": "1", "DestOverloadedCount": "0" }
And…
{ "Status": "Critical", "evt": { "name": "BuildTeamsNotEnoughUniqueMachines" }, "Severity": "30", "Time": "1694511835.592940", "Roles": "DD,MS,RK", "DateTime": "2023-09-12T09:43:55Z", "Machine": "10.1.246.254:4501", "Type": "BuildTeamsNotEnoughUniqueMachines", "ThreadID": "4135398715541380937", "LogGroup": "main-dev", "machine": { "port": "4501", "ip": "10.1.246.254" }, "Primary": "1", "Replication": "3", "ID": "f28381eec404177b", "UniqueMachines": "0" }
I guess the short question is - is this all expected? Obviously a failure of 1/3rd of a cluster is a critical event and needs attention, but I’m unsure how to determine if the cluster is stable in this condition.
I hadn’t expected the generations of logs to keep building every time there was a topology change - though is this expected because the original logs still have data queued for storage nodes that are unavailable?
Is there a realistic maximum amount of time the cluster can sit like this?