We patch the OS on the host machines of our FDB cluster, which runs in Kubernetes, every month. The cluster is configured to run in three datacenters, following the asymmetric configuration recommended in the FDB architecture document (Configuration — FoundationDB 6.3). Let’s call the three datacenters DC1, DC2 and DC3. DC1 is the active region for the FDB cluster, DC3 is the standby region, and DC2 hosts the transaction log servers.
To reduce the disturbance that host maintenance causes to the cluster, we force the FDB cluster to run in single-datacenter mode (say, in DC1) while the other datacenter (DC3) is shut down for maintenance.
We also deploy our service nodes in two datacenters (DC1 and DC3), matching the datacenters that host the storage nodes. Our service nodes talk to the FDB cluster via the FDB Java client. During the maintenance window, we would like our service nodes to detect automatically: (1) whether the FDB cluster is running in single-datacenter mode and (2) in which datacenter (DC1 or DC3).
Having checked the system keys (foundationdb/special-key-space.md at master · apple/foundationdb · GitHub), it seems that retrieving the cluster status via “status json” is the only way to get the information we need. The other candidate key, “\xff/primaryDatacenter”, only tells us whether the primary datacenter is DC1 or DC3.
But “status json” returns all cluster information in bulk; there is currently no way to query only a selected path. In one of our largest FDB clusters, the “status json” output is about 6 MB. We have about 100 service nodes. If all of them issue “status json” queries within a tight window, they will put too much pressure on the FDB server that answers the query.
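To put rough numbers on this concern, here is a minimal Java sketch of the arithmetic. The class and method names are hypothetical, and the budget of 5 queries per minute is only an example figure, not a limit documented by FDB. It computes how often each service node could poll so that the whole fleet stays within a cluster-wide budget, plus a random initial delay so the nodes do not all poll at the same moment.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: back-of-the-envelope planning for "status json" polling. */
final class StatusPollPlanner {

    /** Interval each node waits between polls so the fleet stays within the cluster-wide budget. */
    static Duration perNodeInterval(int serviceNodes, int clusterQueriesPerMinute) {
        if (serviceNodes <= 0 || clusterQueriesPerMinute <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // nodes / (queries per minute) = minutes between polls for each node
        long millis = 60_000L * serviceNodes / clusterQueriesPerMinute;
        return Duration.ofMillis(millis);
    }

    /** Status data returned per minute across the fleet, in megabytes. */
    static double megabytesPerMinute(int clusterQueriesPerMinute, double statusSizeMb) {
        return clusterQueriesPerMinute * statusSizeMb;
    }

    /** Random initial delay in [0, interval) so that nodes started together spread out. */
    static Duration initialJitter(Duration interval) {
        long millis = interval.toMillis();
        if (millis <= 0) {
            return Duration.ZERO;
        }
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(millis));
    }

    public static void main(String[] args) {
        int nodes = 100;          // service nodes, from the post
        double statusSizeMb = 6;  // approximate "status json" size, from the post
        int budget = 5;           // assumed cluster-wide queries per minute (example only)

        Duration interval = perNodeInterval(nodes, budget);
        System.out.println("per-node poll interval (min): " + interval.toMinutes());
        System.out.println("fleet transfer (MB/min): " + megabytesPerMinute(budget, statusSizeMb));
        System.out.println("burst if all nodes poll at once (MB): " + nodes * statusSizeMb);
        System.out.println("first poll delay for this node (ms): " + initialJitter(interval).toMillis());
    }
}
```

With 100 nodes and a budget of 5 queries per minute, each node would poll every 20 minutes and the fleet would pull about 30 MB per minute, compared with roughly 600 MB in a single burst if every node queried at once. Whether a 20-minute detection delay is acceptable during a maintenance window is a separate question.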
My questions related to “status json” are:
(1) Is “status json” the only way currently supported by FDB to answer whether FDB is in a single-datacenter mode and which datacenter is active?
(2) In the FDB cluster, is the “status json” query answered only by the cluster controller or the master, meaning that we must avoid putting too much load on that single server?
(3) What is an acceptable “status json” load to put on the FDB cluster? For example, 5 queries per minute, each from a different service node? Is the “status json” result cached on the FDB server side to improve throughput?
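Independently of any server-side caching, each service node could cache the result locally so that multiple threads in one node share a single query. The following is a minimal sketch of that idea, not something the FDB Java client provides: `CachedStatus`, the `fetcher` supplier (which in practice would issue the real “status json” read) and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical client-side cache: at most one status fetch per TTL window per node. */
final class CachedStatus {
    private final Supplier<String> fetcher; // assumption: wraps the actual "status json" query
    private final long ttlMillis;
    private final LongSupplier clock;       // injected so the cache can be tested without sleeping
    private String cached;
    private long fetchedAt;

    CachedStatus(Supplier<String> fetcher, long ttlMillis, LongSupplier clock) {
        this.fetcher = fetcher;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached status, refetching only when nothing is cached or the TTL has elapsed. */
    synchronized String get() {
        long now = clock.getAsLong();
        if (cached == null || now - fetchedAt >= ttlMillis) {
            cached = fetcher.get();
            fetchedAt = now;
        }
        return cached;
    }
}
```

A node would construct it once, for example `new CachedStatus(fetchStatus, 60_000, System::currentTimeMillis)` where `fetchStatus` is the node's own query function, and have every caller go through `get()`. The trade-off is staleness: a longer TTL lowers the load on the cluster but delays detection of a datacenter switch.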