We deploy our FDB clusters in two regions: region1 consists of DC1 and DC2 (or AZ1 and AZ2), and region2 consists of DC3.
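For reference, this deployment roughly corresponds to a region configuration like the one below. This is a hedged sketch: the datacenter ids, priorities, and the `satellite_redundancy_mode` value are assumptions, following the standard FDB multi-region JSON format that can be loaded with `fileconfigure` in fdbcli.

```json
{
  "regions": [
    {
      "datacenters": [
        { "id": "dc1", "priority": 1 },
        { "id": "dc2", "priority": 0, "satellite": 1 }
      ],
      "satellite_redundancy_mode": "one_satellite_double"
    },
    {
      "datacenters": [
        { "id": "dc3", "priority": 0 }
      ]
    }
  ]
}
```

With DC1 at the highest priority, FDB prefers region1 as primary and fails over to DC3 only when region1 becomes unavailable.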
The company is mandating more frequent OS patching for the underlying physical machines on which our FDB Kubernetes pods run. Currently we use the following approach to accommodate OS patching for large FDB clusters (>100 pods at DC1):
1. Shut down the containers at DC1 (have the pods run a dummy image).
2. FDB fails over to DC3 and runs in single-region/DC mode.
3. Have the Kubernetes team patch the OS on the DC1 nodes in parallel.
4. After patching finishes, bring the DC1 containers back up.
5. FDB restores two-region mode.
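During the drill we can confirm which datacenter is currently acting as primary by inspecting the machine-readable status. A minimal sketch, assuming the `active_primary_dc` field that `status json` in fdbcli reports on recent FDB versions; the helper function and the sample document are hypothetical:

```python
import json

def active_primary_dc(status_json: str) -> str:
    """Return the id of the datacenter currently acting as primary.

    Assumes the `active_primary_dc` field exposed by FDB's
    machine-readable status (`status json` in fdbcli); the field
    name follows the FDB status schema, but verify it against the
    version you run.
    """
    status = json.loads(status_json)
    return status["cluster"].get("active_primary_dc", "")

# Example: a trimmed status document as it might look after failover.
sample = '{"cluster": {"active_primary_dc": "dc3"}}'
print(active_primary_dc(sample))  # -> dc3
```

Polling this field after step 1 tells us when the failover to DC3 has actually completed, and after step 4 it tells us when FDB has moved the primary back.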
Questions regarding the auto failover process:
What happens if, in Step 3 above, some pods in DC1 revive? Will FDB recruit them back into the cluster? Will FDB fail back to DC1 once enough pods have revived (we want to avoid this during Step 3)?
What criteria does FDB use for automatic failover? More than 50% of nodes unhealthy?
In Step 5, FDB will sync data from DC3 back to DC1. What criteria does FDB use to choose between an incremental catch-up and a full refresh? Elapsed time? The amount of retained transaction logs?