Migrating a triple cluster to a three_data_hall cluster without unavailability

Hi there,

After a lengthy development process I can confirm that it is possible to migrate a triple cluster to a three_data_hall cluster without any unavailability.

Let me enumerate the steps:

  1. Create 3 clones of the original FoundationDB cluster object in Kubernetes, thus still with triple replication, and set a different name/processGroupIDPrefix/dataHall/dataCenter for each. These cluster objects must also have skip: true upon creation via kubectl create, so that the operator will not attempt to configure them, and they must have their seed connection string set to the original cluster's connection string.
  2. On these 3 new FoundationDB cluster objects, set the configured state and connection string in their status subresource: `kubectl patch fdb my_new_cluster... --type=merge --subresource status --patch "status: {configured: true, connectionString: \"...\"}"`
  3. Set skip: false again.
  4. Start a lengthy exclude procedure that excludes all the processes of the original cluster; I excluded them in this order: log, storage, coordinator, stateless.
  5. Delete the original cluster once all exclusions are complete.
  6. Set redundancyMode to three_data_hall for the 3 new per-hall FoundationDB cluster objects, one after another.
  7. Patch the seed connection string of two of the 3 per-hall clusters to point to the one that will keep an empty seed connection string, e.g. if you have A, B, C, set the seed connection string of B and C to point to A, and make sure that A has no seed connection string. This step is not crucial but is sometimes practically helpful.
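For reference, step 1 might look like the sketch below. All names, the version, and the connection string are hypothetical placeholders (not values from this thread); the field names follow the FoundationDBCluster CRD of the operator.

```yaml
# Hypothetical clone of the original cluster object for data hall "az1".
# Create one such object per data hall, each with its own name/prefix/hall.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-cluster-az1          # hypothetical name, one per data hall
spec:
  version: 7.1.57               # hypothetical; keep the original cluster's version
  skip: true                    # the operator must not reconcile this object yet (step 1)
  seedConnectionString: "desc:id@10.0.0.1:4501"  # hypothetical; use the original cluster's
  processGroupIDPrefix: az1     # unique per clone
  dataCenter: dc1
  dataHall: az1
  databaseConfiguration:
    redundancyMode: triple      # still triple here; three_data_hall only comes in step 6
```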

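Steps 2, 3, and 6 can be expressed as kubectl one-liners. This is a sketch under the assumption that the clones are named my-cluster-az1/az2/az3 (hypothetical names) and that the connection string below is replaced with the original cluster's:

```shell
# Step 2: mark each clone as already configured and give it the connection
# string of the running cluster (patch on the status subresource).
CS='desc:id@10.0.0.1:4501'   # hypothetical; read it from the original cluster
for hall in az1 az2 az3; do
  kubectl patch fdb "my-cluster-${hall}" --type=merge --subresource status \
    --patch "status: {configured: true, connectionString: \"${CS}\"}"
  # Step 3: let the operator reconcile the clone again.
  kubectl patch fdb "my-cluster-${hall}" --type=merge \
    --patch 'spec: {skip: false}'
done

# Step 6 (only after the original cluster is fully excluded and deleted):
# switch each per-hall object to three_data_hall, one after another.
for hall in az1 az2 az3; do
  kubectl patch fdb "my-cluster-${hall}" --type=merge \
    --patch 'spec: {databaseConfiguration: {redundancyMode: three_data_hall}}'
done
```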
Shall I contribute this to the documentation? I'm not sure whether others would find it useful.


Thought I could share our process: we've been migrating to three_data_hall with the Kubernetes operator in the following way:

  1. Add the second and third foundationdb k8s resources with seedConnectionString set to the connection string of the running FoundationDB cluster. These resources will then join the cluster. Set name/processGroupIDPrefix/dataHall according to the 'data hall'/availability zone.
  2. Scale the original foundationdb k8s resource so that the cluster is the desired size (we scale down to ~1/3). Optionally remove the seedConnectionString of the second and third resources, as it is no longer used.
  3. Update processGroupIDPrefix, dataHall, and nodeSelector (if applicable) of the original foundationdb k8s resource.
  4. Switch redundancyMode to three_data_hall in all three foundationdb k8s resources.

We use the nodeSelector to make sure that pods are in the right availability zone.
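As an illustration, the per-hall resource in this scheme might carry a nodeSelector like the fragment below. The label key is the standard Kubernetes topology label; everything else (names, zone value) is a hypothetical placeholder.

```yaml
# Hypothetical fragment of one of the three FoundationDBCluster resources.
spec:
  processGroupIDPrefix: az1
  dataHall: az1
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: eu-west-1a   # pin this hall's pods to one zone
```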

@simenl do you mind creating a PR with those steps in the operator repo? It would be nice to have them documented in the operator 🙂

Btw, I'm in the process of updating our docs for three_data_hall to make sure we recommend the unified image (assuming that you're able to get node read access); this is the simplest way to run an FDB cluster with three_data_hall in the same Kubernetes cluster (and namespace): Update the three_data_hall docs for the three data hall with unified image by johscheuer · Pull Request #2188 · FoundationDB/fdb-kubernetes-operator · GitHub. Feedback is welcome (I'll open another PR with some additional documentation).

@johscheuer I have posted here the steps I used, which include more checks/changes that were necessary due to various failure modes discovered during the migration.

Shall I create a PR to document those instead?

That would be great!


Opened PR here: docs: document procedure to migrate a live cluster to three-data-hall redundancy by gm42 · Pull Request #2191 · FoundationDB/fdb-kubernetes-operator · GitHub

Feedback welcome!