Migrating a triple cluster to a three_data_hall cluster without unavailability

Hi there,

After a lengthy development process I can confirm that it is possible to migrate a triple cluster to a three_data_hall cluster without any unavailability.

Let me enumerate the steps:

  1. Create 3 clones of the original FoundationDB cluster object in Kubernetes, thus still with triple replication, and set a different name/processGroupIDPrefix/dataHall/dataCenter for each. These cluster objects must also have skip: true upon creation via kubectl create, so that the operator will not attempt to configure them, and they must have their seed connection string set to the original cluster's connection string.
  2. On these 3 new FoundationDB cluster objects, set the configured state and connection string in their status subresource: `kubectl patch fdb my_new_cluster... --type=merge --subresource status --patch "status: {configured: true, connectionString: \"...\"}"`
  3. Set skip: false again.
  4. Start a lengthy exclude procedure that excludes all the processes of the original cluster; I excluded them in this order: log, storage, coordinator, stateless.
  5. Delete the original cluster once all exclusions are complete.
  6. Set redundancyMode to three_data_hall for the 3 new per-hall FoundationDB cluster objects, one after another.
  7. Patch the seed connection string of two of the 3 per-hall clusters to point to the one that will keep an empty seed connection string, e.g. if you have A, B, C, set the seed connection string of B and C to point to A, and make sure that A has no seed connection string. This step is not crucial but is sometimes practically helpful.
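For reference, step 1 might look like the sketch below. All names, the version, and the connection string are hypothetical placeholders (not values from this thread); the field names follow the FoundationDBCluster CRD of the operator.

```yaml
# Hypothetical clone of the original cluster object for data hall "az1".
# Create one such object per data hall, each with its own name/prefix/hall.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-cluster-az1          # hypothetical name, one per data hall
spec:
  version: 7.1.57               # hypothetical; keep the original cluster's version
  skip: true                    # the operator must not reconcile this object yet (step 1)
  seedConnectionString: "desc:id@10.0.0.1:4501"  # hypothetical; use the original cluster's
  processGroupIDPrefix: az1     # unique per clone
  dataCenter: dc1
  dataHall: az1
  databaseConfiguration:
    redundancyMode: triple      # still triple here; three_data_hall only comes in step 6
```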

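Steps 2, 3, and 6 can be expressed as kubectl one-liners. This is a sketch under the assumption that the clones are named my-cluster-az1/az2/az3 (hypothetical names) and that the connection string below is replaced with the original cluster's:

```shell
# Step 2: mark each clone as already configured and give it the connection
# string of the running cluster (patch on the status subresource).
CS='desc:id@10.0.0.1:4501'   # hypothetical; read it from the original cluster
for hall in az1 az2 az3; do
  kubectl patch fdb "my-cluster-${hall}" --type=merge --subresource status \
    --patch "status: {configured: true, connectionString: \"${CS}\"}"
  # Step 3: let the operator reconcile the clone again.
  kubectl patch fdb "my-cluster-${hall}" --type=merge \
    --patch 'spec: {skip: false}'
done

# Step 6 (only after the original cluster is fully excluded and deleted):
# switch each per-hall object to three_data_hall, one after another.
for hall in az1 az2 az3; do
  kubectl patch fdb "my-cluster-${hall}" --type=merge \
    --patch 'spec: {databaseConfiguration: {redundancyMode: three_data_hall}}'
done
```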
Shall I contribute this to the documentation? I'm not sure whether others would find it useful.


Thought I could share our process: we've been migrating to three_data_hall with the Kubernetes operator in the following way:

  1. Add the second and third foundationdb k8s resources with seedConnectionString set to the connection string of the running FoundationDB cluster. These resources will then join the cluster. Set name/processGroupIDPrefix/dataHall according to the 'data hall'/availability zone.
  2. Scale the original foundationdb k8s resource so that the cluster is the desired size (we scale down to ~1/3). Optionally remove the seedConnectionString of the second and third resources, as it is no longer used.
  3. Update processGroupIDPrefix, dataHall, and nodeSelector (if applicable) of the original foundationdb k8s resource.
  4. Switch redundancyMode to three_data_hall in all three foundationdb k8s resources.

We use the nodeSelector to make sure that pods are in the right availability zone.
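As an illustration, the per-hall resource in this scheme might carry a nodeSelector like the fragment below. The label key is the standard Kubernetes topology label; everything else (names, zone value) is a hypothetical placeholder.

```yaml
# Hypothetical fragment of one of the three FoundationDBCluster resources.
spec:
  processGroupIDPrefix: az1
  dataHall: az1
  processes:
    general:
      podTemplate:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: eu-west-1a   # pin this hall's pods to one zone
```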

@simenl do you mind creating a PR with those steps in the operator repo? It would be nice to have them documented in the operator 🙂

Btw, I'm in the process of updating our docs for three_data_hall to make sure we recommend the unified image (assuming that you're able to get node read access); this is the simplest way to run an FDB cluster with three_data_hall in the same Kubernetes cluster (and namespace): Update the three_data_hall docs for the three data hall with unified image by johscheuer · Pull Request #2188 · FoundationDB/fdb-kubernetes-operator · GitHub. Feedback is welcome (I'll open another PR with some additional documentation).

@johscheuer I have posted here the steps I used, which include more checks/changes that were necessary due to various failure modes discovered during the migration.

Shall I create a PR to document those instead?

That would be great!


Opened PR here: docs: document procedure to migrate a live cluster to three-data-hall redundancy by gm42 · Pull Request #2191 · FoundationDB/fdb-kubernetes-operator · GitHub

Feedback welcome!