Converting cluster from non-TLS to TLS seems to get stuck on coordinator change

hxu · May 29, 2025, 4:14pm

Hi all, we’re trying to turn on TLS in an existing three data hall cluster, and the process seems to get stuck moving the coordinators off the non-TLS processes.

Here are our versions:

Operator version - v2.3.0
FDB version - 7.1.67

We’ve mounted the certs on the operator, the FDB containers, and the sidecar containers. Our cluster is running in three data hall mode with 9 storage and 9 log servers configured in databaseConfiguration, and with the rest as defaults. So we end up with something like 9 storage + 9 log + 9 stateless servers in the cluster.

What seems to happen is that new TLS processes spin up and successfully join the cluster (we can see them in status), and all of the non-TLS processes get excluded and successfully removed from the cluster, except for 8 of the 9 coordinators, which are running non-TLS-only processes. They are excluded, but the operator seems to refuse to remove them or change coordinators because they are coordinators. Somehow one of the coordinators is TLS enabled, but its address in the connection string is the non-TLS port.
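One way to see which coordinators in the connection string are still on a non-TLS address is to parse the cluster file directly. This is only a minimal sketch, assuming the usual `description:id@address,address` cluster-file format in which TLS addresses carry a `:tls` suffix. The function name and the sample connection string are made up for illustration.

```python
def split_coordinators(connection_string):
    """Split a cluster-file connection string into TLS and non-TLS coordinators.

    Assumes the format `description:id@host:port[:tls],host:port[:tls],...`.
    """
    # Everything after the first "@" is the comma-separated coordinator list.
    _, _, addresses = connection_string.strip().partition("@")
    tls, plain = [], []
    for addr in filter(None, (a.strip() for a in addresses.split(","))):
        (tls if addr.endswith(":tls") else plain).append(addr)
    return tls, plain


# Hypothetical connection string: one TLS and two non-TLS coordinators.
tls, plain = split_coordinators(
    "mycluster:abc123@10.0.0.1:4500:tls,10.0.0.2:4501,10.0.0.3:4501"
)
print("TLS coordinators:", tls)
print("non-TLS coordinators:", plain)
```

If the non-TLS list is non-empty after the new pods have joined, the coordinator change has not happened yet.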

I’m not sure if it makes any difference, but all of the stuck coordinators are on log servers, whilst the coordinator that was able to update to the TLS configuration is a storage server.

Some examples of log lines from the operator that we see while it’s stuck, roughly in sequence for each reconciliation run:

IncorrectProcess - for the coordinators that are stuck as non-TLS processes
Process has invalid IP address - again, for the stuck coordinators
Cluster has an unhealthy coordinator - for all but the coordinator that got the TLS process
Cluster has not enough running coordinators - this says that we have 0 running coordinators, I assume because they are all marked unhealthy
Deferring coordinator change - this one is confusing to me. I believe the cluster needs new coordinators at this stage, yet CheckCoordinatorValidity reports that not all of the coordinator addresses are valid because of the non-TLS processes, and this causes the operator to not change the coordinators.
Block removal of Coordinator - The operator also doesn’t want to remove the non-TLS processes because they are coordinators.

Is this a bug, or is there something off about our configuration that is hitting an edge case? Is there a manual action that we can take (e.g. manually changing coordinators to include one TLS process) to unstick the process?

johscheuer · June 5, 2025, 7:42am

We have an open issue for converting a non-TLS cluster to TLS with the operator: TLS disabling and enabling leads cluster to unavailability · Issue #871 · FoundationDB/fdb-kubernetes-operator · GitHub. If you need this feature I’m happy to review your PR.

hxu · June 5, 2025, 8:45am

Ah, so does this mean that the operator is not able to convert a non-TLS cluster to TLS at this time?

We were thinking that there could be a manual step in the middle to unstick it. Would it suffice if we tried the conversion with the operator, and then when it is stuck, to manually change the coordinators to include one with TLS enabled?

johscheuer · June 5, 2025, 2:11pm

From the GitHub issue above:

In previous tests, running a kill all makes the database available.

So connecting to the cluster and running kill; kill all; sleep 10; status should work.

hxu · June 18, 2025, 8:05am

Just wanted to report back that we did manage to migrate successfully, but instead of the kill all command, we just ran coordinators auto to unstick the coordinator migration.

In fact, the kill; kill all; sleep 10; status sequence caused one test cluster to become unrecoverable, I think because we lost the cluster coordinator.

Some more context about how our deployment of FoundationDB works. We use a helm chart based on the samples to deploy the operator and a cluster. When we upgrade the helm chart, it creates a totally new set of pods with the new spec, and then gradually removes the old pods from the cluster. The certs are created as a Secret in our cluster and then mounted to the pods as part of our pod spec in the FoundationDBCluster resource.

I believe what this means is that the old pods without TLS will never have the right certs, because they aren’t mounted on the pods. So if we kill the processes on these pods, they will restart but will still not be able to connect with TLS.

I think what caused the cluster loss was that our cluster controller or all of our coordinators were still on the non-TLS pods, so when I restarted the processes, the new pods came up and maybe overwrote the coordinator state? I’m not totally sure. Here’s what our status showed, and I never got it to recover from this state (hostnames redacted):

>>> kill all
Attempted to kill 119 processes
>>> sleep 10

WARNING: Long delay (Ctrl-C to interrupt)
>>> status

Using cluster file `/tmp/fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)
  <hostname>:4501  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

I also saw from our logging that the cluster controller role seemed to jump between several servers when I issued the kill command.

What ended up working was this:

Upgrade the helm chart. This creates a bunch of new pods with the TLS certs and listening on both TLS and non-TLS ports
We wait for a while. The operator gradually excludes the non-TLS processes, and we can see in the status details output that the cluster starts getting more machines on TLS ports
Once all of the non-TLS processes are excluded, and the cluster controller is a TLS process (this seems to happen automatically?), then we run coordinators auto from the CLI
This picks 5 coordinators (we actually want 9, since we are in three data hall. I’m not sure why FDB picks only 5), which will be TLS processes. More importantly, this unsticks the operator, and the operator is able to then try to pick 9 new coordinators from the non-excluded processes (which are all TLS machines at this point)

After that, the operator seems to be able to do its thing and eventually reconciles the database.
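The two conditions we waited for before running coordinators auto (every non-TLS process excluded, and the cluster controller on a TLS process) can be checked from the output of status json. The following is only a rough sketch, not something we ran in production. It assumes the `cluster.processes` layout of status json, where each process has an `address`, an `excluded` flag and a list of `roles`, and it assumes that a process whose primary listener is TLS reports an address ending in `:tls`. The function name and the sample data are hypothetical.

```python
def ready_for_coordinators_auto(status):
    """Return (ok, reasons) for a parsed `status json` document.

    ok is True only if every non-TLS process is excluded and the
    cluster controller runs on a TLS process.
    """
    reasons = []
    controller_is_tls = False
    for proc in status["cluster"]["processes"].values():
        address = proc.get("address", "")
        is_tls = address.endswith(":tls")  # assumption: TLS addresses end in ":tls"
        if not is_tls and not proc.get("excluded", False):
            reasons.append(f"non-TLS process {address} is not excluded")
        roles = {r.get("role") for r in proc.get("roles", [])}
        if "cluster_controller" in roles:
            controller_is_tls = is_tls
    if not controller_is_tls:
        reasons.append("no cluster controller found on a TLS process")
    return (not reasons, reasons)


# Hypothetical, heavily trimmed status document for illustration.
sample = {"cluster": {"processes": {
    "a": {"address": "10.0.0.1:4500:tls", "excluded": False,
          "roles": [{"role": "cluster_controller"}]},
    "b": {"address": "10.0.0.2:4501", "excluded": True,
          "roles": [{"role": "coordinator"}]},
}}}
ok, reasons = ready_for_coordinators_auto(sample)
print(ok, reasons)
```

In practice the input would come from something like `json.loads` applied to the status json output, and coordinators auto would only be run when the check passes.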
