Converting cluster from non-TLS to TLS seems to get stuck on coordinator change

Hi all, we’re trying to turn on TLS in an existing three data hall cluster, and the process seems to get stuck moving the coordinators off the non-TLS processes.

Here are our versions:

  • Operator version - v2.3.0
  • FDB version - 7.1.67

We’ve mounted the certs on the operator, the FDB containers, and the sidecar containers. Our cluster is running in three data hall mode with 9 storage and 9 log servers configured in databaseConfiguration, and with the rest as defaults. So we end up with something like 9 storage + 9 log + 9 stateless servers in the cluster.

What seems to happen is that new TLS processes spin up and successfully join the cluster (we can see them in status), and all of the non-TLS processes get excluded and successfully removed from the cluster, except for 8 of the 9 coordinators, which are running on non-TLS-only processes. They are excluded, but the operator seems to refuse to remove them or change coordinators because they are coordinators. Somehow one of the coordinators is TLS enabled, but its address in the connection string is still the non-TLS port.
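To see at a glance which coordinators in the connection string are advertised as TLS, a small script along these lines can help. It is only a sketch: it assumes the usual cluster file layout of description:id@address,address,... with TLS addresses carrying a trailing :tls, and the connection string and ports in the example are made up for illustration.

```python
def split_coordinators(connection_string):
    """Group coordinator addresses by whether they end in the ':tls' suffix."""
    # Everything after '@' is the comma-separated coordinator list.
    _, _, addresses = connection_string.strip().partition("@")
    groups = {"tls": [], "non_tls": []}
    for address in addresses.split(","):
        address = address.strip()
        if not address:
            continue
        key = "tls" if address.endswith(":tls") else "non_tls"
        groups[key].append(address)
    return groups


# Hypothetical connection string, not taken from the cluster in this thread.
example = "mycluster:abc123@10.0.0.1:4501,10.0.0.2:4501,10.0.0.3:4500:tls"
for kind, addresses in split_coordinators(example).items():
    print(kind, addresses)
```

If every entry lands in the non_tls group while the processes behind them only listen on TLS, that matches the mismatch described above.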

I’m not sure if it makes any difference, but all of the stuck coordinators are on log servers, whilst the coordinator that was able to update to the TLS configuration is a storage server.

Some examples of log lines from the operator that we see while it’s stuck, roughly in sequence for each reconciliation run:

  • IncorrectProcess - for the coordinators that are stuck as non-TLS processes
  • Process has invalid IP address - again, for the stuck coordinators
  • Cluster has an unhealthy coordinator - for all but the coordinator that got the TLS process
  • Cluster has not enough running coordinators - this says that we have 0 running coordinators, I assume because they are all marked unhealthy
  • Deferring coordinator change - this one is confusing to me. I believe the cluster needs new coordinators at this stage, yet CheckCoordinatorValidity reports that not all of the coordinator addresses are valid because of the non-TLS processes, and this causes the operator to not change the coordinators.
  • Block removal of Coordinator - The operator also doesn’t want to remove the non-TLS processes because they are coordinators.

Is this a bug, or is there something off about our configuration that is hitting an edge case? Is there a manual action that we can take (e.g. manually changing coordinators to include one TLS process) to unstick the process?

We have an open issue for converting a non-TLS cluster to TLS with the operator: "TLS disabling and enabling leads cluster to unavailability" (Issue #871, FoundationDB/fdb-kubernetes-operator on GitHub). If you need this feature, I’m happy to review your PR.

Ah, so does this mean that the operator is not able to convert a non-TLS cluster to TLS at this time?

We were thinking that there could be a manual step in the middle to unstick it. Would it suffice if we tried the conversion with the operator and then, once it is stuck, manually changed the coordinators to include one with TLS enabled?

From the GitHub issue above:

In previous tests, running a kill all makes the database available.

So connecting to the cluster with fdbcli and running kill; kill all; sleep 10; status should work.
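If you would rather script that than type it interactively, something like the following could work. It is a rough sketch, not a tested procedure: it relies on fdbcli's -C and --exec options, the cluster file path is a placeholder you would need to replace, and on a TLS cluster fdbcli will also need its TLS certificate settings supplied (not shown here).

```python
import subprocess

# The command sequence suggested in the GitHub issue, in order.
RECOVERY_COMMANDS = ["kill", "kill all", "sleep 10", "status"]


def build_fdbcli_command(cluster_file, commands):
    """Build the argv for a single non-interactive fdbcli run."""
    # --exec takes one string of semicolon-separated fdbcli commands.
    return ["fdbcli", "-C", cluster_file, "--exec", "; ".join(commands)]


def run_recovery(cluster_file, timeout_seconds=120):
    """Run the sequence and return the completed process for inspection."""
    argv = build_fdbcli_command(cluster_file, RECOVERY_COMMANDS)
    return subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout_seconds)


# Placeholder path; printing the command first lets you review it before
# calling run_recovery() against a real cluster.
print(build_fdbcli_command("/path/to/fdb.cluster", RECOVERY_COMMANDS))
```

Note that kill all restarts every process in the cluster, so expect a brief recovery before status reports the database as available again.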