Hi all, we’re trying to turn on TLS in an existing three-data-hall cluster, and the process seems to get stuck moving the coordinators off the non-TLS processes.
Here are our versions:
- Operator version - v2.3.0
- FDB version - 7.1.67
We’ve mounted the certs on the operator, the FDB containers, and the sidecar containers. Our cluster is running in three data hall mode with 9 storage and 9 log servers configured in `databaseConfiguration`, and with the rest left as defaults. So we end up with something like 9 storage + 9 log + 9 stateless servers in the cluster.
What seems to happen is that new TLS processes spin up and successfully join the cluster (we can see them in `status`), and all of the non-TLS processes get excluded and successfully removed from the cluster, except for 8 of the 9 coordinators, which are running on non-TLS-only processes. They are excluded, but the operator seems to refuse to remove them or change coordinators because they are coordinators. Somehow one of the coordinators is TLS enabled, but its address in the connection string is the non-TLS port.
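In case it helps anyone reproduce the check, here is a minimal sketch of how the coordinator addresses in a connection string can be split into TLS and non-TLS entries. It relies only on the cluster file format (`description:id@addr1,addr2,...`, where a TLS address ends in `:tls`). The function name `split_coordinators` and the example connection string are hypothetical, not taken from our cluster or from the operator.

```python
def split_coordinators(connection_string):
    """Split the coordinators of an FDB connection string into (tls, non_tls) lists."""
    # Everything after "@" is the comma-separated coordinator list.
    _, _, addresses = connection_string.strip().partition("@")
    tls, non_tls = [], []
    for address in addresses.split(","):
        address = address.strip()
        if not address:
            continue
        # A TLS coordinator address carries a ":tls" suffix, e.g. 10.0.0.3:4500:tls.
        if address.endswith(":tls"):
            tls.append(address)
        else:
            non_tls.append(address)
    return tls, non_tls


# Hypothetical connection string, for illustration only.
example = "mycluster:abcd1234@10.0.0.1:4501,10.0.0.2:4501,10.0.0.3:4500:tls"
tls, non_tls = split_coordinators(example)
print("TLS coordinators:", tls)
print("non-TLS coordinators:", non_tls)
```

Running this against the connection string of the stuck cluster should show whether any coordinator entry has actually been switched to a `:tls` address.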
I’m not sure if it makes any difference, but all of the stuck coordinators are on log servers, whilst the coordinator that was able to update to the TLS configuration is a storage server.
Some examples of log lines from the operator that we see while it’s stuck, roughly in sequence for each reconciliation run:
- IncorrectProcess - for the coordinators that are stuck as non-TLS processes
- Process has invalid IP address - again, for the stuck coordinators
- Cluster has an unhealthy coordinator - for all but the coordinator that got the TLS process
- Cluster has not enough running coordinators - this says that we have 0 running coordinators, I assume because they are all marked unhealthy
- Deferring coordinator change - this one is confusing to me. I believe the cluster needs new coordinators at this stage, yet `CheckCoordinatorValidity` reports that not all of the coordinator addresses are valid because of the non-TLS processes, and this causes the operator to not change the coordinators.
- Block removal of Coordinator - the operator also doesn’t want to remove the non-TLS processes because they are coordinators.
Is this a bug, or is there something off about our configuration that is hitting an edge case? Is there a manual action that we can take (e.g. manually changing coordinators to include one TLS process) to unstick the process?