Cluster unhealthy after node replacement (which necessitated pod replacement)

Hi folks,

We’re on operator v1.11.0, and recently started the exercise of replacing our Kubernetes nodes for OS upgrades. After replacing several nodes, one of our clusters is now unhealthy, and I’m looking for ideas on how to resolve it.

We’re using the following in the cluster CR:

  routing:
    headlessService: true
    publicIPSource: service
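
For context, here is roughly how that snippet sits in the FoundationDBCluster resource; the FoundationDB version below is just illustrative, not necessarily what we run:

  apiVersion: apps.foundationdb.org/v1beta2
  kind: FoundationDBCluster
  metadata:
    name: retort-fdb
    namespace: dev
  spec:
    version: 7.1.26            # illustrative only
    routing:
      headlessService: true
      publicIPSource: service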

So a service is created for each storage pod, like this:

root@cowboy:/tmp# k get svc -n dev  | grep retort
retort-fdb-storage-1          ClusterIP   10.104.240.116   <none>        4500/TCP,4501/TCP                                       235d
retort-fdb-storage-10         ClusterIP   10.98.164.89     <none>        4500/TCP,4501/TCP                                       47m
retort-fdb-storage-2          ClusterIP   10.98.136.60     <none>        4500/TCP,4501/TCP                                       235d
retort-fdb-storage-3          ClusterIP   10.102.39.88     <none>        4500/TCP,4501/TCP                                       184d
retort-fdb-storage-4          ClusterIP   10.111.121.156   <none>        4500/TCP,4501/TCP                                       406d

In this case, the retort-fdb-storage-2 pod was deleted when its underlying Kubernetes node was replaced (the replacement node happens to have the same name, but the original node was removed from the Kubernetes cluster and re-created), while the service persists (see above).

When I exec into one of the FDB pods and run status inside fdbcli, I see that the cluster file still lists the service IP that was assigned to retort-fdb-storage-2 as a coordinator:

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.96.134.14:4501  (reachable)
  10.98.136.60:4501  (unreachable)
  10.102.39.88:4501  (reachable)
  10.104.240.116:4501  (reachable)
  10.111.121.156:4501  (reachable)

And:

Cluster description: retort_fdb
Cluster coordinators (5): 10.102.39.88:4501,10.104.240.116:4501,10.111.121.156:4501,10.96.134.14:4501,10.98.136.60:4501
Type `help coordinators' to learn how to change this information.
fdb>
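
(For reference, the output above comes from something along these lines; the pod and container names are just the ones I happened to pick:)

  kubectl -n dev exec -it retort-fdb-storage-1 -c foundationdb -- fdbcli
  fdb> status
  fdb> coordinators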

If I use the kubectl-fdb plugin to analyze the cluster, I get:

root@dog1:~# kubectl-fdb analyze -n dev retort-fdb
Checking cluster: dev/retort-fdb
✖ Cluster is not available
✖ Cluster is not fully replicated
✖ Cluster is not reconciled
✖ ProcessGroup: storage-9 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
⚠ Ignored 5 process groups marked for removal
✔ Pods are all running and available
Error:
found issues for cluster retort-fdb. Please check them
root@dog1:~#
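
The operator’s view of each process group (including the five marked for removal) should also be visible in the cluster’s status; the field names below are my assumption from the v1beta2 API, so double-check them against your CRD:

  kubectl -n dev get foundationdbcluster retort-fdb \
    -o jsonpath='{range .status.processGroups[*]}{.processGroupID}{"\t"}{.removalTimestamp}{"\n"}{end}'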

Since 4 of the 5 coordinators appear to be healthy, how should I proceed?

Thanks!
D

Are you cordoning and draining those nodes before taking them down? What is the actual process behind “replacing” a node?
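
The flow I’d expect looks something like this (standard kubectl commands; adjust the flags for your environment):

  kubectl cordon <node-name>
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  # ...reimage / replace the machine...
  kubectl delete node <node-name>
  # the replacement then rejoins under the same name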

Is this cluster TLS-enabled? Could there be an issue with the certificates on the newly created nodes? It’s probably best to check the trace events of this cluster to get a better idea of why the connection is not possible. And have you checked that the service is pointing to the right endpoint, i.e. the new Pod’s IP address?
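
E.g. something like this (the trace log path is the operator’s default as far as I know, so adjust if your setup differs):

  # does the per-pod service still have an endpoint, and does it match the new pod IP?
  kubectl -n dev get endpoints retort-fdb-storage-2 -o wide
  kubectl -n dev get pod retort-fdb-storage-2 -o wide

  # skim recent error-level FDB trace events on one of the pods
  kubectl -n dev exec retort-fdb-storage-1 -c foundationdb -- \
    sh -c 'grep -h "Severity=\"40\"" /var/log/fdb-trace-logs/*.xml | tail -n 20'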