Hi folks,
We’re on operator v1.11.0, and recently started the exercise of replacing our Kubernetes nodes for OS upgrades. After replacing several nodes, one of our clusters is now unhealthy, and I’m looking for ideas on how to resolve it.
We’re using the following in the cluster CR:
routing:
  headlessService: true
  publicIPSource: service
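(For context, that routing block sits under spec in our FoundationDBCluster CR; a trimmed sketch with placeholder values, not our exact manifest:)

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: retort-fdb
  namespace: dev
spec:
  version: 7.1.x   # placeholder
  routing:
    headlessService: true
    publicIPSource: service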
So a service is created for each storage pod, like this:
root@cowboy:/tmp# k get svc -n dev | grep retort
retort-fdb-storage-1 ClusterIP 10.104.240.116 <none> 4500/TCP,4501/TCP 235d
retort-fdb-storage-10 ClusterIP 10.98.164.89 <none> 4500/TCP,4501/TCP 47m
retort-fdb-storage-2 ClusterIP 10.98.136.60 <none> 4500/TCP,4501/TCP 235d
retort-fdb-storage-3 ClusterIP 10.102.39.88 <none> 4500/TCP,4501/TCP 184d
retort-fdb-storage-4 ClusterIP 10.111.121.156 <none> 4500/TCP,4501/TCP 406d
In this case, the retort-fdb-storage-2 pod was deleted when its underlying Kubernetes node was replaced (the replacement happens to have the same node name, but the node was removed from the Kubernetes cluster and re-created), yet the service persists (above).
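In case it helps, these are roughly the commands I’d use to confirm that state (sketch only, output omitted): the Service object is still there, but the pod behind it is gone, so its endpoints should be empty.

k get endpoints retort-fdb-storage-2 -n dev
k get pod retort-fdb-storage-2 -n dev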
When I exec into one of the FDB pods and run status inside fdbcli, I see that the cluster file still points to the service IP that was assigned to retort-fdb-storage-2:
Could not communicate with all of the coordination servers.
The database will remain operational as long as we
can connect to a quorum of servers, however the fault
tolerance of the system is reduced as long as the
servers remain disconnected.
10.96.134.14:4501 (reachable)
10.98.136.60:4501 (unreachable)
10.102.39.88:4501 (reachable)
10.104.240.116:4501 (reachable)
10.111.121.156:4501 (reachable)
And:
Cluster description: retort_fdb
Cluster coordinators (5): 10.102.39.88:4501,10.104.240.116:4501,10.111.121.156:4501,10.96.134.14:4501,10.98.136.60:4501
Type `help coordinators' to learn how to change this information.
fdb>
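For completeness, this is roughly how I check which connection string the operator itself has recorded on the cluster CR (the jsonpath is from memory, so treat it as a sketch):

kubectl get foundationdbcluster retort-fdb -n dev -o jsonpath='{.status.connectionString}'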
If I use the kubectl plugin to analyze the cluster, I get:
root@dog1:~# kubectl-fdb analyze -n dev retort-fdb
Checking cluster: dev/retort-fdb
✖ Cluster is not available
✖ Cluster is not fully replicated
✖ Cluster is not reconciled
✖ ProcessGroup: storage-9 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
⚠ Ignored 5 process groups marked for removal
✔ Pods are all running and available
Error:
found issues for cluster retort-fdb. Please check them
root@dog1:~#
Since 4 of the 5 coordinators are still reachable, how should I proceed?
Thanks!
D