Hi folks,
We’re on operator v1.11.0, and recently started the exercise of replacing our Kubernetes nodes for OS upgrades. After replacing several nodes, one of our clusters is now unhealthy, and I’m looking for ideas on how to resolve it.
We’re using the following in the cluster CR:
routing:
  headlessService: true
  publicIPSource: service
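(For context, that routing block sits under spec in our FoundationDBCluster CR; a trimmed sketch with placeholder values, not our exact manifest:)

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: retort-fdb
  namespace: dev
spec:
  version: 7.1.x   # placeholder
  routing:
    headlessService: true
    publicIPSource: service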
So a service is created for each storage pod, like this:
root@cowboy:/tmp# k get svc -n dev | grep retort
retort-fdb-storage-1 ClusterIP 10.104.240.116 <none> 4500/TCP,4501/TCP 235d
retort-fdb-storage-10 ClusterIP 10.98.164.89 <none> 4500/TCP,4501/TCP 47m
retort-fdb-storage-2 ClusterIP 10.98.136.60 <none> 4500/TCP,4501/TCP 235d
retort-fdb-storage-3 ClusterIP 10.102.39.88 <none> 4500/TCP,4501/TCP 184d
retort-fdb-storage-4 ClusterIP 10.111.121.156 <none> 4500/TCP,4501/TCP 406d
In this case, the retort-fdb-storage-2 pod was deleted when its underlying Kubernetes node was replaced (the replacement happens to have the same node name, but the node was removed from the Kubernetes cluster and re-created), yet the service persists (above).
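In case it helps, these are roughly the commands I’d use to confirm that state (sketch only, output omitted): the Service object is still there, but the pod behind it is gone, so its endpoints should be empty.

k get endpoints retort-fdb-storage-2 -n dev
k get pod retort-fdb-storage-2 -n dev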
When I exec into one of the FDB pods and run status inside fdbcli, I see that the cluster file still points to the service IP that was assigned to retort-fdb-storage-2:
Could not communicate with all of the coordination servers.
The database will remain operational as long as we
can connect to a quorum of servers, however the fault
tolerance of the system is reduced as long as the
servers remain disconnected.
10.96.134.14:4501 (reachable)
10.98.136.60:4501 (unreachable)
10.102.39.88:4501 (reachable)
10.104.240.116:4501 (reachable)
10.111.121.156:4501 (reachable)
And:
Cluster description: retort_fdb
Cluster coordinators (5): 10.102.39.88:4501,10.104.240.116:4501,10.111.121.156:4501,10.96.134.14:4501,10.98.136.60:4501
Type `help coordinators' to learn how to change this information.
fdb>
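For completeness, this is roughly how I check which connection string the operator itself has recorded on the cluster CR (the jsonpath is from memory, so treat it as a sketch):

kubectl get foundationdbcluster retort-fdb -n dev -o jsonpath='{.status.connectionString}'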
If I use the kubectl plugin to analyze the cluster, I get:
root@dog1:~# kubectl-fdb analyze -n dev retort-fdb
Checking cluster: dev/retort-fdb
✖ Cluster is not available
✖ Cluster is not fully replicated
✖ Cluster is not reconciled
✖ ProcessGroup: storage-9 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-9 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingProcesses since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPod since 2023-08-12 00:02:44 +0000 UTC
✖ ProcessGroup: storage-10 has the following condition: MissingPVC since 2023-08-12 00:02:44 +0000 UTC
⚠ Ignored 5 process groups marked for removal
✔ Pods are all running and available
Error:
found issues for cluster retort-fdb. Please check them
root@dog1:~#
Since 4 of the 5 coordinators are still reachable, how should I proceed?
Thanks!
D