I’m trying to use the Kubernetes operator.
I’ve set up the cluster (successfully, I hope) and I can see the pods come up.
For some reason the operator fails on the status check:
/var/dynamic-conf/fdb.cluster’.
Could not communicate with a quorum of coordination servers:
12.13.44.16:4501 (unreachable)
12.13.80.35:4501 (unreachable)
12.13.90.14:4501 (unreachable)
These are IPs of older pods that were once up and no longer exist.
How do I get the k8s operator to refresh this file across the whole cluster?
I’ve tried deleting the pods and the operator itself, but when everything comes back up I get the same error with the same IPs.
If all of the pods have had their IPs changed, then recovery will be a complex process. We have an open issue on the broader problem of IP fragility (https://github.com/FoundationDB/fdb-kubernetes-operator/issues/266), but we should also document it until we have a real fix. I’ve never encountered this myself, so I can’t be sure of the correct recovery path. One possible path is:
1. Manually edit the connection string in the operator status to have the correct IPs
2. Manually edit the cluster files in /var/fdb/data/fdb.cluster on each pod to have the correct IPs
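For anyone following along, here is a rough shell sketch of the two manual edits above. It assumes the usual cluster-file format, `description:ID@ip:port,ip:port,...`, and that only the part after the `@` needs to change. The helper names, the pod name, and the 10.0.0.x addresses are hypothetical. The container name `foundationdb` and the path `/var/fdb/data/fdb.cluster` come from this thread. This is untested as a full recovery procedure, so treat it as an illustration only.

```shell
#!/bin/sh
# Sketch only: helper names and example values are hypothetical.

# Replace the coordinator list in a connection string, keeping the
# "description:ID" prefix that precedes the '@'.
rewrite_coordinators() {
  conn="$1"
  new_coords="$2"
  printf '%s@%s\n' "${conn%%@*}" "$new_coords"
}

# Overwrite the cluster file inside one pod with the given connection string.
# Assumes the container is named "foundationdb" and has a shell available.
write_cluster_file() {
  pod="$1"
  conn="$2"
  printf '%s\n' "$conn" |
    kubectl exec -i "$pod" -c foundationdb -- \
      sh -c 'cat > /var/fdb/data/fdb.cluster'
}

# Example (hypothetical values):
#   new=$(rewrite_coordinators 'mycluster:abc123@12.13.44.16:4501' '10.0.0.1:4501')
#   write_cluster_file my-fdb-pod-1 "$new"
```

The same rewritten string is what would go into the connection string in the operator status, so that the operator and the pods agree on the coordinators.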
What happens when scaling nodes up or down if a node no longer exists?
Should we make the IPs persistent? Maybe with StatefulSets?
Also, since I’m relatively new to FDB and trying to assess whether FDB is a good fit for us, can you explain how to do the following:
Manually edit the connection string in the operator status to have the correct IPs
Before scaling down nodes, we will exclude the nodes to be removed, and change the coordinators to only include remaining nodes.
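As a rough illustration, the manual equivalent of those two steps in fdbcli would look something like the sketch below. The `exclude` and `coordinators` commands are standard fdbcli commands, but the helper names, the `DRY_RUN` switch, and the addresses are hypothetical, and this is not necessarily how the operator issues them internally. It assumes fdbcli is run somewhere with a valid cluster file.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and addresses are hypothetical.

# With DRY_RUN set, print the command instead of running it.
# Note that the printed form loses the original quoting.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Exclude the processes that are about to be removed, so their data
# is moved elsewhere before the pods go away.
exclude_processes() {
  run fdbcli --exec "exclude $*"
}

# Point the cluster at a coordinator set made up only of processes
# that will remain after the scale-down.
set_coordinators() {
  run fdbcli --exec "coordinators $*"
}

# Example (hypothetical addresses):
#   exclude_processes 10.0.0.5:4501
#   set_coordinators 10.0.0.1:4501 10.0.0.2:4501 10.0.0.3:4501
```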
StatefulSet wouldn’t provide any guarantees of IP persistence. If the pods get deleted, the StatefulSet controller would create new pods which would likely get different IPs. The issue I linked above has some discussion of what we would need to do to address this problem in the general case.
You can edit the connection string in the resource status by using kubectl edit fdb X, where X is the name of your cluster.
To bounce the processes, you’ll want to use kubectl exec to get into the foundationdb container in the pod, and run pkill fdbserver or some other command that will kill the processes.
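Put together, the edit and the bounce could be scripted roughly as follows. This is a minimal sketch: the helper names, the `DRY_RUN` switch, and the pod names are hypothetical, and it assumes the container is named `foundationdb` and that `pkill` is available inside it.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and pod names are hypothetical.

# With DRY_RUN set, print the command instead of running it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Open the cluster resource in an editor so the connection string
# in its status can be corrected by hand.
edit_cluster() {
  run kubectl edit fdb "$1"
}

# Kill fdbserver in the foundationdb container of each given pod.
bounce_pods() {
  for pod in "$@"; do
    run kubectl exec "$pod" -c foundationdb -- pkill fdbserver
  done
}

# Example (hypothetical names):
#   edit_cluster my-cluster
#   bounce_pods my-cluster-storage-1 my-cluster-storage-2
```

Running it once with `DRY_RUN=1` set prints the kubectl commands without touching the cluster, which is a cheap way to check the pod list first.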
Thank you for the useful cluster recovery tips. I ran into this when a couple of k8s nodes were scaled down at the same time and figured out how to get this recovery to work.