K8s operator fdb.cluster IP addresses issue

I’m trying to use the Kubernetes operator.
I’ve successfully set up the cluster (I hope so) and I see the pods come up.
For some reason the operator fails on the status check with an error referencing ‘/var/dynamic-conf/fdb.cluster’:

Could not communicate with a quorum of coordination servers:
12.13.44.16:4501 (unreachable)
12.13.80.35:4501 (unreachable)
12.13.90.14:4501 (unreachable)

These are the IPs of older pods that were once up and no longer exist.
How do I get the k8s operator to refresh this file for the whole cluster?
I’ve tried deleting the pods and the operator itself, but when everything comes back up, it’s the same error with the same IPs.

If the pods have all had their IPs changed, then recovery will be a complex process. We have an open issue on the broader problem of IP fragility (https://github.com/FoundationDB/fdb-kubernetes-operator/issues/266), but we should get it into the documentation as well until we have a real fix. I’ve never encountered this myself, so I can’t be sure of the correct recovery path. One possible path (with a rough command sketch after the list) is:

  1. Manually edit the connection string in the operator status to have the correct IPs
  2. Manually edit the cluster files in /var/fdb/data/fdb.cluster on each pod to have the correct IPs
  3. Bounce all of the fdbserver processes
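
For step 2, a very rough sketch of what the in-pod edit could look like, assuming the main container is named foundationdb (as in the operator’s default pod spec); the pod name and connection string below are placeholders:

  # Step 2 sketch: overwrite the on-disk cluster file in one pod (repeat for every pod).
  # <pod-name> and <new connection string> are placeholders.
  kubectl exec <pod-name> -c foundationdb -- \
    sh -c "echo '<new connection string>' > /var/fdb/data/fdb.cluster"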

What happens on scale-up/scale-down of nodes if a node no longer exists?
Should we set IPs to be persistent? Maybe StatefulSets?
Also, since I’m relatively new to FDB and trying to assess whether or not FDB is a good fit for our use case, can you explain how to do the following:

  1. Manually edit the connection string in the operator status to have the correct IPs
  2. Bounce all of the fdbserver processes

Before scaling down nodes, we will exclude the nodes to be removed, and change the coordinators to only include remaining nodes.
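
As far as I understand, those two steps map onto the ordinary fdbcli exclude and coordinators commands. Done by hand it would look roughly like this (the address is only an example):

  # Run inside fdbcli, e.g. via: kubectl exec -it <pod-name> -c foundationdb -- fdbcli
  exclude 10.0.1.113:4501    # safely move data and roles off the process being removed
  coordinators auto          # re-pick coordinators from the remaining processes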

StatefulSet wouldn’t provide any guarantees of IP persistence. If the pods get deleted, the StatefulSet controller would create new pods which would likely get different IPs. The issue I linked above has some discussion of what we would need to do to address this problem in the general case.

You can edit the connection string in the resource status by using kubectl edit fdb X, where X is the name of your cluster.

To bounce the processes, you’ll want to use kubectl exec to get into the foundationdb container in each pod and run pkill fdbserver, or some other command that will kill the processes.
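
For example (the pod name is a placeholder for whatever your cluster’s pods are called):

  # Repeat for every pod in the cluster; <pod-name> is a placeholder.
  kubectl exec <pod-name> -c foundationdb -- pkill fdbserver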


Thank you for the useful cluster recovery tips. I ran into this when a couple of k8s nodes were scaled down at the same time and figured out how to get this recovery to work.

Setting the connectionString using kubectl edit fdb <cluster name> doesn’t work because of this issue: https://github.com/kubernetes/kubectl/issues/564 but I got it to work with curl!

Here’s an example of updating my connectionString on a cluster called documents (first have kubectl proxy running in another tab):

curl localhost:8001/apis/apps.foundationdb.org/v1beta1/namespaces/default/foundationdbclusters/documents/status \
  --header 'Content-Type: application/json-patch+json' \
  --request PATCH \
  --data '[{"op": "replace", "path": "/status/connectionString", "value": "documents:<redacted>@10.0.1.113:4501,10.0.3.68:4501,10.0.2.119:4501"}]'