K8s operator fdb.cluster IP addresses issue

ghenom · August 3, 2020, 8:04pm

I’m trying to use the Kubernetes operator.
I’ve set up the cluster (successfully, I hope) and I can see the pods come up.
For some reason the operator fails on the status check:
/var/dynamic-conf/fdb.cluster’.

Could not communicate with a quorum of coordination servers:
12.13.44.16:4501 (unreachable)
12.13.80.35:4501 (unreachable)
12.13.90.14:4501 (unreachable)

These are IPs of older pods that were once up and no longer exist.
How do I get the k8s operator to refresh this file across the whole cluster?
I’ve tried deleting the pods and the operator itself, but when everything comes back up, I get the same error with the same IPs.

john_brownlee · August 3, 2020, 8:22pm

If all of the pods have had their IPs changed, then recovery will be a complex process. We have an open issue on the broader problem with IP fragility (https://github.com/FoundationDB/fdb-kubernetes-operator/issues/266), but we should also get it into the documentation until we have a real fix. I’ve never encountered this myself, so I can’t be sure of the correct recovery path. One possible path is:

1. Manually edit the connection string in the operator status to have the correct IPs
2. Manually edit the cluster files in /var/fdb/data/fdb.cluster on each pod to have the correct IPs
3. Bounce all of the fdbserver processes
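
For the cluster-file step, the edit itself is just a swap of the address list after the "@". A minimal sketch, assuming the usual description:ID@ip:port,ip:port layout of a cluster file; replace_coordinators is a hypothetical helper name, not something the operator provides:

```shell
# Hypothetical helper: keep the "description:ID" prefix of a connection
# string and swap in a new comma-separated coordinator address list.
replace_coordinators() {
  old_conn=$1   # the single line currently stored in fdb.cluster
  new_addrs=$2  # e.g. "10.0.1.113:4501,10.0.3.68:4501"
  # ${old_conn%%@*} strips everything from the first "@" onward.
  printf '%s@%s\n' "${old_conn%%@*}" "$new_addrs"
}
```

Calling it with the old connection string and the new address list prints the line to write back into the file. The current pod IPs can be read from kubectl get pods -o wide.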

ghenom · August 3, 2020, 8:52pm

What happens when nodes are scaled up or down and a node no longer exists?
Should we make the IPs persistent? Maybe with StatefulSets?
Also, since I’m relatively new to fdb and trying to assess whether fdb is a good fit for us, can you explain how to do the following:

Manually edit the connection string in the operator status to have the correct IPs
Bounce all of the fdbserver processes

john_brownlee · August 3, 2020, 11:39pm

Before scaling down nodes, we will exclude the nodes to be removed, and change the coordinators to only include remaining nodes.

StatefulSet wouldn’t provide any guarantees of IP persistence. If the pods get deleted, the StatefulSet controller would create new pods which would likely get different IPs. The issue I linked above has some discussion of what we would need to do to address this problem in the general case.

john_brownlee · August 4, 2020, 3:48pm

You can edit the connection string in the resource status by using kubectl edit fdb X, where X is the name of your cluster.

To bounce the processes, you’ll want to use kubectl exec to get into the foundationdb container in the pod, and run pkill fdbserver or some other command that will kill the processes.
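
As a rough sketch of the bounce, assuming the container is named foundationdb as described above and that pkill is available in its image; bounce_pod and the KUBECTL override are hypothetical conveniences, not part of the operator:

```shell
# Hypothetical helper: kill the fdbserver processes inside one pod.
# Set KUBECTL=echo to preview the command instead of running it.
bounce_pod() {
  pod=$1
  ${KUBECTL:-kubectl} exec "$pod" -c foundationdb -- pkill fdbserver
}
```

Run it once per pod in the cluster, passing the pod name as the only argument.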

kcking · September 18, 2020, 9:27am

Thank you for the useful cluster recovery tips. I ran into this when a couple of k8s nodes were scaled down at the same time and figured out how to get this recovery to work.

Setting the connectionString using kubectl edit fdb <cluster name> doesn’t work because of this issue: https://github.com/kubernetes/kubectl/issues/564 but I got it to work with curl!

Here’s an example of updating my connectionString on a cluster called documents (first have kubectl proxy running in another tab):

curl localhost:8001/apis/apps.foundationdb.org/v1beta1/namespaces/default/foundationdbclusters/documents/status --header "Content-Type: application/json-patch+json" --request PATCH --data '[{"op": "replace", "path": "/status/connectionString", "value": "documents:<redacted>@10.0.1.113:4501,10.0.3.68:4501,10.0.2.119:4501"}]'
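
If the IPs change again, the JSON Patch body can be generated rather than hand-edited. A minimal sketch; connection_string_patch is a hypothetical helper name, and it assumes the connection string contains no characters that need JSON escaping:

```shell
# Hypothetical helper: print a JSON Patch document that replaces
# /status/connectionString with the value given as the first argument.
connection_string_patch() {
  printf '[{"op": "replace", "path": "/status/connectionString", "value": "%s"}]\n' "$1"
}
```

The output can be handed to the curl command above as --data "$(connection_string_patch '<new connection string>')".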
