Context: I did some experiments with the operator in a test cluster. Under load, Kubernetes evicted the storage nodes because they were using too much memory (I had specified quite low resource requests and limits, since it is a small cluster just for experiments).
So far, this is expected given my configuration mistakes.
However, the storage Pods remained in the “evicted” state, even after adding new nodes to the cluster. This is unexpected, since other deployments in Kubernetes are automatically restarted.
So I manually killed the pods to restore the database. This changed the IP addresses of the coordinators, which apparently is not yet supported by the operator (it seems to be a planned feature: Referencing pods by IP is fragile · Issue #266 · FoundationDB/fdb-kubernetes-operator · GitHub).
I found this thread, which explains how to handle changed IP addresses: K8s operator fdb.cluster IP addresses issue
- Manually edit the connection string in the operator status to have the correct IPs
- Manually edit the cluster files in /var/fdb/data/fdb.cluster on each pod to have the correct IPs
- Bounce all of the fdbserver processes
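For the cluster-file step, here is a minimal sketch of how the connection string text could be rewritten. It assumes the usual `description:id@ip:port,ip:port,...` layout with IPv4 addresses. The function name, cluster name, ID, and IPs are all hypothetical, and it only transforms the string; it does not touch the pods or the operator.

```python
def replace_coordinators(connection_string: str, ip_map: dict) -> str:
    """Return the connection string with old coordinator IPs swapped for new ones.

    Assumes the form 'description:id@ip:port,ip:port,...' with IPv4 addresses.
    Addresses whose IP is not in ip_map are kept unchanged.
    """
    prefix, sep, coordinators = connection_string.strip().partition("@")
    if not sep:
        raise ValueError("expected 'description:id@addr1,addr2,...'")
    updated = []
    for address in coordinators.split(","):
        # 'rest' is the port plus any suffix that follows it (for example ':tls').
        ip, colon, rest = address.partition(":")
        updated.append(f"{ip_map.get(ip, ip)}{colon}{rest}")
    return f"{prefix}@{','.join(updated)}"


# Hypothetical values, for illustration only.
old = "example_cluster:abc123@10.1.0.5:4501,10.1.0.6:4501,10.1.0.7:4501"
ip_map = {"10.1.0.5": "10.1.2.15", "10.1.0.7": "10.1.3.9"}
print(replace_coordinators(old, ip_map))
```

The resulting string would still have to be written to /var/fdb/data/fdb.cluster on each pod by hand, as described in the steps above.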
Unfortunately, the first step did not work: after I edited the connection string, the status was immediately reset to the old value.
I also tried editing the ConfigMap, but the value was changed back to the old one after a short time.
I was still able to get the cluster working again, but the fdb-operator can no longer talk to it. And since the ConfigMap also contains the wrong values, any newly started pods will get the wrong cluster file and fail to connect.
Finally, I tried the kubectl fdb analyze example-cluster --auto-fix command, which suggested replacing all instances in the cluster. However, this just started the new processes without updating the connection string in the operator, so it did not fix the problem.
- How can I correctly recover from IP address changes?
- Is it expected that Pods are not automatically restarted after being evicted, or is this a configuration error on my side?
- From my first experience with the operator, I have the impression that it is not yet ready for production use and that I should rather set up the database outside of Kubernetes on machines with fixed IP addresses. Would you agree with this impression, or is it something that can be fixed by configuring the operator differently?