FDB kubernetes operator continuously boucing processes

larshagen · April 29, 2020, 10:57am

We are running the kubernetes operator in GCP, and we seem to have gotten into an error loop, where the operator is continuously bouncing all processes every 15 minutes or so.

From running fdbcli status json, it looks like the cluster is healthy, and I am not able to tell from the logs what the error could be.

Any pointers on how to investigate further would be appreciated.

From the operator logs:
2020-04-29T10:40:07.679Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "kill; kill 10.36.57.2:4500 10.36.62.2:4500 10.36.93.2:4500 10.36.75.2:4500 10.36.74.3:4500 10.36.59.2:4500 10.36.73.3:4500 10.36.31.2:4500 10.36.10.3:4500 10.36.63.3:4500 10.36.58.3:4500 10.36.80.3:4500 10.36.71.3:4500 10.36.81.2:4500 10.36.84.3:4500 10.36.82.3:4500 10.36.76.2:4500 10.36.50.3:4500 10.36.64.2:4500 10.36.3.2:4500 10.36.77.3:4500 10.36.55.2:4500 10.36.49.3:4500 10.36.85.2:4500 10.36.51.2:4500 10.36.83.2:4500 10.36.65.3:4500; status", "-C", "/tmp/748823614", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:07.679Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738317"}, "reason": "BouncingInstances", "message": "Bouncing processes: [10.36.57.2:4500:tls 10.36.62.2:4500:tls 10.36.93.2:4500:tls 10.36.75.2:4500:tls 10.36.74.3:4500:tls 10.36.59.2:4500:tls 10.36.73.3:4500:tls 10.36.31.2:4500:tls 10.36.10.3:4500:tls 10.36.63.3:4500:tls 10.36.58.3:4500:tls 10.36.80.3:4500:tls 10.36.71.3:4500:tls 10.36.81.2:4500:tls 10.36.84.3:4500:tls 10.36.82.3:4500:tls 10.36.76.2:4500:tls 10.36.50.3:4500:tls 10.36.64.2:4500:tls 10.36.3.2:4500:tls 10.36.77.3:4500:tls 10.36.55.2:4500:tls 10.36.49.3:4500:tls 10.36.85.2:4500:tls 10.36.51.2:4500:tls 10.36.83.2:4500:tls 10.36.65.3:4500:tls]"} 2020-04-29T10:40:09.968Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": ">>> kill\n\nThe follow..."} 2020-04-29T10:40:09.971Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/897721376", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:11.993Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:12.012Z INFO controller Cluster was not fully reconciled by reconciliation process 2020-04-29T10:40:12.013Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/042902098", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:14.039Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:14.058Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:16.080Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:16.080Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--version", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:16.084Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "FoundationDB CLI 6.2..."} 2020-04-29T10:40:16.960Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/833794900", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:18.984Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:18.984Z INFO controller Waiting for database to be healthy {"namespace": "default", "cluster": "timeseries-test-cluster"} 2020-04-29T10:40:18.985Z INFO controller Reconciliation terminated early {"namespace": "default", "name": "timeseries-test-cluster", "lastAction": "controllers.UpdateDatabaseConfiguration"} 2020-04-29T10:40:18.985Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738562"}, "reason": "NeedsConfigurationChange", "message": "Spec require configuration change to triple ssd-2 usable_regions=1 logs=6 proxies=3 resolvers=1 log_routers=-1 remote_logs=-1 regions=, but cluster is not healthy"} 2020-04-29T10:40:18.985Z INFO controller Requeuing reconciliation {"subReconciler": "controllers.UpdateDatabaseConfiguration", "namespace": "default", "cluster": "timeseries-test-cluster"}

alexmiller · April 29, 2020, 11:40pm

It’s odd for the operator to think that your cluster is not healthy, but for status json to say that it is healthy. Would it be possible for you to pastebin and post your status json?

Regardless, the operator probably shouldn’t be looping on trying to change the configuration when the cluster is unhealthy… @john_brownlee?

larshagen · April 30, 2020, 9:14am

Status json: https://pastebin.com/bJQnYruE

john_brownlee · April 30, 2020, 4:17pm

If it’s bouncing the processes repeatedly, my guess would be that it’s trying to bounce the processes to pick up some change to the command line, but the command line is still not matching what the operator expects. There’s an issue that discusses this as a potential problem: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/2.

Something similar could be happening with the database configuration. Just looking at the output of the configuration command it wants to run and the status JSON, they seem compatible, but this could be a weird situation with the log_spill configuration. We added support for that recently, but the fact that the database reports a log_spill value in the status even when one has not been explicitly set presents some challenges. It might be worth investigating that once the bounce problem is resolved, since that seems more troubling to me.

Would you mind posting your cluster spec as well? That would help to understand what the operator is trying to change.

larshagen · May 1, 2020, 10:51am

Thanks for looking into this!

Here is the cluster config: https://pastebin.com/xqQw6CYs

john_brownlee · May 1, 2020, 9:57pm

The operator doesn’t support specifying a custom data-dir parameter, since it supplies the data-dir parameter itself. At a minimum we should document this limitation, and add better validations so that the operator rejects the configuration instead of bouncing the processes in a loop. There are likely other parameters that have this behavior as well. We can consider allowing a custom data dir parameter, but it will probably be best handled through an explicit field in the cluster spec. Though I don’t know the details of your environment and your requirements, using a hostPath volume seems like a risky configuration to me. Based on the documentation in Kubernetes (https://kubernetes.io/docs/concepts/storage/volumes/#hostpath), it seems like a bad fit for how FDB uses its volume. We do support using custom storage classes for the PVC, so if you can model your volumes through PVs with a defined storage class, that should work as-is.

I think the attempts to change the database configuration are a false positive caused by the cluster being unhealthy due to the repeated bounces. I was able to reproduce this locally, and confirmed that it has missing values in the cluster status that causes it to think the database configuration is empty. I think it’s correct for the operator to hit a problem there, since it can’t determine if it needs to change the database configuration in that state. We may want to have it pass the change_database_configuration step in this state, to allow it to take other reconciliation steps, but have it hit a failure in update_status so that we know that reconciliation may not be complete.

larshagen · May 2, 2020, 11:39am

Thanks, that makes sense, and the intention was not to go into production with hostpath volumes. This was deemed the quickest way to estimate performance of FDB on local SSDs and compare it to non-local persistent volumes. Modelling the local SSDs as PVs should be possible, and something we need to do if we decide to use local SSDs in production.

Should I create an issue on verifying that this property is not set in the cluster definition?

john_brownlee · May 2, 2020, 8:02pm

Yeah, if you don’t mind creating an issue that would be great!

larshagen · May 3, 2020, 7:18am

Created https://github.com/FoundationDB/fdb-kubernetes-operator/issues/214

Topic		Replies	Views
FDB regular recovery with operator Kubernetes Operator	16	298	April 29, 2025
Looking to tweak FDB performance on Kubernetes with the FDB Operator Running FoundationDB performance	1	504	September 18, 2023
Operator crashes with `illegal instruction` Kubernetes Operator Development	3	312	January 31, 2023
Stateless node keep reaching out to removed storage node Kubernetes Operator operator	0	387	April 14, 2022
K8s operator fdb.cluster IP addresses issue Kubernetes Operator operator	5	1200	September 18, 2020

FDB kubernetes operator continuously boucing processes

Related topics