FDB kubernetes operator continuously boucing processes

We are running the kubernetes operator in GCP, and we seem to have gotten into an error loop, where the operator is continuously bouncing all processes every 15 minutes or so.

From running fdbcli status json, it looks like the cluster is healthy, and I am not able to tell from the logs what the error could be.

Any pointers on how to investigate further would be appreciated.

From the operator logs:
2020-04-29T10:40:07.679Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "kill; kill 10.36.57.2:4500 10.36.62.2:4500 10.36.93.2:4500 10.36.75.2:4500 10.36.74.3:4500 10.36.59.2:4500 10.36.73.3:4500 10.36.31.2:4500 10.36.10.3:4500 10.36.63.3:4500 10.36.58.3:4500 10.36.80.3:4500 10.36.71.3:4500 10.36.81.2:4500 10.36.84.3:4500 10.36.82.3:4500 10.36.76.2:4500 10.36.50.3:4500 10.36.64.2:4500 10.36.3.2:4500 10.36.77.3:4500 10.36.55.2:4500 10.36.49.3:4500 10.36.85.2:4500 10.36.51.2:4500 10.36.83.2:4500 10.36.65.3:4500; status", "-C", "/tmp/748823614", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:07.679Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738317"}, "reason": "BouncingInstances", "message": "Bouncing processes: [10.36.57.2:4500:tls 10.36.62.2:4500:tls 10.36.93.2:4500:tls 10.36.75.2:4500:tls 10.36.74.3:4500:tls 10.36.59.2:4500:tls 10.36.73.3:4500:tls 10.36.31.2:4500:tls 10.36.10.3:4500:tls 10.36.63.3:4500:tls 10.36.58.3:4500:tls 10.36.80.3:4500:tls 10.36.71.3:4500:tls 10.36.81.2:4500:tls 10.36.84.3:4500:tls 10.36.82.3:4500:tls 10.36.76.2:4500:tls 10.36.50.3:4500:tls 10.36.64.2:4500:tls 10.36.3.2:4500:tls 10.36.77.3:4500:tls 10.36.55.2:4500:tls 10.36.49.3:4500:tls 10.36.85.2:4500:tls 10.36.51.2:4500:tls 10.36.83.2:4500:tls 10.36.65.3:4500:tls]"} 2020-04-29T10:40:09.968Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": ">>> kill\n\nThe follow..."} 2020-04-29T10:40:09.971Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/897721376", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:11.993Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:12.012Z INFO controller Cluster was not fully reconciled by reconciliation process 2020-04-29T10:40:12.013Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/042902098", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:14.039Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:14.058Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:16.080Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:16.080Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--version", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:16.084Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "FoundationDB CLI 6.2..."} 2020-04-29T10:40:16.960Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/833794900", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]} 2020-04-29T10:40:18.984Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."} 2020-04-29T10:40:18.984Z INFO controller Waiting for database to be healthy {"namespace": "default", "cluster": "timeseries-test-cluster"} 2020-04-29T10:40:18.985Z INFO controller Reconciliation terminated early {"namespace": "default", "name": "timeseries-test-cluster", "lastAction": "controllers.UpdateDatabaseConfiguration"} 2020-04-29T10:40:18.985Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738562"}, "reason": "NeedsConfigurationChange", "message": "Spec require configuration change to triple ssd-2 usable_regions=1 logs=6 proxies=3 resolvers=1 log_routers=-1 remote_logs=-1 regions=, but cluster is not healthy"} 2020-04-29T10:40:18.985Z INFO controller Requeuing reconciliation {"subReconciler": "controllers.UpdateDatabaseConfiguration", "namespace": "default", "cluster": "timeseries-test-cluster"}

It’s odd for the operator to think that your cluster is not healthy, but for status json to say that it is healthy. Would it be possible for you to pastebin and post your status json?

Regardless, the operator probably shouldn’t be looping on trying to change the configuration when the cluster is unhealthy… @john_brownlee?

Status json: https://pastebin.com/bJQnYruE

If it’s bouncing the processes repeatedly, my guess would be that it’s trying to bounce the processes to pick up some change to the command line, but the command line is still not matching what the operator expects. There’s an issue that discusses this as a potential problem: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/2.

Something similar could be happening with the database configuration. Just looking at the output of the configuration command it wants to run and the status JSON, they seem compatible, but this could be a weird situation with the log_spill configuration. We added support for that recently, but the fact that the database reports a log_spill value in the status even when one has not been explicitly set presents some challenges. It might be worth investigating that once the bounce problem is resolved, since that seems more troubling to me.

Would you mind posting your cluster spec as well? That would help to understand what the operator is trying to change.

Thanks for looking into this!

Here is the cluster config: https://pastebin.com/xqQw6CYs

The operator doesn’t support specifying a custom data-dir parameter, since it supplies the data-dir parameter itself. At a minimum we should document this limitation, and add better validations so that the operator rejects the configuration instead of bouncing the processes in a loop. There are likely other parameters that have this behavior as well. We can consider allowing a custom data dir parameter, but it will probably be best handled through an explicit field in the cluster spec. Though I don’t know the details of your environment and your requirements, using a hostPath volume seems like a risky configuration to me. Based on the documentation in Kubernetes (https://kubernetes.io/docs/concepts/storage/volumes/#hostpath), it seems like a bad fit for how FDB uses its volume. We do support using custom storage classes for the PVC, so if you can model your volumes through PVs with a defined storage class, that should work as-is.

I think the attempts to change the database configuration are a false positive caused by the cluster being unhealthy due to the repeated bounces. I was able to reproduce this locally, and confirmed that it has missing values in the cluster status that causes it to think the database configuration is empty. I think it’s correct for the operator to hit a problem there, since it can’t determine if it needs to change the database configuration in that state. We may want to have it pass the change_database_configuration step in this state, to allow it to take other reconciliation steps, but have it hit a failure in update_status so that we know that reconciliation may not be complete.

Thanks, that makes sense, and the intention was not to go into production with hostpath volumes. This was deemed the quickest way to estimate performance of FDB on local SSDs and compare it to non-local persistent volumes. Modelling the local SSDs as PVs should be possible, and something we need to do if we decide to use local SSDs in production.

Should I create an issue on verifying that this property is not set in the cluster definition?

Yeah, if you don’t mind creating an issue that would be great!

Created https://github.com/FoundationDB/fdb-kubernetes-operator/issues/214