We are running the Kubernetes operator in GCP, and we seem to have gotten into an error loop in which the operator bounces all processes every 15 minutes or so.
From running fdbcli status json, it looks like the cluster is healthy, and I cannot tell from the logs what the error could be.
Any pointers on how to investigate further would be appreciated.
From the operator logs:
2020-04-29T10:40:07.679Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "kill; kill 10.36.57.2:4500 10.36.62.2:4500 10.36.93.2:4500 10.36.75.2:4500 10.36.74.3:4500 10.36.59.2:4500 10.36.73.3:4500 10.36.31.2:4500 10.36.10.3:4500 10.36.63.3:4500 10.36.58.3:4500 10.36.80.3:4500 10.36.71.3:4500 10.36.81.2:4500 10.36.84.3:4500 10.36.82.3:4500 10.36.76.2:4500 10.36.50.3:4500 10.36.64.2:4500 10.36.3.2:4500 10.36.77.3:4500 10.36.55.2:4500 10.36.49.3:4500 10.36.85.2:4500 10.36.51.2:4500 10.36.83.2:4500 10.36.65.3:4500; status", "-C", "/tmp/748823614", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:07.679Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738317"}, "reason": "BouncingInstances", "message": "Bouncing processes: [10.36.57.2:4500:tls 10.36.62.2:4500:tls 10.36.93.2:4500:tls 10.36.75.2:4500:tls 10.36.74.3:4500:tls 10.36.59.2:4500:tls 10.36.73.3:4500:tls 10.36.31.2:4500:tls 10.36.10.3:4500:tls 10.36.63.3:4500:tls 10.36.58.3:4500:tls 10.36.80.3:4500:tls 10.36.71.3:4500:tls 10.36.81.2:4500:tls 10.36.84.3:4500:tls 10.36.82.3:4500:tls 10.36.76.2:4500:tls 10.36.50.3:4500:tls 10.36.64.2:4500:tls 10.36.3.2:4500:tls 10.36.77.3:4500:tls 10.36.55.2:4500:tls 10.36.49.3:4500:tls 10.36.85.2:4500:tls 10.36.51.2:4500:tls 10.36.83.2:4500:tls 10.36.65.3:4500:tls]"}
2020-04-29T10:40:09.968Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": ">>> kill\n\nThe follow..."}
2020-04-29T10:40:09.971Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/897721376", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:11.993Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."}
2020-04-29T10:40:12.012Z INFO controller Cluster was not fully reconciled by reconciliation process
2020-04-29T10:40:12.013Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/042902098", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:14.039Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."}
2020-04-29T10:40:14.058Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:16.080Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."}
2020-04-29T10:40:16.080Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--version", "-C", "/tmp/876348297", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:16.084Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "FoundationDB CLI 6.2..."}
2020-04-29T10:40:16.960Z INFO controller Running command {"namespace": "default", "cluster": "timeseries-test-cluster", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "status json", "-C", "/tmp/833794900", "--timeout", "10", "--log", "--log-dir", "/var/log/fdb"]}
2020-04-29T10:40:18.984Z INFO controller Command completed {"namespace": "default", "cluster": "timeseries-test-cluster", "output": "{\n \"client\" : {\n ..."}
2020-04-29T10:40:18.984Z INFO controller Waiting for database to be healthy {"namespace": "default", "cluster": "timeseries-test-cluster"}
2020-04-29T10:40:18.985Z INFO controller Reconciliation terminated early {"namespace": "default", "name": "timeseries-test-cluster", "lastAction": "controllers.UpdateDatabaseConfiguration"}
2020-04-29T10:40:18.985Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"FoundationDBCluster","namespace":"default","name":"timeseries-test-cluster","uid":"360947b1-8483-11ea-84f5-42010a84009c","apiVersion":"apps.foundationdb.org/v1beta1","resourceVersion":"387738562"}, "reason": "NeedsConfigurationChange", "message": "Spec require configuration change to triple ssd-2 usable_regions=1 logs=6 proxies=3 resolvers=1 log_routers=-1 remote_logs=-1 regions=, but cluster is not healthy"}
2020-04-29T10:40:18.985Z INFO controller Requeuing reconciliation {"subReconciler": "controllers.UpdateDatabaseConfiguration", "namespace": "default", "cluster": "timeseries-test-cluster"}
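
Since the operator reports "cluster is not healthy" while status json looks fine to me by eye, one way to compare the two might be to pull out just the health-related fields right after a bounce. Below is a rough Python sketch of that idea. The field paths (client.database_status.available, client.database_status.healthy, cluster.data.state, cluster.messages) are my assumption about the status json layout, and summarize_health and get_path are hypothetical helper names. I do not know which of these fields the operator actually checks.

```python
import json


def get_path(doc, path, default=None):
    """Walk a dotted path through nested dicts, returning default if any key is missing."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def summarize_health(status_json_text):
    """Extract a few health-related fields from the text output of `status json`.

    The field paths are assumptions about the status json schema; a missing
    field is reported as None rather than raising an error.
    """
    status = json.loads(status_json_text)
    messages = get_path(status, "cluster.messages", [])
    return {
        "client_available": get_path(status, "client.database_status.available"),
        "client_healthy": get_path(status, "client.database_status.healthy"),
        "data_state": get_path(status, "cluster.data.state.name"),
        "data_healthy": get_path(status, "cluster.data.state.healthy"),
        # Each message is assumed to be an object with a "name" key.
        "messages": [m.get("name") for m in messages if isinstance(m, dict)],
    }
```

Passing the saved output of fdbcli --exec "status json" to summarize_health a few seconds after each bounce should show whether any of these fields briefly report an unhealthy state.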