FDB operator stuck without recreating pods

We have an FDB cluster that is down, and the FDB operator is not able to bring it back up.

FDB 7.1.43
Operator 1.28.1
Three FDB CRDs (preparing to turn on three-data-hall mode)

All the pods are gone right now (not the operator's fault), and the operator is not trying to recreate them. It is logging:

{"level":"info","ts":"2023-12-08T08:49:19Z","logger":"controller","msg":"Fetch machine-readable status for reconcilitation loop","namespace":"timeseries","cluster":"foundationdb-cluster-1","cacheStatus":true}
{"level":"info","ts":"2023-12-08T08:49:19Z","logger":"controller","msg":"Trying connection options","namespace":"timeseries","cluster":"foundationdb-cluster-1","connectionString":["foundationdb_cluster_1:18A6ZRfJm1dvn84C3Bug32Rpww4l0vSK@10.115.2.142:4501,10.115.10.194:4501,10.115.12.108:4501,10.115.12.213:4501,10.115.15.25:4501"]}
{"level":"info","ts":"2023-12-08T08:49:19Z","logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"timeseries","cluster":"foundationdb-cluster-1","connectionString":"foundationdb_cluster_1:18A6ZRfJm1dvn84C3Bug32Rpww4l0vSK@10.115.2.142:4501,10.115.10.194:4501,10.115.12.108:4501,10.115.12.213:4501,10.115.15.25:4501"}
{"level":"info","ts":"2023-12-08T08:49:19Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"timeseries","cluster":"foundationdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2023-12-08T08:49:59Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"timeseries","cluster":"foundationdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"error","ts":"2023-12-08T08:49:59Z","logger":"controller","msg":"Error getting connection string from cluster","namespace":"timeseries","cluster":"foundationdb-cluster-1","connectionString":"foundationdb_cluster_1:18A6ZRfJm1dvn84C3Bug32Rpww4l0vSK@10.115.2.142:4501,10.115.10.194:4501,10.115.12.108:4501,10.115.12.213:4501,10.115.15.25:4501","error":"FoundationDB error code 1031 (Operation aborted because the transaction timed out)","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.tryConnectionOptions\n\t/workspace/controllers/update_status.go:339\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).getStatusFromClusterOrDummyStatus\n\t/workspace/controllers/cluster_controller.go:462\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:137\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}
{"level":"info","ts":"2023-12-08T08:49:59Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"timeseries","cluster":"foundationdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-12-08T08:50:01Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"timeseries","cluster":"foundationdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-12-08T08:50:01Z","logger":"fdbclient","msg":"found client message(s) in the machine-readable status","namespace":"timeseries","cluster":"foundationdb-cluster-1","messages":[{"name":"quorum_not_reachable","description":"Unable to reach a quorum of coordinators."}]}
{"level":"info","ts":"2023-12-08T08:50:01Z","logger":"fdbclient","msg":"database is unavailable","namespace":"timeseries","cluster":"foundationdb-cluster-1"}
{"level":"info","ts":"2023-12-08T08:50:01Z","logger":"fdbclient","msg":"retry fetching status with fdbcli instead of using the client library","namespace":"timeseries","cluster":"foundationdb-cluster-1"}
{"level":"info","ts":"2023-12-08T08:50:01Z","logger":"fdbclient","msg":"Running command","namespace":"timeseries","cluster":"foundationdb-cluster-1","path":"/usr/bin/fdb/7.1/fdbcli","args":["/usr/bin/fdb/7.1/fdbcli","--exec","status json","-C","/tmp/e2676bac-6865-4eb6-8053-c7bbb7c723f9","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","40"]}
{"level":"info","ts":"2023-12-08T08:50:06Z","logger":"fdbclient","msg":"Command completed","namespace":"timeseries","cluster":"foundationdb-cluster-1","output":"{\n    \"client\" : {\n ..."}
{"level":"info","ts":"2023-12-08T08:50:06Z","logger":"fdbclient","msg":"found client message(s) in the machine-readable status","namespace":"timeseries","cluster":"foundationdb-cluster-1","messages":[{"name":"quorum_not_reachable","description":"Unable to reach a quorum of coordinators."}]}
{"level":"info","ts":"2023-12-08T08:50:06Z","logger":"fdbclient","msg":"database is unavailable","namespace":"timeseries","cluster":"foundationdb-cluster-1"}
{"level":"info","ts":"2023-12-08T08:50:06Z","logger":"controller","msg":"Reconciliation run finished","namespace":"timeseries","cluster":"foundationdb-cluster-1","duration_seconds":47.045196039,"cacheStatus":true}
{"level":"error","ts":"2023-12-08T08:50:06Z","msg":"Reconciler error","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"foundationdb-cluster-1","namespace":"timeseries"},"namespace":"timeseries","name":"foundationdb-cluster-1","reconcileID":"1788e843-b783-439e-a832-44ee24eb5cd9","error":"fdb timeout: database is unavailable","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}

I am experiencing the same issue with a multi_dc configuration.

The problem is that when the database is unavailable, the operator stops at this point in controllers/cluster_controller.go (main branch of FoundationDB/fdb-kubernetes-operator) and requeues the reconciliation right away, so it never reaches the step that creates the Pods again. I’ll take a look and fix the logic. Thanks for reporting!

I created issue #1931, "FDB operator stuck without recreating pods", in FoundationDB/fdb-kubernetes-operator on GitHub for this. Sorry for the delay.

I have opened a PR to try to address this issue. Any feedback is welcome!