During a migration from FDB 6 to FDB 7 (which also included a storage engine migration to Redwood), the FDB Kubernetes operator for some of our clusters got stuck, repeatedly reporting the following error:
```
fdb-kubernetes-operator-controller-manager-85f867b4c-czckp manager {"level":"error",
"ts":1688475467.4898155,
"logger":"fdbclient",
"msg":"Error from FDB command",
"namespace":"timeseries",
"cluster":"foundationdb-cluster",
"code":1,
"stdout":"Unexpected error loading cluster file `/tmp/dcabf3f4-c179-4894-a554-cce8a9986a60': 1513 File could not be read\n",
"stderr":"",
"error":"exit status 1",
"stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ExcludeProcesses\n\t/workspace/fdbclient/admin_client.go:442\n
github.com/FoundationDB/fdb-kubernetes-operator/internal/statuschecks.CanSafelyRemoveFromStatus
\n\t/workspace/internal/statuschecks/status_checks.go:139
\ngithub.com/FoundationDB/fdb-kubernetes-operator/internal/removals.GetRemainingMap
\n\t/workspace/internal/removals/remove.go:146
\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.removeProcessGroups.reconcile
\n\t/workspace/controllers/remove_process_groups.go:61
\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile
\n\t/workspace/controllers/cluster_controller.go:183
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
```
Somehow the operator ended up in a situation where the connection string stored in /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60 was empty. Because of this, the adminClient kept failing to read the file and returned the same error over and over, unable to make any progress.
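For context, my reading of the stack trace (I have not traced every code path, so treat the function names below as made up) is that the cliAdminClient materializes cluster.Status.ConnectionString into a throwaway cluster file under /tmp and then points fdbcli at it with `-C`. A minimal sketch of that flow shows why an empty connection string turns into the 1513 loop:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// writeTempClusterFile mimics what the operator appears to do before each
// fdbcli call: dump the connection string from the cluster status into a
// short-lived file under the OS temp directory.
func writeTempClusterFile(connectionString string) (string, error) {
	f, err := os.CreateTemp("", "fdb-cluster-file-")
	if err != nil {
		return "", err
	}
	defer f.Close()

	if _, err := f.WriteString(connectionString); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	// With cluster.Status.ConnectionString == "" the file is created but
	// empty, so every invocation fails the same way and the reconcile
	// loop never makes progress.
	path, err := writeTempClusterFile("")
	if err != nil {
		fmt.Println("could not write cluster file:", err)
		return
	}

	// Requires fdbcli on PATH; with the empty file this reports the same
	// "Unexpected error loading cluster file ... 1513 File could not be read"
	// that shows up in the operator log above.
	out, cmdErr := exec.Command("fdbcli", "-C", path, "--exec", "status minimal").CombinedOutput()
	fmt.Printf("stdout/stderr: %s\nerr: %v\n", out, cmdErr)
}
```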
All the processes in the FoundationDBCluster CRD status had the missingProcesses and incorrectConfigMap conditions, and the process causing the blockage had the removalTimestamp set.
Excluding that process ourselves and then editing the cluster status to mark it as excluded (by setting its exclusionTimestamp) unlocked the operator.
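For anyone who hits the same state, the sketch below shows roughly what the manual status edit amounted to. The processGroup type is a stand-in for the operator's real ProcessGroupStatus (which, as far as I can tell, stores these fields as metav1.Time pointers in the v1beta2 API); in practice we ran the exclude from fdbcli and edited the FoundationDBCluster status by hand.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// processGroup is a stand-in for the relevant subset of the operator's
// ProcessGroupStatus; it is not the real type.
type processGroup struct {
	ProcessGroupID     string
	RemovalTimestamp   *metav1.Time
	ExclusionTimestamp *metav1.Time
}

// markExcluded records an exclusion timestamp on every process group that is
// already marked for removal but not yet marked as excluded. This is the
// change we made by hand after excluding the process via fdbcli.
func markExcluded(groups []processGroup) {
	now := metav1.NewTime(time.Now())
	for i := range groups {
		if groups[i].RemovalTimestamp != nil && groups[i].ExclusionTimestamp == nil {
			groups[i].ExclusionTimestamp = &now
		}
	}
}

func main() {
	removal := metav1.NewTime(time.Now().Add(-time.Hour))
	groups := []processGroup{
		{ProcessGroupID: "storage-1", RemovalTimestamp: &removal}, // the blocked process group (hypothetical ID)
		{ProcessGroupID: "storage-2"},
	}
	markExcluded(groups)
	for _, pg := range groups {
		fmt.Printf("%s excluded=%v\n", pg.ProcessGroupID, pg.ExclusionTimestamp != nil)
	}
}
```

With the exclusion recorded in the status, the removal path presumably no longer needs a working cluster file to re-check the exclusion, which I assume is why this unblocked the reconciliation.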
The cluster status also had the HasIncorrectConfigMap flag set to true.
So at some point the operator set cluster.Status.ConnectionString to "" and was then unable to load it again, resulting in an empty cluster file under /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60.
This feels like a corner case, but if it rings any bells and a fix is warranted, I am happy to work on it.
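For what it's worth, the kind of guard I would start with is sketched below: validate the connection string before it is persisted to cluster.Status.ConnectionString (or written out as a temporary cluster file), so an empty value fails loudly where it is introduced instead of looping inside ExcludeProcesses. The function name and regex are mine; if the operator already has a connection string parser, reusing that would be better than this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// FDB connection strings look like "description:id@ip:port[,ip:port...]".
var connectionStringPattern = regexp.MustCompile(`^[A-Za-z0-9_]+:[A-Za-z0-9]+@\S+$`)

// validateConnectionString rejects empty or obviously malformed connection
// strings before they are stored in the status or written to a cluster file.
func validateConnectionString(s string) error {
	if s == "" {
		return fmt.Errorf("connection string is empty")
	}
	if !connectionStringPattern.MatchString(s) {
		return fmt.Errorf("connection string %q does not look like description:id@addr,...", s)
	}
	return nil
}

func main() {
	fmt.Println(validateConnectionString(""))                                          // rejected: empty
	fmt.Println(validateConnectionString("foundationdb_cluster:abc123@10.0.0.1:4500")) // accepted: nil
}
```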