FDB K8s Operator stuck after FDB 7 migration

During migrations from FDB 6 to FDB 7, which also included a storage engine migration to Redwood, the FDB K8s operator got stuck on some of our clusters, reporting the following error:

fdb-kubernetes-operator-controller-manager-85f867b4c-czckp manager {"level":"error",
"ts":1688475467.4898155,
"logger":"fdbclient",
"msg":"Error from FDB command",
"namespace":"timeseries",
"cluster":"foundationdb-cluster",
"code":1,
"stdout":"Unexpected error loading cluster file `/tmp/dcabf3f4-c179-4894-a554-cce8a9986a60': 1513 File could not be read\n",
"stderr":"",
"error":"exit status 1",
"stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ExcludeProcesses\n\t/workspace/fdbclient/admin_client.go:442\n
  github.com/FoundationDB/fdb-kubernetes-operator/internal/statuschecks.CanSafelyRemoveFromStatus
    \n\t/workspace/internal/statuschecks/status_checks.go:139
  \ngithub.com/FoundationDB/fdb-kubernetes-operator/internal/removals.GetRemainingMap
    \n\t/workspace/internal/removals/remove.go:146
  \ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.removeProcessGroups.reconcile
    \n\t/workspace/controllers/remove_process_groups.go:61
  \ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile
    \n\t/workspace/controllers/cluster_controller.go:183
  \nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    \n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121
  \nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    \n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320
  \nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    \n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273
  \nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    \n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}

Somehow the operator ended up in a situation where the connection string stored in /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60 was empty. Because of this, the adminClient kept failing to read the file and returned the same error again and again without being able to make progress.
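
To illustrate the failure mode, here is a minimal sketch of a sanity check on cluster file contents before handing them to a client. It is not the operator's code. The function name check_cluster_file is hypothetical, and the pattern assumes the usual description:id@host:port[,host:port...] shape of a connection string; it is a loose check, not the official grammar.

```python
import re

# Assumed shape of a connection string: description:id@address[,address...].
# This is a deliberately loose pattern for illustration, not the official grammar.
_CONNECTION_STRING = re.compile(r"^[A-Za-z0-9_]+:[A-Za-z0-9]+@[^,\s]+(,[^,\s]+)*$")


def check_cluster_file(contents: str) -> str:
    """Return the connection string, or raise ValueError if it is empty or malformed."""
    # Drop blank lines and lines starting with '#' (assumed to be comments).
    lines = [line.strip() for line in contents.splitlines()]
    lines = [line for line in lines if line and not line.startswith("#")]
    if not lines:
        raise ValueError("cluster file is empty")
    if len(lines) != 1 or not _CONNECTION_STRING.match(lines[0]):
        raise ValueError("cluster file does not contain a single connection string")
    return lines[0]
```

An empty file, like the one we found under /tmp, is rejected up front instead of surfacing later as error 1513.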

All the process groups in the FoundationDBCluster CRD status had the MissingProcesses and IncorrectConfigMap conditions. The process group causing the blockage had the removalTimestamp set.

What unblocked the operator was excluding the process marked for removal and then editing the cluster status to mark that process as excluded by setting its exclusionTimestamp.
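
The status edit can be sketched roughly as follows, assuming the status has been loaded into a Python dict. The helper name mark_excluded is hypothetical. It only covers the bookkeeping part: the process still has to be excluded in FDB first, and the modified status still has to be written back to the cluster resource.

```python
import copy
from datetime import datetime, timezone


def mark_excluded(status: dict, process_group_id: str, now: datetime) -> dict:
    """Return a copy of the status with exclusionTimestamp set on one process group."""
    updated = copy.deepcopy(status)  # leave the caller's dict untouched
    for group in updated.get("processGroups", []):
        if group.get("processGroupID") != process_group_id:
            continue
        # Only a process group that is already marked for removal should be touched.
        if "removalTimestamp" not in group:
            raise ValueError(f"{process_group_id} is not marked for removal")
        # Same UTC timestamp layout as the removalTimestamp values in the status.
        group["exclusionTimestamp"] = now.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
        return updated
    raise KeyError(process_group_id)
```

Returning a copy rather than mutating in place makes it easy to diff the old and new status before applying anything.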

The cluster status also had the HasIncorrectConfigMap flag set to true.

So at some point the operator set cluster.Status.ConnectionString to "" and then was not able to load it again, resulting in an empty cluster file at /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60.

This feels like a corner case, but if it rings any bells and requires a fix, I am happy to work on it.

What version of the operator are you using?

Oops, sorry, I should have mentioned that in my initial post. We are running v1.19.0.

This is the FoundationDB CRD status for one of the clusters that got stuck. In this case the Pod with the removal timestamp was a coordinator. So we had to manually change the coordinators, then manually exclude the process group and finally edit the status to include the exclusionTimestamp. After that the Operator resumed and worked fine.

    "status": {
                "configured": true,
                "databaseConfiguration": {
                    "log_routers": -1,
                    "redundancy_mode": "double",
                    "remote_logs": -1,
                    "storage_engine": "ssd-2",
                    "usable_regions": 1
                },
                "desiredProcessGroups": 11,
                "generations": {
                    "hasUnhealthyProcess": 9,
                    "missingDatabaseStatus": 9,
                    "needsConfigurationChange": 9,
                    "needsMonitorConfUpdate": 9,
                    "needsShrink": 9
                },
                "hasIncorrectConfigMap": true,
                "hasListenIPsForAllPods": true,
                "health": {},
                "imageTypes": [
                    "split"
                ],
                "locks": {},
                "maintenanceModeInfo": {},
                "processGroups": [
                    {
                        "addresses": [
                            "173.30.83.98"
                        ],
                        "processClass": "log",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688324961,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "log-1"
                    },
                    {
                        "addresses": [
                            "173.30.86.229"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-kjql4.c.mycluster.internal",
                        "processClass": "log",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "log-4",
                        "removalTimestamp": "2023-07-02T18:52:37Z"
                    },
                    {
                        "addresses": [
                            "173.30.230.161"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-9jqqq.c.mycluster.internal",
                        "processClass": "log",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "log-5"
                    },
                    {
                        "addresses": [
                            "173.30.92.207"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-d-zdnr2.c.mycluster.internal",
                        "processClass": "log",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "log-6"
                    },
                    {
                        "addresses": [
                            "173.30.207.142"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-b-fcphm.c.mycluster.internal",
                        "processClass": "stateless",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "stateless-4"
                    },
                    {
                        "addresses": [
                            "173.30.58.36"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-kjql4.c.mycluster.internal",
                        "processClass": "stateless",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "stateless-5"
                    },
                    {
                        "addresses": [
                            "173.30.70.248"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-htcbg.c.mycluster.internal",
                        "processClass": "stateless",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "stateless-6"
                    },
                    {
                        "addresses": [
                            "173.30.194.19"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-kjql4.c.mycluster.internal",
                        "processClass": "storage",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "storage-1"
                    },
                    {
                        "addresses": [
                            "173.30.130.166"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-nwqmv.c.mycluster.internal",
                        "processClass": "storage",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "storage-2"
                    },
                    {
                        "addresses": [
                            "173.30.25.160"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-c-htcbg.c.mycluster.internal",
                        "processClass": "storage",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "storage-3"
                    },
                    {
                        "addresses": [
                            "173.30.125.125"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-b-tk84z.c.mycluster.internal",
                        "processClass": "storage",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316523,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "storage-4"
                    },
                    {
                        "addresses": [
                            "173.30.46.248"
                        ],
                        "faultDomain": "mycluster-zzhfw-worker-b-fcphm.c.mycluster.internal",
                        "processClass": "storage",
                        "processGroupConditions": [
                            {
                                "timestamp": 1688316616,
                                "type": "MissingProcesses"
                            },
                            {
                                "timestamp": 1688324962,
                                "type": "IncorrectConfigMap"
                            }
                        ],
                        "processGroupID": "storage-5"
                    }
                ],
                "requiredAddresses": {
                    "tls": true
                },
                "runningVersion": "7.1.27",
                "storageServersPerDisk": [
                    1
                ]
            }
        }
    ],
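
For anyone hitting the same state, a small sketch for spotting the blocking process group in a status like the one above. It assumes the status JSON has been parsed into a dict, and the function name pending_exclusions is hypothetical. It lists the process groups that have a removalTimestamp but no exclusionTimestamp yet.

```python
def pending_exclusions(status: dict) -> list:
    """List the IDs of process groups marked for removal but not yet marked as excluded."""
    return [
        group["processGroupID"]
        for group in status.get("processGroups", [])
        if "removalTimestamp" in group and "exclusionTimestamp" not in group
    ]


# Trimmed-down version of the status above, keeping only the relevant fields.
example_status = {
    "processGroups": [
        {"processGroupID": "log-1"},
        {"processGroupID": "log-4", "removalTimestamp": "2023-07-02T18:52:37Z"},
        {"processGroupID": "log-5"},
    ]
}
print(pending_exclusions(example_status))  # prints ['log-4']
```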

Could you update the operator to 1.20.0? It includes a bug fix for this behaviour: "Make sure we set the connection string if the status is cached" (Pull Request #1688 in FoundationDB/fdb-kubernetes-operator, by johscheuer). If you cannot update, you can also set --cache-database-status=false to prevent this bug.

And if you’re running with more than one storage server per Pod, you should consider upgrading to 1.20.1 directly: Release v1.20.1 of FoundationDB/fdb-kubernetes-operator.
