During a migration from FDB 6 to FDB 7 (which also included a storage engine migration to Redwood), the FDB Kubernetes operator for some of our clusters got stuck, repeatedly reporting the following error:
```
fdb-kubernetes-operator-controller-manager-85f867b4c-czckp manager {"level":"error",
"ts":1688475467.4898155,
"logger":"fdbclient",
"msg":"Error from FDB command",
"namespace":"timeseries",
"cluster":"foundationdb-cluster",
"code":1,
"stdout":"Unexpected error loading cluster file `/tmp/dcabf3f4-c179-4894-a554-cce8a9986a60': 1513 File could not be read\n",
"stderr":"",
"error":"exit status 1",
"stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ExcludeProcesses\n\t/workspace/fdbclient/admin_client.go:442\n
github.com/FoundationDB/fdb-kubernetes-operator/internal/statuschecks.CanSafelyRemoveFromStatus
\n\t/workspace/internal/statuschecks/status_checks.go:139
\ngithub.com/FoundationDB/fdb-kubernetes-operator/internal/removals.GetRemainingMap
\n\t/workspace/internal/removals/remove.go:146
\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.removeProcessGroups.reconcile
\n\t/workspace/controllers/remove_process_groups.go:61
\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile
\n\t/workspace/controllers/cluster_controller.go:183
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273
\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
```
Somehow the operator ended up in a situation where the connection string stored in /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60 was empty. Because of this, the adminClient kept failing to read the file and returned the same error over and over, unable to make any progress.
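For context, my reading of the stack trace (I have not traced every code path, so treat the function names below as made up) is that the cliAdminClient materializes cluster.Status.ConnectionString into a throwaway cluster file under /tmp and then points fdbcli at it with `-C`. A minimal sketch of that flow shows why an empty connection string turns into the 1513 loop:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// writeTempClusterFile mimics what the operator appears to do before each
// fdbcli call: dump the connection string from the cluster status into a
// short-lived file under the OS temp directory.
func writeTempClusterFile(connectionString string) (string, error) {
	f, err := os.CreateTemp("", "fdb-cluster-file-")
	if err != nil {
		return "", err
	}
	defer f.Close()

	if _, err := f.WriteString(connectionString); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	// With cluster.Status.ConnectionString == "" the file is created but
	// empty, so every invocation fails the same way and the reconcile
	// loop never makes progress.
	path, err := writeTempClusterFile("")
	if err != nil {
		fmt.Println("could not write cluster file:", err)
		return
	}

	// Requires fdbcli on PATH; with the empty file this reports the same
	// "Unexpected error loading cluster file ... 1513 File could not be read"
	// that shows up in the operator log above.
	out, cmdErr := exec.Command("fdbcli", "-C", path, "--exec", "status minimal").CombinedOutput()
	fmt.Printf("stdout/stderr: %s\nerr: %v\n", out, cmdErr)
}
```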
All the processes in the FoundationDBCluster CRD status had the missingProcesses and incorrectConfigMap conditions, and the process causing the blockage had the removalTimestamp set.
Excluding that process ourselves and then editing the cluster status to mark it as excluded (by setting its exclusionTimestamp) unlocked the operator.
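For anyone who hits the same state, the sketch below shows roughly what the manual status edit amounted to. The processGroup type is a stand-in for the operator's real ProcessGroupStatus (which, as far as I can tell, stores these fields as metav1.Time pointers in the v1beta2 API); in practice we ran the exclude from fdbcli and edited the FoundationDBCluster status by hand.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// processGroup is a stand-in for the relevant subset of the operator's
// ProcessGroupStatus; it is not the real type.
type processGroup struct {
	ProcessGroupID     string
	RemovalTimestamp   *metav1.Time
	ExclusionTimestamp *metav1.Time
}

// markExcluded records an exclusion timestamp on every process group that is
// already marked for removal but not yet marked as excluded. This is the
// change we made by hand after excluding the process via fdbcli.
func markExcluded(groups []processGroup) {
	now := metav1.NewTime(time.Now())
	for i := range groups {
		if groups[i].RemovalTimestamp != nil && groups[i].ExclusionTimestamp == nil {
			groups[i].ExclusionTimestamp = &now
		}
	}
}

func main() {
	removal := metav1.NewTime(time.Now().Add(-time.Hour))
	groups := []processGroup{
		{ProcessGroupID: "storage-1", RemovalTimestamp: &removal}, // the blocked process group (hypothetical ID)
		{ProcessGroupID: "storage-2"},
	}
	markExcluded(groups)
	for _, pg := range groups {
		fmt.Printf("%s excluded=%v\n", pg.ProcessGroupID, pg.ExclusionTimestamp != nil)
	}
}
```

With the exclusion recorded in the status, the removal path presumably no longer needs a working cluster file to re-check the exclusion, which I assume is why this unblocked the reconciliation.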
The cluster status also had the HasIncorrectConfigMap flag set to true.
So at some point the operator set cluster.Status.ConnectionString to "" and was then unable to load it again, resulting in an empty cluster file under /tmp/dcabf3f4-c179-4894-a554-cce8a9986a60.
This feels like a corner case, but if it rings any bells and a fix is warranted, I am happy to work on it.
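For what it's worth, the kind of guard I would start with is sketched below: validate the connection string before it is persisted to cluster.Status.ConnectionString (or written out as a temporary cluster file), so an empty value fails loudly where it is introduced instead of looping inside ExcludeProcesses. The function name and regex are mine; if the operator already has a connection string parser, reusing that would be better than this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// FDB connection strings look like "description:id@ip:port[,ip:port...]".
var connectionStringPattern = regexp.MustCompile(`^[A-Za-z0-9_]+:[A-Za-z0-9]+@\S+$`)

// validateConnectionString rejects empty or obviously malformed connection
// strings before they are stored in the status or written to a cluster file.
func validateConnectionString(s string) error {
	if s == "" {
		return fmt.Errorf("connection string is empty")
	}
	if !connectionStringPattern.MatchString(s) {
		return fmt.Errorf("connection string %q does not look like description:id@addr,...", s)
	}
	return nil
}

func main() {
	fmt.Println(validateConnectionString(""))                                          // rejected: empty
	fmt.Println(validateConnectionString("foundationdb_cluster:abc123@10.0.0.1:4500")) // accepted: nil
}
```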