Process group stuck in ResourcesTerminating state

This happens when running the v1.0.0 operator. A pod has been deleted from the cluster, and the process group now looks like this:

                        "addresses": [
                        "exclusionTimestamp": "2022-03-28T11:29:34Z",
                        "processClass": "log",
                        "processGroupConditions": [
                                "timestamp": 1648466973,
                                "type": "ResourcesTerminating"
                        "processGroupID": "log-1",
                        "removalTimestamp": "2022-03-28T11:20:55Z"

The timestamp of the process group condition is the same as the exclusion timestamp. Neither the PVC nor the service has been deleted, and the cluster has been stuck in this state for a few hours.

The operator keeps logging "Waiting for volume claim to get torn down", so it seems it is trying to confirm that the process group is deleted, but it never actually removes the PVC and the service, so it doesn't move forward.

What is the state of the PVC and the underlying PV? The operator will delete all related resources (if they exist) and then wait and check until they have a deletionTimestamp (see fdb-kubernetes-operator/remove_process_groups.go at main on GitHub).

Just to clarify: the Pod was deleted, but the PVC and the Service were not? If the label config (fdb-kubernetes-operator/ at main on GitHub) didn't match, I would suspect that the operator is not able to delete any resource.

Could you validate that the service and the PVC have those labels (see fdb-kubernetes-operator/foundationdb_labels.go at main on GitHub)?

Yes, the pod has been deleted, but not the PVC or the service.
PVC labels:

                        foundationdb-cluster
                        log
                        log-1

Service labels:

                        foundationdb-cluster
                        log
                        log-1

If a process group is marked for removal and the pod is deleted externally, then the process group will be given the terminating status, right? (fdb-kubernetes-operator/update_status.go at 5023f43b5ede3b72183a50033d261090379d1d5b on GitHub)

In this case we will remove all other conditions (fdb-kubernetes-operator/update_status.go at 5023f43b5ede3b72183a50033d261090379d1d5b on GitHub).

When we later do zoned removal, the process group is placed in the terminating zone, which is skipped during zone deletion (fdb-kubernetes-operator/remove.go at 5023f43b5ede3b72183a50033d261090379d1d5b on GitHub).

Am I on the right track here?
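The zone-skipping behaviour described above can be modelled roughly like this (a toy sketch; the types, the `terminating` zone name handling, and the helper names are invented for illustration, not the operator's real code):

```go
package main

import "fmt"

// processGroup models the minimal state relevant here.
type processGroup struct {
	ID          string
	Terminating bool
}

// terminatingZone stands in for the special zone the operator uses for
// process groups whose resources are already considered to be torn down.
const terminatingZone = "terminating"

// groupByZone assigns terminating process groups to the special zone and
// everything else to its real fault-domain zone.
func groupByZone(zoneByID map[string]string, pgs []processGroup) map[string][]string {
	zones := map[string][]string{}
	for _, pg := range pgs {
		zone := zoneByID[pg.ID]
		if pg.Terminating {
			zone = terminatingZone
		}
		zones[zone] = append(zones[zone], pg.ID)
	}
	return zones
}

// zonesToDelete returns every zone except the terminating one, which is
// why a stuck ResourcesTerminating group never gets its PVC deleted.
func zonesToDelete(zones map[string][]string) []string {
	var result []string
	for zone := range zones {
		if zone == terminatingZone {
			continue
		}
		result = append(result, zone)
	}
	return result
}

func main() {
	zones := groupByZone(
		map[string]string{"log-1": "zone-a", "log-2": "zone-b"},
		[]processGroup{{ID: "log-1", Terminating: true}, {ID: "log-2"}},
	)
	fmt.Println(zonesToDelete(zones)) // only zone-b; log-1 is skipped
}
```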

Okay, that's an interesting bug. You have to delete the PVC and the service manually. The issue is that the process group is in the ResourcesTerminating state, and such process groups are skipped in the removal step; they are only validated once all resources are deleted. The simplest solution is to change fdb-kubernetes-operator/remove_process_groups.go (at main on GitHub) to append(processGroupsToRemove, terminatingProcessGroups...). To avoid issuing multiple deletions, we can check in the removeProcessGroup method whether the resource still exists and has no deletionTimestamp before triggering the deletion.
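A rough sketch of that idea (simplified types; the real removeProcessGroup works against the Kubernetes API, and `resourceState`/`needsDeletion` are invented here for illustration):

```go
package main

import "fmt"

// resourceState is a stand-in for what the operator observes about a
// PVC or Service during the removal pass.
type resourceState struct {
	Exists            bool
	DeletionTimestamp bool // true once a deletion has already been issued
}

// needsDeletion implements the suggested guard: only trigger a delete if
// the resource still exists and has no deletionTimestamp, so re-processing
// terminating process groups does not issue duplicate deletions.
func needsDeletion(r resourceState) bool {
	return r.Exists && !r.DeletionTimestamp
}

func main() {
	processGroupsToRemove := []string{"log-3"}
	terminatingProcessGroups := []string{"log-1"}
	// The proposed change: include terminating groups in the removal pass
	// instead of skipping them.
	processGroupsToRemove = append(processGroupsToRemove, terminatingProcessGroups...)
	fmt.Println(processGroupsToRemove) // [log-3 log-1]

	fmt.Println(needsDeletion(resourceState{Exists: true}))                          // true: issue the delete
	fmt.Println(needsDeletion(resourceState{Exists: true, DeletionTimestamp: true})) // false: already deleting
	fmt.Println(needsDeletion(resourceState{Exists: false}))                         // false: nothing left to do
}
```

With this guard, re-running the removal step on an already-terminating group is idempotent: the stuck PVC gets exactly one delete call, and nothing happens once the deletionTimestamp is set.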

You're on the right track. Feel free to provide a PR with the fix; otherwise I'll try to fix this issue tomorrow.

Thanks for reporting this issue!


BTW, is the way we set the terminating state in fdb-kubernetes-operator/update_status.go (at 5023f43b5ede3b72183a50033d261090379d1d5b on GitHub) something we should change as part of issue #970? It sets the state to terminating if the process group is marked for removal and the pod is missing or terminating, but we should probably count it as MissingPod if the process is not fully excluded in FDB.
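The proposed change could look roughly like this (a hypothetical sketch, not the operator's code; `chooseCondition` and its boolean inputs are invented, while the condition names match the thread):

```go
package main

import "fmt"

// chooseCondition sketches the behaviour proposed above: only treat a
// marked-for-removal group with a missing pod as terminating when the
// process is already fully excluded in FDB; otherwise report MissingPod
// so the exclusion problem stays visible.
func chooseCondition(markedForRemoval, podMissing, fullyExcluded bool) string {
	if markedForRemoval && podMissing {
		if fullyExcluded {
			return "ResourcesTerminating"
		}
		return "MissingPod"
	}
	return ""
}

func main() {
	fmt.Println(chooseCondition(true, true, true))  // ResourcesTerminating
	fmt.Println(chooseCondition(true, true, false)) // MissingPod
}
```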