Process group stuck in ResourcesTerminating state

This happens when running the v1.0.0 operator. A pod has been deleted from the cluster, and the process group now looks like this:

{
    "addresses": [
        "10.83.79.152"
    ],
    "exclusionTimestamp": "2022-03-28T11:29:34Z",
    "processClass": "log",
    "processGroupConditions": [
        {
            "timestamp": 1648466973,
            "type": "ResourcesTerminating"
        }
    ],
    "processGroupID": "log-1",
    "removalTimestamp": "2022-03-28T11:20:55Z"
}

The timestamp of the process group condition (1648466973, i.e. 2022-03-28T11:29:33Z) is within one second of the exclusion timestamp. Neither the PVC nor the service has been deleted, and the cluster has been stuck in this state for a few hours.

The operator keeps logging "Waiting for volume claim to get torn down", so it seems to be waiting to confirm that the process group has been deleted. However, it never actually removes the PVC and the service, so it doesn’t move forward.

What is the state of the PVC and the underlying PV? The operator will delete all related resources (if they exist) and then wait and check until they have a deletionTimestamp (see remove_process_groups.go at main in FoundationDB/fdb-kubernetes-operator on GitHub).

Just to clarify: the Pod was deleted, but the PVC and the Service were not? If the label config (see cluster_spec.md at main in FoundationDB/fdb-kubernetes-operator on GitHub) doesn’t match, I would suspect that the operator is not able to delete any resources.

Could you check whether the service and the PVC have the labels defined in foundationdb_labels.go (at main in FoundationDB/fdb-kubernetes-operator on GitHub)?

Yes, the pod has been deleted, but neither the PVC nor the service has been.
PVC labels:

labels:
    foundationdb.org/fdb-cluster-name: foundationdb-cluster
    foundationdb.org/fdb-process-class: log
    foundationdb.org/fdb-process-group-id: log-1

Service labels:

labels:
    foundationdb.org/fdb-cluster-name: foundationdb-cluster
    foundationdb.org/fdb-process-class: log
    foundationdb.org/fdb-process-group-id: log-1
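As a rough way to run the label check that was asked for, here is a small Go sketch. It is only an illustration: mismatchedLabels is a hypothetical helper, and the set of expected labels below is an assumption (cluster name plus process group ID); the labels the operator really matches on come from the cluster's label config.

```go
package main

import (
	"fmt"
	"sort"
)

// mismatchedLabels returns the keys in want whose value is absent or
// different in have. Hypothetical helper, not operator code.
func mismatchedLabels(want, have map[string]string) []string {
	var missing []string
	for key, value := range want {
		if got, ok := have[key]; !ok || got != value {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Labels as reported on the PVC and the service in this thread.
	resourceLabels := map[string]string{
		"foundationdb.org/fdb-cluster-name":     "foundationdb-cluster",
		"foundationdb.org/fdb-process-class":    "log",
		"foundationdb.org/fdb-process-group-id": "log-1",
	}
	// Assumed match labels; replace with the ones from your label config.
	want := map[string]string{
		"foundationdb.org/fdb-cluster-name":     "foundationdb-cluster",
		"foundationdb.org/fdb-process-group-id": "log-1",
	}
	fmt.Println(mismatchedLabels(want, resourceLabels))
}
```

An empty result means every expected label is present with the expected value, which would rule out a label mismatch as the cause.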

If a process group is marked for removal and the pod is deleted externally, then the process group will be given the terminating status (ResourcesTerminating), right? See update_status.go at 5023f43b5ede3b72183a50033d261090379d1d5b in FoundationDB/fdb-kubernetes-operator on GitHub.

In this case we will remove all other conditions (update_status.go at 5023f43b5ede3b72183a50033d261090379d1d5b, same repository).

When we later do zoned removal, the process group is placed in the terminating zone, which is skipped during zone deletion (remove.go at 5023f43b5ede3b72183a50033d261090379d1d5b, same repository).
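To make the suspected sequence concrete, here is a simplified Go sketch of it. This is not the operator's code: the processGroup type, the terminatingZone name, and both functions are assumptions invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// processGroup is a hypothetical, simplified stand-in for the operator's
// process group status; the field names are assumptions.
type processGroup struct {
	ID                   string
	Zone                 string
	MarkedForRemoval     bool
	ResourcesTerminating bool
}

// terminatingZone is an assumed bucket name, not the operator's constant.
const terminatingZone = "terminating"

// groupByZone buckets process groups marked for removal by zone. Groups that
// already carry the ResourcesTerminating condition land in terminatingZone.
func groupByZone(groups []processGroup) map[string][]string {
	zones := map[string][]string{}
	for _, pg := range groups {
		if !pg.MarkedForRemoval {
			continue
		}
		zone := pg.Zone
		if pg.ResourcesTerminating {
			zone = terminatingZone
		}
		zones[zone] = append(zones[zone], pg.ID)
	}
	return zones
}

// removable returns the IDs a zone-by-zone removal would act on when the
// terminating bucket is skipped.
func removable(zones map[string][]string) []string {
	var ids []string
	for zone, members := range zones {
		if zone == terminatingZone {
			continue
		}
		ids = append(ids, members...)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	groups := []processGroup{
		{ID: "log-1", Zone: "zone-a", MarkedForRemoval: true, ResourcesTerminating: true},
		{ID: "storage-2", Zone: "zone-b", MarkedForRemoval: true},
	}
	// log-1 is never selected, so its PVC and service are never deleted.
	fmt.Println(removable(groupByZone(groups)))
}
```

In this model log-1 never reaches the deletion step, which matches the PVC and service being left behind.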

Am I on the right track here?

Okay, that’s an interesting bug. For now you have to delete the PVC and the service manually. The issue is that the process group is in the ResourcesTerminating state. Process groups in that state are skipped in the removal step, and the operator only checks whether all of their resources have been deleted. The simplest solution is to change remove_process_groups.go (at main in FoundationDB/fdb-kubernetes-operator on GitHub) to use append(processGroupsToRemove, terminatingProcessGroups...). To avoid issuing multiple deletions, we can check in the removeProcessGroup method whether the resource still exists and has no deletionTimestamp before triggering the deletion.

You’re on the right track. Feel free to provide a PR with the fix; otherwise I will try to fix this issue tomorrow.

Thanks for reporting this issue!

BTW, is the way we set the terminating state in update_status.go (at 5023f43b5ede3b72183a50033d261090379d1d5b in FoundationDB/fdb-kubernetes-operator on GitHub) something we should change as part of issue #970? It sets the state to terminating if the process group is marked for removal and the pod is missing or terminating, but we should probably count it as MissingPod if the process is not fully excluded in FDB.
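Roughly, the suggested rule could look like the Go sketch below. The function name, its boolean parameters, and the handling of the case where the process group is not marked for removal are all assumptions for illustration; only the condition names MissingPod and ResourcesTerminating come from this thread.

```go
package main

import "fmt"

// conditionFor sketches the proposed rule. podGone means the pod is missing
// or terminating. Hypothetical helper, not operator code.
func conditionFor(markedForRemoval, podGone, fullyExcluded bool) string {
	if !podGone {
		return "" // pod is present: neither condition applies
	}
	if markedForRemoval && fullyExcluded {
		return "ResourcesTerminating"
	}
	// Pod is gone but the process is not (yet) fully excluded in FDB.
	return "MissingPod"
}

func main() {
	fmt.Println(conditionFor(true, true, false)) // marked for removal, not fully excluded
	fmt.Println(conditionFor(true, true, true))  // marked for removal, fully excluded
}
```

With this rule, a pod that disappears before exclusion completes would surface as MissingPod instead of being treated as already terminating.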