Data loss during recovery from mass pod deletion during scale down

This happened recently in one of our test environments, with fdb operator version v0.41.1.

We were in the process of scaling a cluster down when an external incident caused all FDB pods to be deleted. When the operator brought the cluster back up, it did not bring back the pods whose data we were in the process of draining, and the corresponding PVCs were deleted.

This caused the cluster to come back up without any copies of some of its data, and since the PVCs were deleted it was not possible to recover.

Is this a known issue, and has it already been fixed in a newer version of the operator?

Hello,

We still have an issue open for recreating Pods if they are not fully excluded: Operator should recreate Pods if they are not fully excluded · Issue #970 · FoundationDB/fdb-kubernetes-operator · GitHub. The fix for this should be pretty trivial; without it, the operator won't recreate Pods that are marked for removal (which happens during a scale down). If you have some capacity, I'm happy to review a PR for that.

In addition to that, we made some changes in the operator that should prevent this from happening even without the fix above. We changed how resources are torn down; this is the interesting code part: fdb-kubernetes-operator/remove_process_groups.go at main · FoundationDB/fdb-kubernetes-operator · GitHub. The operator will only remove process groups (Pod + PVC + optional Service) if the cluster has the required fault tolerance; if not, the operator won't remove anything. In addition, the operator will by default tear down process groups per fault domain (zone), with a wait period of 60 seconds between those removals. If you want, you can change the wait time between removals as well as the removal mode.
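
To make the new behaviour more concrete, here is a minimal Go sketch of the flow just described. All names, types, and helpers in it are hypothetical placeholders for illustration only, not the operator's actual API:

// Sketch of the teardown behaviour described above; all names are placeholders.
package main

import (
	"log"
	"time"
)

type processGroup struct {
	ID   string
	Zone string // fault domain the Pod runs in
}

func removeMarkedProcessGroups(marked []processGroup, wait time.Duration) {
	// Gate 1: never delete anything while the cluster lacks the desired fault tolerance.
	if !hasDesiredFaultTolerance() {
		log.Println("fault tolerance not satisfied, skipping all removals")
		return
	}
	// Gate 2: remove one fault domain (zone) at a time, waiting in between.
	byZone := map[string][]processGroup{}
	for _, pg := range marked {
		byZone[pg.Zone] = append(byZone[pg.Zone], pg)
	}
	for zone, groups := range byZone {
		for _, pg := range groups {
			log.Printf("removing Pod, PVC and optional Service for %s", pg.ID)
		}
		log.Printf("zone %s removed, waiting %s before the next zone", zone, wait)
		time.Sleep(wait)
	}
}

func hasDesiredFaultTolerance() bool { return true } // placeholder for the real status check

func main() {
	removeMarkedProcessGroups([]processGroup{{ID: "storage-1", Zone: "zone-a"}}, time.Minute)
}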

In general I would suggest upgrading the operator to at least v0.51.1, which is the latest 0.x version, and then considering an upgrade to the operator 1.0 release.

Thank you for the quick reply!

The improvements in resource teardown are very good news, and we will work on getting up to date on our operator version.

Am I right in thinking that if we were on v0.51.1, the excluded pods would still not be recreated, but the PVCs would still be around, so we could bring the cluster back by reverting the storage pod count to what it was before the scale down? Would any other action be required?

We will also look into whether we can contribute a PR to solve issue #970.

Am I right in thinking that if we were on v0.51.1, the excluded pods would still not be recreated, but the PVCs would still be around, so we could bring the cluster back by reverting the storage pod count to what it was before the scale down? Would any other action be required?

That depends a little bit on the timing: if the operator reaches the point where it tries to delete all resources while the fault tolerance still looks good, it would have deleted the PVCs as well. In general the risk of running into this issue should be lower; the zone-by-zone deletion was added later and only landed in 1.0.0, which should ensure that at least some of the data is still available. I think reverting the storage pod count should be enough, but you might want to test this behaviour.

We have done some more testing, where we have repeatedly started scaling down a cluster, and then deleted all pods in the middle of the scale down.

  1. With operator version 0.41.1, the PVCs and services are deleted for the pods marked for removal.
  2. With operator version 0.51.1, the PVCs and services are not deleted, and we are able to bring the cluster back by disabling the operator, modifying the FDB CRD to revert the scale down and remove the marks for deletion, removing the FDB excludes with fdbcli, and re-enabling the operator.
  3. With operator version 0.41.1, we were able to get the data back when using a storage class with reclaim policy “Retain”. It is then possible to recreate the PVCs and services, and then follow the same steps as for 0.51.1 to make the operator bring the cluster back (at least this works in Azure; it will probably depend on the environment).

The actions we plan to take are:

  1. Upgrade the operator in all clusters, first to 0.51.1, then to 1.0.0
  2. Change our storage class to use reclaim policy “Retain”.
  3. Contribute a fix for issue #970

Thanks for the update!

We were a bit unsure what the correct fix for #970 is. On our private fork we are currently testing always recreating pods if we cannot verify that we are at the desired fault tolerance, that is, if we fail to read the current fault tolerance or if the value we get is below the desired one.

We were a bit unsure what the correct fix for #970 is. On our private fork we are currently testing always recreating pods if we cannot verify that we are at the desired fault tolerance, that is, if we fail to read the current fault tolerance or if the value we get is below the desired one.

I think there are two ways to implement that. One is the approach that you picked, only recreating Pods that are marked for removal when the desired fault tolerance is not met. Another approach would be to simply always recreate Pods when they are marked for removal but not fully excluded; that would be a check like this:

// Skip recreation only for process groups that are both marked for removal and fully excluded.
if processGroup.IsMarkedForRemoval() && processGroup.IsExcluded() {
    continue
}

This still gives a user a way to force the deletion of a process group by putting it on the removal list without exclusion (a process group that is marked to be removed without exclusion will be handled as if it were excluded). From an implementation perspective, always recreating Pods if the process group is not fully excluded should be pretty simple and probably only requires a single-line code change.
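
To make the resulting behaviour concrete, here is a small, self-contained sketch; the types and values are made up, and only the two-condition check mirrors the snippet above:

// Sketch of the recreation decision under the suggested check. The type and its
// fields are hypothetical stand-ins for the operator's real process group status.
package main

import "fmt"

type processGroup struct {
	markedForRemoval bool
	excluded         bool
}

func (p processGroup) IsMarkedForRemoval() bool { return p.markedForRemoval }
func (p processGroup) IsExcluded() bool         { return p.excluded }

func main() {
	groups := []processGroup{
		{markedForRemoval: false, excluded: false}, // normal process group: Pod gets recreated
		{markedForRemoval: true, excluded: false},  // mid-exclusion: Pod gets recreated (the fix)
		{markedForRemoval: true, excluded: true},   // fully excluded: safe to skip recreation
	}
	for _, pg := range groups {
		if pg.IsMarkedForRemoval() && pg.IsExcluded() {
			fmt.Println("skip: process group is fully excluded")
			continue
		}
		fmt.Println("recreate the Pod if it is missing")
	}
}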

Yes, that makes sense. I think what confused us is that we tried to copy in the same logic as for pod deletion, so checking IsMarkedForRemoval, then IsExcluded and then running exclude calls to FDB to check what can be safely removed.

BTW, is there a reason we have a separate subreconciler for excluding processes that are marked for removal but not yet excluded, but we also seem to have the same logic replicated in the removeProcessGroups subreconciler?

Yes, that makes sense. I think what confused us is that we tried to copy in the same logic as for pod deletion, so checking IsMarkedForRemoval, then IsExcluded and then running exclude calls to FDB to check what can be safely removed.

The CanSafelyRemove check is not required here since the IsExcluded check will cover that.

BTW, is there a reason we have a separate subreconciler for excluding processes that are marked for removal but not yet excluded, but we also seem to have the same logic replicated in the removeProcessGroups subreconciler?

The logic has some similarities, but they are doing slightly different things. For example, the exclude reconciler checks if we have enough running processes to actually exclude the processes that are marked for removal (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/controllers/exclude_processes.go#L127-L161); the idea is to prevent an exclude from bringing the cluster into a bad state. The remove process groups reconciler does a bunch of checks before actually removing any process groups, like checking if the fault tolerance is good, and will use the CanSafelyRemove method (https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/fdbclient/admin_client.go#L382-L433) to check if the process groups that are marked for removal are “fully” excluded. The list of process groups will then be removed according to the defined removal mode (the default is one fault domain at a time, with a 1 minute wait between those removals).
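
As a rough illustration of that exclude-side safety check: the real check in exclude_processes.go is per process class and more involved, so the function name and threshold logic below are assumptions, only meant to show the idea.

// Hypothetical sketch of the exclude reconciler's safety check described above:
// only exclude when enough processes remain afterwards to serve the workload.
package main

import "fmt"

func canExclude(runningProcesses, processesToExclude, desiredProcesses int) bool {
	// After excluding, we still want at least the desired number of processes
	// available, otherwise the exclude could degrade the cluster.
	return runningProcesses-processesToExclude >= desiredProcesses
}

func main() {
	fmt.Println(canExclude(7, 2, 5)) // true: 5 processes remain, matching the desired count
	fmt.Println(canExclude(6, 2, 5)) // false: only 4 would remain, so the exclude is deferred
}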

I hope that helps. In the future I plan to write down a design doc to improve this process.

Thanks for the explanation. My understanding so far:

  • excludeProcesses reconciler reads the removalTimestamp field and decides which processes to exclude in FDB. It then calls exclude, which will mark the processes as excluded in status.json
  • removeProcessGroups reconciler reads the excluded status from status.json, and runs exclude again to check if exclusion is finished. If exclusion is finished, it sets the excludedTimestamp and can start removing the process group (sketched below).
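
Here is a rough, self-contained sketch of that last step; the names canSafelyRemove and processGroupStatus, as well as the addresses, are made-up placeholders rather than the operator's actual types:

// Sketch: ask FDB which marked addresses are fully excluded, and only then
// record an exclusion timestamp so removal may proceed for those process groups.
package main

import (
	"fmt"
	"time"
)

type processGroupStatus struct {
	Address           string
	ExcludedTimestamp *time.Time
}

// canSafelyRemove stands in for the admin client call that re-runs the exclude
// for the marked addresses and reports which of them are fully excluded.
func canSafelyRemove(addresses []string) map[string]bool {
	return map[string]bool{"10.1.0.1:4500": true, "10.1.0.2:4500": false}
}

func main() {
	groups := []*processGroupStatus{{Address: "10.1.0.1:4500"}, {Address: "10.1.0.2:4500"}}
	fullyExcluded := canSafelyRemove([]string{"10.1.0.1:4500", "10.1.0.2:4500"})
	for _, pg := range groups {
		if fullyExcluded[pg.Address] && pg.ExcludedTimestamp == nil {
			now := time.Now()
			pg.ExcludedTimestamp = &now // exclusion finished; removal may proceed
		}
		fmt.Printf("%s ready for removal: %v\n", pg.Address, pg.ExcludedTimestamp != nil)
	}
}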

We have now tested out the proposed fix for #970 (using IsExcluded() instead of fault tolerance check) on our fork, and verified that it solves our failure scenario. We will make an upstream PR soon.

  • excludeProcesses reconciler reads the removalTimestamp field and decides which processes to exclude in FDB. It then calls exclude, which will mark the processes as excluded in status.json

That is correct (there is also some more documentation here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/technical_design.md#excludeprocesses), and there is also a check if the operator is allowed to exclude processes or if too many processes are missing.

  • removeProcessGroups reconciler reads the excluded status from status.json, and runs exclude again to check if exclusion is finished. If exclusion is finished, it sets the excludedTimestamp and can start removing the process group.

That’s also correct and here is some additional documentation: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/technical_design.md#removeprocessgroups

That’s a good reminder for me to review the docs I shared with you to check if they still represent the current state :slight_smile:

We have now tested out the proposed fix for #970 (using IsExcluded() instead of fault tolerance check) on our fork, and verified that it solves our failure scenario. We will make an upstream PR soon.

Awesome! Let me know if you need any help.

Here is the PR: Recreating pods that are not fully excluded. by aasmundeldhuset · Pull Request #1136 · FoundationDB/fdb-kubernetes-operator · GitHub