Incomplete exclusion in FDB operator

larshagen · December 14, 2023, 9:06am

Operator 1.28.1, FDB 7.1.43

We had an issue where a large number of excludes caused FDB to do critical recruitment of storage
roles on log machines, causing the log disks to quickly fill up and the cluster to stop working.

To get out of this, we:

1. Turned off the operator
2. Included all storage machines
3. Excluded the log machines
4. Moved the coordinators off the log machines
5. Re-included the log machines once the exclusion had removed their storage roles

This brought the cluster back to life.

To allow the operator to run again, we re-excluded all machines that the operator had marked for removal, but in smaller batches so that it did not trigger critical recruitment.
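
The batching described above could look roughly like the following. This is only a minimal sketch: the helper names, the addresses and the batch size are made up for illustration, and it only builds the `exclude` command strings for fdbcli without running anything. In practice one would presumably wait for each batch to finish before issuing the next.

```python
from typing import Iterator, List


def batched(addresses: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size addresses."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(addresses), batch_size):
        yield addresses[start:start + batch_size]


def exclude_commands(addresses: List[str], batch_size: int) -> List[str]:
    """Build one fdbcli `exclude` command string per batch (nothing is executed)."""
    return ["exclude " + " ".join(batch) for batch in batched(addresses, batch_size)]


# Hypothetical addresses, for illustration only.
pending = ["10.0.0.1:4501", "10.0.0.2:4501", "10.0.0.3:4501"]
for command in exclude_commands(pending, batch_size=2):
    print(command)
```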

When exclusion was finished we deleted the PVCs, pods and services of the storage machines that were going away. This turned out to be a mistake.

When the operator came back, it recreated all the PVCs, pods and services. This was arguably correct, since the process groups were not marked as fully excluded in the CRD status (though a bit unnecessary when the PVCs were gone anyway).

All these process groups now had two IP addresses, the old one and the new one. When running excludes, the operator seems to have used the old IP and to have gotten a response from FDB that this machine was no longer part of the cluster. At the same time, the process with the new IP was still included in the cluster and was receiving shards of data.

Then the operator started removing process groups, and we had dips in replication level, because the machines were removed even though they had a storage process containing data.

This is probably a strange corner case, but I would assume it would be a safe extension to the operator to verify exclusion for all addresses in the case where a process group has more than one.
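
To make the suggestion concrete, here is a minimal sketch of the check I have in mind. It is not the operator's actual code: the function names, the process group IDs and the idea of a "still serving" address set are assumptions made for illustration. A process group counts as removable only if none of its known addresses is still serving in the cluster.

```python
from typing import Dict, Iterable, List, Set


def fully_excluded(addresses: Iterable[str], still_serving: Set[str]) -> bool:
    """True only if the group has addresses and none of them is still serving."""
    known = list(addresses)
    if not known:
        # Nothing to verify against, so do not treat the group as excluded.
        return False
    return not any(address in still_serving for address in known)


def safe_to_remove(groups: Dict[str, List[str]], still_serving: Set[str]) -> List[str]:
    """Return the IDs of process groups whose every address has left the cluster."""
    return sorted(
        group_id
        for group_id, addresses in groups.items()
        if fully_excluded(addresses, still_serving)
    )


# Hypothetical data: storage-1 has an old and a new IP, and the new one still holds data.
groups = {
    "storage-1": ["10.0.0.1", "10.0.0.9"],
    "storage-2": ["10.0.0.2"],
}
still_serving = {"10.0.0.9"}
print(safe_to_remove(groups, still_serving))
```

Checking only the old address of storage-1 would report it as gone, which matches what we saw. Checking both addresses keeps it out of the removal list.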

johscheuer · December 14, 2023, 12:45pm

Thanks for your report. Would you mind opening an issue in the operator repo? We can discuss potential solutions there and link the issue in any PR(s) we have for it.

One solution here is to use locality-based exclusion instead of IP-based exclusion: see useLocalitiesForExclusion in docs/cluster_spec.md in the FoundationDB/fdb-kubernetes-operator repository on GitHub. Since your cluster is running on 7.1.43 and the required version is 7.1.42, you can make use of this feature.
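
As a rough sketch, enabling this in the cluster resource might look like the fragment below. The cluster name is made up, and placing the setting under automationOptions is an assumption that should be checked against docs/cluster_spec.md for the operator version in use.

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster   # hypothetical name
spec:
  version: 7.1.43
  automationOptions:
    # Assumed location of the setting; verify against docs/cluster_spec.md.
    useLocalitiesForExclusion: true
```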

larshagen · December 15, 2023, 1:35pm

I don’t have permission to open issues in the repository, as I am not a contributor.

I will look into using locality-based exclusion, thanks!

johscheuer · December 18, 2023, 2:00pm

CC @ammolitor, could you help here? It would be nice to add @larshagen as a contributor so that they can create issues and PRs.

ammolitor · December 18, 2023, 3:27pm

Sure. @larshagen, what is your GitHub ID?

larshagen · December 20, 2023, 10:51am

My GitHub ID is larshagencognite

ammolitor · December 20, 2023, 1:39pm

Just sent an invite. Thanks for your patience.

larshagen · January 2, 2024, 9:29am

Thanks! It seems that the invite expired over the holidays. @ammolitor, could you resend it?

ammolitor · January 2, 2024, 2:19pm

Just re-sent the invite.
