Incomplete exclusion in FDB operator

Operator 1.28.1, FDB 7.1.43

We had an issue where a large number of excludes caused FDB to do critical recruitment of storage
roles on log machines, causing the log disks to quickly fill up and the cluster to stop working.

To get out of this, we:

  • Turned off the operator
  • Included all storage machines
  • Excluded log machines
  • Moved coordinators out of log machines
  • Included the log machines again once the exclusion had removed the storage roles from them

This brought the cluster back to life.

In order to allow the operator to run again, we re-excluded all machines that were marked for removal by the operator, but in smaller batches so that it did not trigger critical recruitment.
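To illustrate the batching idea, here is a minimal sketch that splits a list of addresses into small groups and builds one fdbcli `exclude` command line per group. The function names, the batch size and the addresses are hypothetical and only for illustration; what batch size is small enough to avoid critical recruitment will depend on the cluster.

```python
from typing import Iterator, List


def batches(addresses: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` addresses (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(addresses), size):
        yield addresses[start:start + size]


def exclude_commands(addresses: List[str], size: int) -> List[str]:
    """Build one fdbcli `exclude` command line per batch of addresses."""
    return ["exclude " + " ".join(batch) for batch in batches(addresses, size)]


if __name__ == "__main__":
    # Made-up addresses; in practice these would be the processes marked for removal.
    to_remove = ["10.0.0.1:4501", "10.0.0.2:4501", "10.0.0.3:4501"]
    for command in exclude_commands(to_remove, size=2):
        print(command)
```

Each command would be issued only after the previous exclusion has finished, so that only a small amount of data is being moved at any one time.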

When exclusion was finished we deleted the PVCs, pods and services of the storage machines that were going away. This turned out to be a mistake.

When the operator came back, it recreated all the PVCs, pods and services. This was arguably correct, since the process groups were not marked as fully excluded in the CRD status (though a bit unnecessary when the PVCs are gone anyway).

All these process groups now had two IP addresses, the old one and the new one. When running excludes, the operator seems to have used the old IP and to have gotten a response from FDB that this machine was no longer part of the cluster. At the same time, the process with the new IP was still included in the cluster and was receiving shards of data.

Then the operator started removing process groups, and we had dips in replication level, as the machines were removed even though they had a storage process containing data.

This is probably a strange corner case, but I would assume it would be a safe extension to the operator to verify exclusion for all addresses in the case where a process group has more than one.
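A minimal sketch of the check I have in mind, assuming we already know every address recorded for each process group and the set of addresses FDB has confirmed as safe to remove. This is not the operator's actual code; the function name, data shapes and example values are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def removable_groups(
    group_addresses: Dict[str, List[str]],
    confirmed_addresses: Iterable[str],
) -> Set[str]:
    """Return the process groups whose addresses are ALL confirmed safe to remove.

    A group with no known addresses is never considered removable.
    """
    confirmed = set(confirmed_addresses)
    return {
        group
        for group, addresses in group_addresses.items()
        if addresses and all(address in confirmed for address in addresses)
    }


if __name__ == "__main__":
    # Made-up example: storage-1 was recreated, so it has an old and a new IP.
    groups = {
        "storage-1": ["10.0.0.1", "10.0.1.1"],
        "storage-2": ["10.0.0.2"],
    }
    # Only the old IP of storage-1 has been confirmed; the new one still holds data.
    confirmed = ["10.0.0.1", "10.0.0.2"]
    print(removable_groups(groups, confirmed))
```

With a check like this, a process group with one confirmed address and one unconfirmed address would be held back instead of being removed.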

Thanks for your report. Would you mind opening an issue in the operator repo? We can discuss potential solutions there and link to it from any PR(s) we create for this.

One solution here is to use locality-based exclusion instead of IP-based exclusion: fdb-kubernetes-operator/docs/cluster_spec.md at main · FoundationDB/fdb-kubernetes-operator · GitHub (useLocalitiesForExclusion). Since your cluster is running on 7.1.43, and the required version is 7.1.42, you can make use of this feature.

I don’t have permission to open issues in the repository, as I am not a contributor.

I will look into using locality-based exclusion, thanks!

CC @ammolitor, could you help here? It would be nice to add @larshagen as a contributor so that he can create issues and PRs.

Sure. @larshagen, what is your GitHub ID?

My GitHub ID is larshagencognite

Just sent an invite. Thanks for your patience.

Thanks! It seems that the invite expired over the holidays. @ammolitor, could you resend it?

Just re-sent the invite.