Operator 1.28.1, FDB 7.1.43
We had an issue where a large number of excludes caused FDB to do critical recruitment of storage
roles on log machines, causing the log disks to quickly fill up and the cluster to stop working.
To get out of this, we:
- Turned off the operator
- Included all storage machines
- Excluded log machines
- Moved coordinators out of log machines
- Included the log machines again once exclusion had removed the storage role from them
This brought the cluster back to life.
In order to allow the operator to run again, we re-excluded all machines that were marked for removal by the operator, but in smaller batches, so that it did not trigger critical recruitment.
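The batching described above can be sketched roughly as follows. This is only an illustration of the idea, not what we actually ran: `batchAddresses` is a hypothetical helper, and the batch size is something you would have to pick for your own cluster. Each batch would be handed to the `exclude` command in `fdbcli`, waiting for it to finish before starting the next one.

```go
package main

// batchAddresses is a hypothetical helper that splits addrs into
// consecutive batches of at most size entries, preserving order.
// A size below 1 is treated as 1, so that a bad value can never
// result in one huge exclusion.
func batchAddresses(addrs []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(addrs); start += size {
		end := start + size
		if end > len(addrs) {
			end = len(addrs)
		}
		batches = append(batches, addrs[start:end])
	}
	return batches
}
```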
When exclusion was finished we deleted the PVCs, pods and services of the storage machines that were going away. This turned out to be a mistake.
When the operator came back, it recreated all the PVCs, pods and services. This was arguably correct, since the process groups were not marked as fully excluded in the CRD status (though a bit unnecessary when the PVCs were gone anyway).
All these process groups now had two IP addresses, the old one and the new one. When running excludes, the operator seems to have used the old IP and to have gotten a response from FDB that this machine was no longer part of the cluster. At the same time, the process with the new IP was still included in the cluster and was receiving shards of data.
Then the operator started removing process groups, and we had dips in replication level, as the machines were removed even though they had a storage process containing data.
This is probably a strange corner case, but I would assume it would be a safe extension to the operator to verify exclusion for all addresses in the case where a process group has multiple.
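To make the suggestion concrete, here is a minimal sketch of the check I have in mind. It is not the operator's actual code: `allAddressesExcluded` and the `isExcluded` callback are hypothetical names, and how the operator really asks FDB about an address is left out. The point is only that a process group would count as safe to remove when every one of its addresses passes the check, not just one of them.

```go
package main

// allAddressesExcluded reports whether every address in addresses is
// confirmed as excluded by isExcluded. Both names are hypothetical;
// isExcluded stands in for whatever query the operator runs against FDB.
// An empty address list is treated as "not safe", to stay conservative.
func allAddressesExcluded(addresses []string, isExcluded func(addr string) bool) bool {
	if len(addresses) == 0 {
		return false
	}
	for _, addr := range addresses {
		if !isExcluded(addr) {
			// At least one address may still be serving data.
			return false
		}
	}
	return true
}
```

In our case the old IP would have passed this check while the new IP would have failed it, so the process group would not have been removed.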