Error scaling down due to free space calculation

Hi
The Kubernetes operator shows this error, even though there is plenty of space:

ERROR: This exclude may cause the total free space in the cluster to drop below 10%.
Type 'exclude FORCE <ADDRESS>*' to exclude without checking free space.

This causes the operator to never reconcile the cluster.
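For context, and this is my own assumption: the message looks like the free-space check performed by fdbcli's exclude command, which the operator runs on my behalf. It can be reproduced outside the operator with something like the following (the cluster-file path and address are placeholders):

fdbcli -C /var/fdb/fdb.cluster --exec 'exclude 10.1.2.3:4501'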

kubectl get foundationdbclusters.apps.foundationdb.org reports
NAME                     GENERATION   RECONCILED   HEALTHY
foundationdb-cluster-1   3            2            true

I got here after running a restore into a 50-storage-node cluster (to make it go faster) and then setting the number of storage nodes back to 9. This triggered a lot of data movement, which eventually succeeded, but the now-empty storage nodes were never removed from the cluster.

It seems like the code assumes that every node holds as much data as the worst node. That is not true after the excluded nodes have been drained and their data moved to the remaining nodes.

The cluster is fine and healthy. I also tried killing all of the empty, excluded storage nodes, so they are gone now, and I re-included the killed pods, so the exclusion list is empty. But the problem persists: excluding even a single IP address gives the same 10% error.
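In case it helps anyone hitting the same thing, this is a sketch of how the exclusion list can be checked and cleared from fdbcli (the cluster-file path is a placeholder):

# Lists the currently excluded servers; prints nothing when the list is empty
fdbcli -C /var/fdb/fdb.cluster --exec 'exclude'
# Re-includes everything that was previously excluded
fdbcli -C /var/fdb/fdb.cluster --exec 'include all'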

This was a test cluster, so we have no important data on it, but it would be nice to avoid this in the future, or at least know a way to get around it.
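The only workaround I can think of, untested on my side so treat it as a sketch, is to run the exclusion manually with FORCE, which skips the free-space check exactly as the error message suggests; as far as I can tell the operator never issues FORCE on its own (address and path are placeholders):

fdbcli -C /var/fdb/fdb.cluster --exec 'exclude FORCE 10.1.2.3:4501'

Obviously that should only be done after confirming there really is enough space for the excluded process's data to move elsewhere.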

Edit: This is from fdbcli status:

fdbcli status

Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 5
Desired Proxies - 3
Desired Resolvers - 1
Desired Logs - 6
Usable Regions - 1

Cluster:
FoundationDB processes - 18
Zones - 18
Machines - 18
Memory availability - 14.6 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - 2 machines
Server time - 02/01/21 08:44:59

Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 1.007 TB
Disk space used - 4.947 TB

Operating space:
Storage server - 394.8 GB free on most full server
Log server - 1016.5 GB free on most full server

Workload:
Read rate - 2994 Hz
Write rate - 840 Hz
Transactions started - 1490 Hz
Transactions committed - 5 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 1
Running DRs - 0

Client time: 02/01/21 08:44:59

Edit 2: from status json, for what I believe is the most full storage node:
"kvstore_available_bytes" : 454969856000,
"kvstore_free_bytes" : 104119144448,
"kvstore_total_bytes" : 1082125373440,
"kvstore_used_bytes" : 977125918512,
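Some back-of-the-envelope arithmetic on those numbers (my own reading; I don't know exactly which fields the check uses):

used / total      = 977125918512 / 1082125373440  ≈ 90% used
free / total      = 104119144448 / 1082125373440  ≈ 9.6% free
available / total = 454969856000 / 1082125373440  ≈ 42% available

So if the check projects the whole cluster from this most-full server using kvstore_free_bytes, it would land under 10% no matter which address is excluded, which would match the behaviour I'm seeing.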

If writes are still being accepted, scaling the cluster up to 50 new nodes should let the cluster recover and successfully kill off nodes; then I’d be interested to know if scaling down to 25 works, and then down to 12. The GitHub issue "Improve logging when exclusion is failing because of free space issues" (FoundationDB/fdb-kubernetes-operator#350) seems related to this specific issue.
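If it helps, here is a minimal sketch of bumping the storage count through the operator, assuming the spec.processCounts.storage field (the cluster name matches the kubectl output above):

# Ask the operator for 50 storage processes again
kubectl patch foundationdbclusters.apps.foundationdb.org foundationdb-cluster-1 \
  --type merge -p '{"spec":{"processCounts":{"storage":50}}}'

The same command with a lower number is how the scale-down steps (50 to 25 to 12) would be expressed.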