2021-02-11T11:19:48.082Z INFO controller Running command {"namespace": "timeseries", "cluster": "foundationdb-cluster-1", "path": "/usr/bin/fdb/6.2/fdbcli", "args": ["/usr/bin/fdb/6.2/fdbcli", "--exec", "exclude 10.0.254.134 10.244.62.11 10.0.79.183 10.244.69.12 10.0.54.48 10.244.111.14 10.244.220.13 10.244.175.8 10.244.92.11 10.244.173.9 10.244.101.8 10.244.60.12 10.0.135.254 10.244.83.9 10.0.208.30 10.244.74.9 10.0.29.165 10.244.77.13 10.0.127.206 10.244.227.10 10.0.144.250 10.244.228.10 10.0.215.51 10.244.88.9 10.0.210.156 10.244.75.11 10.0.87.196 10.244.106.13 10.0.213.53 10.244.63.10 10.0.85.224 10.244.70.12 10.0.16.24 10.244.183.12 10.0.38.249 10.244.100.12 10.0.119.184 10.244.98.10 10.0.22.166 10.244.22.15 10.0.89.9 10.244.110.12 10.244.216.12 10.244.93.13 10.244.52.13 10.0.71.227 10.244.209.12 10.0.35.98 10.244.90.12 10.0.131.63 10.244.105.14 10.0.26.28 10.244.109.13 10.0.129.20 10.244.56.12 10.0.244.3 10.244.58.11 10.0.191.229 10.244.102.12 10.0.189.192 10.244.48.13 10.0.227.17 10.244.176.10 10.244.79.15 10.244.64.10 10.244.81.12 10.244.230.11 10.244.50.10 10.244.221.11 10.244.202.13 10.244.218.9 10.244.222.12 10.0.93.176 10.244.108.16 10.0.12.252 10.244.87.12 10.0.56.215 10.244.215.11 10.0.218.187 10.244.76.13 10.0.106.42 10.244.223.12 10.0.249.113 10.244.95.15 10.0.202.124 10.244.247.10 10.0.12.43 10.244.172.10 10.0.64.147 10.244.133.13", "-C", "/tmp/529281271", "--log", "--timeout", "10", "--log-dir", "/var/log/fdb"]}
2021-02-11T11:19:48.254Z ERROR controller Error from FDB command {"namespace": "timeseries", "cluster": "foundationdb-cluster-1", "code": 1, "stdout": "ERROR: This exclude may cause the total free space in the cluster to drop below 10%.\nType `exclude FORCE <ADDRESS>*' to exclude without checking free space.\n", "stderr": "", "error": "exit status 1"}
After updating the cluster spec to use service IP, a migration was triggered, creating a new set of pods and moving data over to the new pods. After this step the controller is stuck, because it can not exclude the old pods. The reason is that the pods are using roughly %50 of disk, so the disk space calculation very conservatively assumes that the old pods may not be excluded.
Would it be possible for the controller to exclude old pods in smaller batches, so that the free disk check succeeds?
The only workaround I can think of for now is to scale up until the old pods are a sufficiently small percentage of the overall cluster, but that option will not always work for us, as we have some environments that are more resource constrained.