Error scaling down due to free space calculation

Hi
The Kubernetes operator shows this error, even though there is plenty of space:

ERROR: This exclude may cause the total free space in the cluster to drop below 10%.
Type 'exclude FORCE <ADDRESS>*' to exclude without checking free space.

This causes the operator to never reconcile the cluster.
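For context, and this is my own assumption: the message looks like the free-space check performed by fdbcli's exclude command, which the operator runs on my behalf. It can be reproduced outside the operator with something like the following (the cluster-file path and address are placeholders):

fdbcli -C /var/fdb/fdb.cluster --exec 'exclude 10.1.2.3:4501'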

kubectl get foundationdbclusters.apps.foundationdb.org reports
NAME                     GENERATION   RECONCILED   HEALTHY
foundationdb-cluster-1   3            2            true

I got here after running a restore into a 50-storage-node cluster (to make it go faster) and then setting the number of storage nodes back to 9. This triggered a lot of data movement, which eventually succeeded, but the now-empty storage nodes were never removed from the cluster.

It seems like the code assumes that every node holds as much data as the worst node. That is not true after the excluded nodes have been drained and their data moved to the remaining nodes.

The cluster is fine and healthy. I also tried killing all of the empty, excluded storage nodes, so they are gone now, and I re-included the killed pods, so the exclusion list is empty. But the problem persists: excluding even a single IP address gives the same 10% error.
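In case it helps anyone hitting the same thing, this is a sketch of how the exclusion list can be checked and cleared from fdbcli (the cluster-file path is a placeholder):

# Lists the currently excluded servers; prints nothing when the list is empty
fdbcli -C /var/fdb/fdb.cluster --exec 'exclude'
# Re-includes everything that was previously excluded
fdbcli -C /var/fdb/fdb.cluster --exec 'include all'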

This was a test cluster, so we have no important data on it, but it would be nice to avoid this in the future, or at least know a way to get around it.
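The only workaround I can think of, untested on my side so treat it as a sketch, is to run the exclusion manually with FORCE, which skips the free-space check exactly as the error message suggests; as far as I can tell the operator never issues FORCE on its own (address and path are placeholders):

fdbcli -C /var/fdb/fdb.cluster --exec 'exclude FORCE 10.1.2.3:4501'

Obviously that should only be done after confirming there really is enough space for the excluded process's data to move elsewhere.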

Edit: This is from fdbcli status:

fdbcli status

Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 5
Desired Proxies - 3
Desired Resolvers - 1
Desired Logs - 6
Usable Regions - 1

Cluster:
FoundationDB processes - 18
Zones - 18
Machines - 18
Memory availability - 14.6 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - 2 machines
Server time - 02/01/21 08:44:59

Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 1.007 TB
Disk space used - 4.947 TB

Operating space:
Storage server - 394.8 GB free on most full server
Log server - 1016.5 GB free on most full server

Workload:
Read rate - 2994 Hz
Write rate - 840 Hz
Transactions started - 1490 Hz
Transactions committed - 5 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 1
Running DRs - 0

Client time: 02/01/21 08:44:59

Edit 2: from status json, for what I believe is the most full storage node:
"kvstore_available_bytes" : 454969856000,
"kvstore_free_bytes" : 104119144448,
"kvstore_total_bytes" : 1082125373440,
"kvstore_used_bytes" : 977125918512,
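Some back-of-the-envelope arithmetic on those numbers (my own reading; I don't know exactly which fields the check uses):

used / total      = 977125918512 / 1082125373440  ≈ 90% used
free / total      = 104119144448 / 1082125373440  ≈ 9.6% free
available / total = 454969856000 / 1082125373440  ≈ 42% available

So if the check projects the whole cluster from this most-full server using kvstore_free_bytes, it would land under 10% no matter which address is excluded, which would match the behaviour I'm seeing.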

If writes are still being accepted, scaling the cluster up to 50 new nodes should let the cluster recover and successfully kill off nodes; then I’d be interested to know if scaling down to 25 works, and then down to 12. The GitHub issue "Improve logging when exclusion is failing because of free space issues" (FoundationDB/fdb-kubernetes-operator#350) seems related to this specific issue.
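If it helps, here is a minimal sketch of bumping the storage count through the operator, assuming the spec.processCounts.storage field (the cluster name matches the kubectl output above):

# Ask the operator for 50 storage processes again
kubectl patch foundationdbclusters.apps.foundationdb.org foundationdb-cluster-1 \
  --type merge -p '{"spec":{"processCounts":{"storage":50}}}'

The same command with a lower number is how the scale-down steps (50 to 25 to 12) would be expressed.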