Storage servers 95% full - how to recover

I have an FDB cluster (version 6.2.19) with double replication factor consisting of 13 machines where each machine has 7 processes (total 98 processes). On each machine I am using 4 mount points (of different sizes). My processes are assigned to the different mount points in the following way:
[Screenshot: process assignment across the 4 mount points]
I was migrating data to the cluster and had moved 20.7TB when I noticed I was no longer able to write to (or read from) the cluster, and running status displayed the following messages:

Unable to start default priority transaction after 5 seconds
Unable to start batch priority transaction after 5 seconds
Unable to retrieve all status information
Performance limited by process: Storage server running out of space (approaching 5% limit)

I took a look at the disk utilisation of my machines (these machines are used only as part of the cluster) and saw the following distribution on 8 (out of 13) of my machines that had only storage processes:
[Screenshot: disk utilisation of the four mount points on a storage-only machine]

To make the cluster usable again, I tried adding another machine to the cluster and adding another process to Mount Point 4, hoping that the cluster would rebalance itself, but Moving Data stayed at 0.0GB. I also tried excluding the limiting processes, but that didn't trigger the data movement either. It seemed as though the cluster was "stuck", as even commands like setclass weren't reflecting any changes. Finally, I tried killing the data distributor process as mentioned here, but that only changed the values of "Sum of key-value sizes", "Moving Data" and "Replication Health" to unknown, and the cluster remained stuck.
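For reference, the management commands I ran in fdbcli looked roughly like this (the addresses are placeholders, not my real ones; the last one stands in for the process that status json listed with the data_distributor role):

    fdb> exclude 10.0.0.5:4500 10.0.0.5:4501
    fdb> setclass 10.0.0.6:4503 storage
    fdb> kill
    fdb> kill 10.0.0.7:4500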

My issue is similar to this one, but I can't use the solutions mentioned there: I don't have enough resources to add that many additional machines, and my cluster is already in double replication mode, so I can't lower the replication factor without risking data loss.

This is a critical issue for me and any suggestions on how to recover the cluster would be appreciated. Also, I have a few questions regarding how data is distributed on storage servers:

  • This post mentioned that the “data distribution process tries to keep roughly the same amount of data on every storage server”. Does this mean the same absolute amount of data is kept on each server, or the same percentage of each server’s available space?

  • The post linked above also mentioned that the performance degradation when any storage server reaches 95% of its capacity is intentional, because it makes recovery easier. Is it guaranteed that the cluster will always be recoverable in such scenarios?


You can try temporarily changing knob_min_available_space_ratio to less than 5% to help the cluster recover from the current situation (and then change it back once you have been able to free up sufficient disk space). But I think using non-uniform disk sizes might continue to result in such issues.
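For example (just a sketch; 0.01 is an arbitrary value below the default 5%, so pick what fits your situation and revert it afterwards), you could add the knob to the [fdbserver] section of foundationdb.conf on each host and restart the fdbserver processes:

    [fdbserver]
    knob_min_available_space_ratio = 0.01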

AFAIK, the same absolute amount of data is kept on each SS (irrespective of which mount point each SS is on). There might have been a few tweaks to this behavior, as hinted here.


How do you know that management commands like exclude and setclass didn’t work?
As far as I know, ratekeeper only blocks non-immediate transactions, i.e. the default and batch priority transactions that these messages refer to:

Unable to start default priority transaction after 5 seconds
Unable to start batch priority transaction after 5 seconds

However, exclude and setclass use IMMEDIATE priority instead…

Thanks a lot, making this change helped me recover my cluster.

That’s true; if the data distributor indeed aims to store the same absolute amount of data on each SS, then we’ll always be limited by our smallest disk. @ajbeamon @alloc @alexmiller Would any of you mind providing some more details here about data distribution so we can structure our cluster accordingly?

These commands didn’t seem to work because they didn’t result in the behaviour they usually do.

  • Running exclude usually results in a spike in Moving Data as the data on the excluded servers is moved to other servers, but Moving Data remained at 0.0GB even after ~30 minutes.

  • The changes I was making via setclass were not reflected in the status json.

It should be the case that data distribution tries to balance data such that each storage server is holding the same amount. This would explain why your mount point 4 with 1 process is using 40% and mount point 3 with 2 processes is using 80% of 2T disks. Once disks start to get full, there are mechanisms that seek to penalize moving data to those that are fuller, which might explain why mount point 1 isn’t as full as you might expect given this description. These mechanisms weren’t really designed to support non-uniform disk sizes, though, so you will be limited in the amount of space you can store by the process that has the least disk space available to it.
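To put rough numbers on that (just back-of-the-envelope, assuming every storage process ends up holding about the same amount of data, ~0.8 TB in this case):

    1 storage process  × ~0.8 TB ≈ 0.8 TB ≈ 40% of a 2 TB disk  (mount point 4)
    2 storage processes × ~0.8 TB ≈ 1.6 TB ≈ 80% of a 2 TB disk  (mount point 3)

So the process with the least disk behind it is the first to approach the 5% limit, and it effectively caps how much the whole cluster can store.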

I see. I think data movement still uses DEFAULT priority transactions; that’s why ratekeeper ends up blocking data movement as well.

We have an FDB cluster (version 6.2.27) with triple replication consisting of 12 machines. We ran into the “Performance limited by process: Storage server running out of space (approaching 5% limit)” warning, and now the cluster is in a stalled state. We have tried the following without success.

  1. Created 6 additional nodes (18 total) and successfully added them to the cluster, but it did not help much and the cluster remained stuck.
  2. Decreased replication from triple down to double. It still did not start rebalancing.
  3. Created an additional 12 nodes (30 total) and added them to the cluster. It remained stalled.
  4. Excluded the initial 18 nodes from the cluster. It remained stalled.
  5. Added 8 more nodes to the cluster, for a total of 38 (18 excluded), to make sure we had more new nodes than excluded ones. Data still has not moved.

Our next effort will be to set the knob knob_min_available_space_ratio to less than 5% in order to kick off cluster recovery. Is there anything else we should try to restore this cluster that we have not already attempted?
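Once we lower the knob and restart the fdbserver processes, we plan to confirm that rebalancing has actually started with something like:

    fdbcli --exec "status details"

watching for Moving data to climb above 0.0 GB and for the excluded processes to start draining.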

Modifying this knob did result in data moving and the cluster rebalancing.