Storage server running out of space

I always show up with some fun issue. This time is no different.

After restoring the cluster, everything worked fine until two of our 14 servers went down at the same time. Normally this would be no issue; however, thanks to triple replication, the cluster got saturated and is now running out of space.

Because these are NVMe servers, they ran into an issue they can’t recover from (a GCP thing).

I thought, no problem, we’ll just add more servers and things will rebalance. However, that’s not what happened; now everything seems to have slowed almost to a halt. I added 10 more servers, which almost doubled the available storage, but the cluster is still reporting that one of the servers is running out of space.

I found this discussion that explains why, but not the solution. So what are my options here? How do I tell the cluster that it now has plenty of space, and get it to prioritize rebalancing?

I tried excluding the process, but it had no impact.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 3
  Exclusions             - 1 (type `exclude' for details)
  Desired Proxies        - 7
  Desired Resolvers      - 1
  Desired Logs           - 7

Cluster:
  FoundationDB processes - 72 (less 4 excluded; 0 with errors)
  Machines               - 18 (less 1 excluded)
  Memory availability    - 7.9 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 09/27/19 19:42:27

Data:
  Replication health     - HEALING: Only two replicas remain of some data
  Moving data            - 372.732 GB
  Sum of key-value sizes - 3.649 TB
  Disk space used        - 3.322 TB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 339.4 GB free on most full server

Workload:
  Read rate              - 2 Hz
  Write rate             - 0 Hz
  Transactions started   - 1 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.128.1.131:4502

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 09/27/19 19:42:17

In addition, one of the servers that died was a coordinator, and when I try to change the coordinators, nothing happens.

Is there a way to help the cluster understand that it now has plenty of space, so it starts rebalancing properly?

We love you for your reports, and apologize for your troubles. :slight_smile:

If you just add new machines without the exclude, data distribution tries to form teams of three machines, picking randomly from the new ones and the old ones, so adding machines to a nearly full cluster can actually make it go completely full.
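A toy model of that failure mode (the function and all numbers are made up purely for illustration, not FDB’s actual placement logic): each shard that needs re-replication lands on a random team of three machines, old and new alike, so shards keep hitting the nearly full old machines.

```python
import random

def fill_after_random_teams(old_free_gb, new_free_gb, shard_gb, n_shards, seed=0):
    """Place each re-replicated shard on a random team of 3 machines,
    mixing old and new, and return the least free space on any machine."""
    random.seed(seed)
    free = list(old_free_gb) + list(new_free_gb)  # free GB per machine
    for _ in range(n_shards):
        for m in random.sample(range(len(free)), 3):  # random team of three
            free[m] -= shard_gb
    return min(free)

# 12 nearly full old machines (5 GB free each) plus 10 empty new ones
# (500 GB each): random teams keep landing shards on the old machines
# and push them past full.
print(fill_after_random_teams([5] * 12, [500] * 10, shard_gb=1, n_shards=100))
```

With these numbers the old machines go negative (i.e. completely full) long before the new machines absorb any meaningful load, which is exactly the scenario described above.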

The largest hammer here is to exclude all of your current machines, attach an equal number of fresh processes, let data distribution copy from the old machines to the new ones, then MOVE THE COORDINATORS, and finally tear down the old machines.
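In fdbcli that procedure looks roughly like this (the addresses below are placeholders from the thread standing in for the full list of old machines; by default `exclude` waits until the listed servers no longer hold any data):

```
fdb> exclude 10.128.1.131 10.128.2.182 10.128.6.87
fdb> status details
fdb> coordinators auto
```

`coordinators auto` asks the cluster to pick a new coordinator set from the remaining (non-excluded) processes; once status shows the excludes complete and the coordinators moved, the old machines can be torn down.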

If you don’t have the excess machines to be able to do that, then you could reconfigure from triple replication down to double, let things calm down with the added machines, and then configure back to triple. You’ll need to “remove” data from the cluster somehow to free up space while it rebalances.
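The replication switch is just a pair of configuration changes in fdbcli (a sketch; return to triple only once status reports the data healthy again):

```
fdb> configure double
fdb> status
fdb> configure triple
```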

Data Distribution Stopped - How to Restart? has related graphs where you can see the on disk size of files jump when machines are added to the cluster, and a bit of discussion about it.

Ok, that’s interesting. So you’re saying: spin up new servers and exclude ALL the old servers, and that will help the cluster recover.

Any reason why I can’t move coordinators first?

Telling FDB to abandon the old servers makes it do the intelligent thing of moving data only from full servers to empty servers.

I’d be happy if, in the future, data distribution handled new servers in two phases:

  1. Redistribute data so that disk fullness is roughly equal
  2. Redistribute data so that storage teams are sufficiently randomly chosen.

Which would make FDB’s behavior in these sorts of situations better. Right now, it just goes straight to (2), so mass excluding is the way to force it to do (1).

No, you should be able to.

I did kind of skip that part of your question before. I think you can’t move them right now because coordinators need to write down a forwarding record in order to move, which they probably can’t do if the disk is 100% full. But I’m not sure; I also would have expected that they could be moved. AJ generally has better ideas on these sorts of things than I do.

Ratekeeper is supposed to clamp down entirely to prevent 100% disk fullness, so I’d also be curious how you got a disk 100% full.

It’s actually not 100% full. There is 5% still left.

Performance limited by process: Storage server running out of space (approaching 5% limit).

However, I somehow can’t change the coordinators (the command just hangs).

That would be great. This seems like quite an unfortunate way to do it.

Doing just that. Fortunately, we have an “unlimited” number of servers available in the cloud, and these get lost among the tens of thousands we already run. Hope it works out well.

It doesn’t seem to be doing much after the exclusion:

fdb> status

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file fdb.cluster.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.128.2.182:4500  (unreachable)
  10.128.4.254:4500  (reachable)
  10.128.6.87:4500  (reachable)

Unable to start default priority transaction after 5 seconds.

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 3
  Exclusions             - 12 (type exclude for details)
  Desired Proxies        - 7
  Desired Resolvers      - 1
  Desired Logs           - 7

Cluster:
  FoundationDB processes - 128 (less 48 excluded; 0 with errors)
  Machines               - 32 (less 12 excluded)
  Memory availability    - 7.9 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 09/27/19 20:19:23

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 7.895 TB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 355.1 GB free on most full server

Workload:
  Read rate              - 5 Hz
  Write rate             - 0 Hz
  Transactions started   - 5 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.128.1.131:4502

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 09/27/19 20:19

Will give it some time and see

I have tried adding a completely new bunch of servers and excluding all of the previous ones, but there is still no impact. Any other ideas?


WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.128.2.182:4500  (unreachable)
  10.128.4.254:4500  (reachable)
  10.128.6.87:4500  (reachable)

Unable to start default priority transaction after 5 seconds.

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 3
  Exclusions             - 33 (type exclude for details)
  Desired Proxies        - 7
  Desired Resolvers      - 1
  Desired Logs           - 7

Cluster:
  FoundationDB processes - 208 (less 128 excluded; 0 with errors)
  Machines               - 52 (less 32 excluded)
  Memory availability    - 7.9 GB per process on machine with least available
  Retransmissions rate   - 4 Hz
  Fault Tolerance        - 0 machines
  Server time            - 09/27/19 21:28:35

Data:
  Replication health     - HEALING: Only two replicas remain of some data
  Moving data            - 338.440 GB
  Sum of key-value sizes - 3.649 TB
  Disk space used        - 13.447 TB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 355.1 GB free on most full server

Workload:
  Read rate              - 832 Hz
  Write rate             - 3 Hz
  Transactions started   - 14 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.128.1.131:4502

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 09/27/19 21:28:25

Is moving data decreasing? The cluster will stay stuck for a little while until it manages to fully move shards off of the storage servers that are full. If moving data decreases, then data is moving, and you’re on your way to a healthy cluster. If not, then we wait for @ajbeamon or @john_brownlee to appear and give their better operational wisdom.
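One convenient way to watch that from a shell, assuming fdbcli is on the PATH (just a one-liner for monitoring, nothing official):

```
$ watch -n 60 "fdbcli --exec status | grep 'Moving data'"
```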

Yep, it actually just finally got unstuck and is now moving data at full speed. Took a while, though.

Thanks for your help. Maybe you guys could figure out a way to speed up this process?

Unlimited cloud servers to the rescue!

Data distribution speed has been a tricky topic. It’s been tuned down over time as data distribution was seen to impact client latencies. For example, see How to speed up balancing? for a report of data distribution causing degraded throughput. However, as you’ve noticed, there are times when you’d prefer data distribution to run as quickly as possible.

So the root issue here is that we have data distribution set to a static rate, and we need to turn it into a dynamic rate based on how much idle disk/cpu time the cluster has, and what the priority of the data distribution action is. Smoothing out a 5% data imbalance should be throttled down in the face of user traffic, but moving half of a full storage server to an empty one should be able to use all the resources of a storage server if ratekeeper has already blocked all client traffic to the cluster.
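The dynamic rate described above could look something like this sketch (every name and constant here is invented for illustration; this is not FoundationDB’s actual implementation):

```python
def dd_move_rate(base_rate_mbps, idle_disk_fraction, priority, client_blocked):
    """Sketch of a dynamic data-distribution rate.

    priority: 0 = cosmetic rebalance, 1 = restoring replication,
              2 = draining a full storage server.
    """
    if priority >= 2 and client_blocked:
        # Ratekeeper has already stopped client traffic, so there is
        # nothing left to protect: let the move use everything available.
        return base_rate_mbps * 10
    # Otherwise throttle in proportion to idle disk time, giving
    # higher-priority moves a larger share, with a small floor.
    share = idle_disk_fraction * (0.25 + 0.25 * priority)
    return base_rate_mbps * max(0.05, min(1.0, share))

# A 5% imbalance fix on a busy cluster crawls; a drain with clients
# locked out runs flat out.
print(dd_move_rate(100, 0.2, 0, False))   # throttled
print(dd_move_rate(100, 1.0, 2, True))    # full speed
```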

#1046 is the issue to follow for this. @fzhjon wrote Introduce priority to fetchKeys requests from data distribution #1791, which is the first step in this direction. I’m actually unsure how much more work is scheduled to happen on this before 7.0… @fzhjon?


That’s my thinking. It took 12 hours for us to drain and move the cluster before it became usable. That’s 12 hours of outage rather than whatever would have been possible if it had been running at full speed.

Anyway, appreciate the help!

Another thing that could be affecting the speed at which this recovers is that the data moved off of the full hosts has to be cleaned up and made available for reuse. Like with any other clear, the work to free up space on disk after moving data is deferred, and it’s possible that this could be a contributor to the delay. In 6.1, the rate that this cleanup happened was turned up significantly, while at the same time data movement speeds were reduced some. Out of curiosity, what version are you running?

I guess the text here is a bit misleading, but I believe you will see this message even as you hit and then pass the 5% threshold. Based on your status indicating that there is 0.0GB remaining (this number subtracts out the 5% minimum, but won’t go below 0), I think that suggests you have done so.
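My reading of how that number is derived (a guess from the status output, not the actual code): the 5% reserve is subtracted first, and the result is clamped at zero.

```python
def reported_free_gb(free_gb, total_gb, reserve_fraction=0.05):
    """Free space as status seems to report it: the 5% reserve is
    subtracted first, and the result never goes below zero."""
    return max(0.0, free_gb - reserve_fraction * total_gb)

# A 1 TB disk with 50 GB actually free sits exactly at the 5% reserve,
# so status shows 0.0 GB free.
print(reported_free_gb(50, 1000))   # prints 0.0
```

That would explain the report later in the thread of a server showing 0.0 GB while still having ~5% of its disk free.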

At 5%, the cluster should in theory completely lock down new transactions (at least at non-immediate priorities), but there could be other things still contributing to further disk space usage. For example, trace logs can still be written, and it is also possible to leak transactions if recoveries are happening on the cluster. Certainly any other non-FDB processes on the host would be able to write to the disk if any existed.

We could determine the actual amount of space available from the StorageMetrics trace event, which has fields KvStoreBytesAvailable and KvStoreBytesTotal. The ratio of these is what is used by ratekeeper to determine whether to slow down the transaction rate. These numbers won’t correspond exactly to numbers reported by the OS because the available bytes includes data within the files that is available for reuse to FoundationDB.
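Assuming XML trace files (the default), pulling that ratio out could look roughly like this; the event type and the KvStoreBytesAvailable/KvStoreBytesTotal fields come from the paragraph above, while the Machine attribute and the exact XML layout are assumptions on my part:

```python
import xml.etree.ElementTree as ET

def storage_space_ratios(trace_path):
    """Yield (machine, available/total) for every StorageMetrics event
    found in an XML trace file."""
    for _, elem in ET.iterparse(trace_path):
        if elem.tag == "Event" and elem.get("Type") == "StorageMetrics":
            avail = float(elem.get("KvStoreBytesAvailable"))
            total = float(elem.get("KvStoreBytesTotal"))
            yield elem.get("Machine"), avail / total
        elem.clear()  # keep memory bounded on large trace files
```

The resulting ratio is the same quantity ratekeeper uses to decide whether to slow down the transaction rate.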

There is currently no immediate work scheduled for this before 7.0, as far as I’m aware. I may end up being able to return to this should time permit, but I’m not sure it’s guaranteed.

I’m not sure it was intended that coordinators can’t be changed when ratekeeper is not allowing transactions to start at normal priority, but I was able to reproduce it in my own test cluster. I’ve submitted a PR which should resolve this particular issue: https://github.com/apple/foundationdb/pull/2191.

Sorry for the delay, I’ve been a bit busy.

6.1.8

Is this something internal?

I don’t know why it reported 0, because the server had 50 GB of space, which is ~5%.

It’s on a different hard drive.

We have 4 processes: 2 storage, 1 log, and 1 stateless.

Glad to hear it wasn’t a fluke and it’s something reproducible. It caused quite a lot of issues, considering that one of the servers that went down was a coordinator. Had we lost one more, the cluster would have been lost forever.

Yes, see for example this post which roughly describes the steps that occur when data is cleared from a process. The gist is that the actual clearing of key ranges doesn’t do much work initially but defers the cleanup to a background process.

The number reported subtracts out the 5% floor, so 0.0GB means you have 5% or less space remaining.
