Newly added storage nodes have disk usage at 98%

URGENT Help needed!

Our FDB cluster had 60 pods in DC1 and DC3. I added 10 more storage nodes (K8s pods) to DC3 yesterday.

Now the disk usage of the NEW nodes is very high, some at 98%:

xid2-storage-60-77c9b7bdf4-9tf9h 340G 217G 124G 64% /var/lib/foundationdb
xid2-storage-61-57589bbd49-8hclk 340G 323G 17G 96% /var/lib/foundationdb
xid2-storage-62-8b77c5985-2sxmt 340G 322G 19G 95% /var/lib/foundationdb
xid2-storage-63-66bd79475c-pjx68 340G 322G 19G 95% /var/lib/foundationdb
xid2-storage-64-5ffdb86985-tgfmd 340G 332G 8.0G 98% /var/lib/foundationdb
xid2-storage-65-ddbb4f649-xn95w 340G 323G 18G 95% /var/lib/foundationdb
xid2-storage-66-68664688bf-k6pc7 340G 332G 8.0G 98% /var/lib/foundationdb
xid2-storage-67-5f479cfd49-fblwr 340G 329G 11G 97% /var/lib/foundationdb
xid2-storage-68-56db7648b5-d94gb 340G 261G 80G 77% /var/lib/foundationdb
xid2-storage-69-74dcb9c8f-fdxxn 340G 332G 8.0G 98% /var/lib/foundationdb
xid2-storage-70-784cb9cb77-9cv8s 340G 260G 81G 77% /var/lib/foundationdb

I have restarted the data distributor process once. It doesn’t help.
What should I do to solve this? Thanks a lot.

How much disk space is used on your existing processes? A potentially good short-term remedy is to exclude the very full processes, of which it appears you have 8 in the list above at 95% or higher.
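
For reference, exclusion is done from fdbcli; here is a rough sketch with placeholder addresses (substitute the ones from your own status details). If I remember right, exclude waits until the data has been moved off those processes before it returns, and running exclude with no arguments lists the servers currently excluded:

  # exclude the two storage processes on one very full node (placeholder address)
  $ fdbcli --exec 'exclude 10.1.2.3:4500 10.1.2.3:4501'
  # with no arguments, exclude lists the servers that are currently excluded
  $ fdbcli --exec 'exclude'
  # watch data being moved away from the excluded processes
  $ fdbcli --exec 'status details'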

Disk usage on the old nodes is around 60-70%.

I don’t know what has gone wrong in the cluster, but if this has caused your cluster to lock down, I think your quickest path to a resolution is to exclude these processes. Normally I might also recommend adding more processes so you don’t fill up the remaining ones too much, but I’m not sure what to expect from that in your case.

What version are you running?

I’ll note that one other hammer you can use if needed is to migrate your cluster to an entirely new set of processes. That would entail adding an entirely new set of processes to your cluster and excluding all of the existing ones. Often this can resolve issues that are tough to work around otherwise, but without knowing what has happened in your cluster I couldn’t say for sure whether you’d run into this problem again.

Newer versions of FoundationDB have made improvements to data distribution that are intended to help things stay better balanced, and in particular to prevent fuller processes from continuing to take on excess data, so it’s possible an upgrade could help too, depending on your version.

We are using v6.2.11.

I excluded 8 storage processes (4 nodes) that had errors in “status details”.

Ok, there are a number of improvements related to data distribution, including better protection of processes with lower disk space, that were introduced in later patch releases in 6.2. If it’s feasible and something you’re willing to try, you could consider upgrading to the most recent 6.2 release once your cluster is healthy again.

The cluster is stuck now. The “Data” status section is like this:

Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 32.869 TB

The cluster has a 2-region, 3-DC architecture. I am wondering whether I can break the connection between the 2 regions (change priority and usable_regions): let Region 1, consisting of DC1 and DC2, be usable, but disable Region 2 (consisting of DC3). Will this bring the cluster back into a usable state?

Afterwards, I’ll set usable_regions back to 2 and let it do a full replication from the first region to the second.
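
Concretely, I’m thinking of something like this in fdbcli (a rough sketch; regions.json here is a placeholder for our actual region configuration file, edited so that Region 1 has the higher priority):

  # make sure Region 1 (DC1 + DC2) has the higher priority, so it is the one that stays primary
  $ fdbcli --exec 'fileconfigure regions.json'
  # drop to a single usable region; the cluster then only needs Region 1 to be available
  $ fdbcli --exec 'configure usable_regions=1'
  # later, once the cluster is healthy again, bring the second region back and let it re-replicate
  $ fdbcli --exec 'configure usable_regions=2'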

What do you think?

The excluded pods still show up in status details:

  <ip1_hidden>:4500:tls (  5% cpu;  5% machine; 0.001 Gbps;  0% disk IO; 3.2 GB / 130.7 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 13:48:46 2020
  <ip1_hidden>:4501:tls (  5% cpu;  5% machine; 0.001 Gbps;  0% disk IO; 3.2 GB / 130.7 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 13:48:59 2020
  <ip2_hidden>:4500:tls (  4% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.0 GB / 133.4 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 14:30:33 2020
  <ip2_hidden>:4501:tls (  4% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.0 GB / 133.4 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 14:30:27 2020
  <ip3_hidden>:4500:tls (  6% cpu;  3% machine; 0.002 Gbps;  0% disk IO; 3.1 GB / 135.6 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 15:28:04 2020
  <ip3_hidden>:4501:tls (  6% cpu;  3% machine; 0.002 Gbps;  0% disk IO; 3.1 GB / 135.6 GB RAM  )
    Last logged error: StorageServerFailed: io_error at Wed Aug 12 15:28:06 2020

Is this normal? And how can we check whether the rebalancing is making progress, given that status is not showing details?

Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 32.869 TB

What do the trace logs say the IO error is? I don’t remember what happens if the disks get completely full, but if that has happened and these processes are solely responsible for at least some of your data, it may be necessary to delete other files on these disks or move the data to larger disks so that they can recover.
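
Something like this on the affected pods should surface the errors (a sketch; /var/log/foundationdb is the default log directory, so adjust to wherever your logdir points):

  # most recent error-severity events in the XML trace logs
  $ grep -h 'Severity="40"' /var/log/foundationdb/trace.*.xml | tail -n 20
  # or look specifically for the storage server failure events
  $ grep -hE 'StorageServerFailed|io_error' /var/log/foundationdb/trace.*.xml | tail -n 20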

Given that you have the 2-region setup though, you should have all of your data in the first region, and maybe dropping the second region as you suggest is a good way to go.
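
As for watching progress while the human-readable fields say "unknown", you could try the machine-readable status and see whether the data-movement fields are populated there (a sketch; jq is just for convenience, and the field paths are what I recall from the status json document):

  # dump the machine-readable status once
  $ fdbcli --exec 'status json' > status.json
  # data-movement counters (in-flight / queued bytes), when data distribution reports them
  $ jq '.cluster.data.moving_data' status.json
  # overall data state, e.g. healthy / healing / missing_data
  $ jq '.cluster.data.state' status.json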

I fixed the issue by removing nodes/pods that had higher than 95% disk usage. After the last such node was removed, FDB became healthy almost instantly.
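
For anyone hitting the same thing, this is roughly how I checked which pods were over the line (a sketch; it assumes kubectl access to the namespace and the xid2-storage-* pod naming shown above):

  # print disk usage of the FDB data volume for every storage pod
  for p in $(kubectl get pods -o name | grep xid2-storage); do
    echo "== ${p#pod/}"
    kubectl exec "${p#pod/}" -- df -h /var/lib/foundationdb | tail -n 1
  done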

It looks like FDB has an internal threshold of 95% disk usage. When that threshold is crossed by any node in the cluster, FDB “limits/suspends” itself and becomes unresponsive (in our case, the cluster was not usable by apps).

This is unexpected. It means any single node in the cluster can become a single point of failure for the cluster’s usability.

Can you check with the FDB dev team how the cluster is supposed to behave in such a case in v6.2.11? Has the behavior changed in a later patch or version?

Similar question: has the behavior of having some nodes getting really high disk usage changed since v6.2.11?

Thank you.

I would not look at that as a single point of failure; it is a safeguard. This isn’t just FDB: for any database, if a disk fills up completely, the database can get into a very bad state depending on exactly when the disk becomes full.

If any node has reached 95%, then the cluster is in a very bad state. The database has two options at this point:

  • Continue to accept requests until the disk is 100% full and risk getting into an unrecoverable state
  • Stop accepting client requests, causing an outage, but protect the database and give DBAs a chance to recover the cluster

FDB chooses the latter.

Ideally, DBAs should have alerts that fire well before disks reach 95% full, so this situation is avoided.
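
For example, even a trivial check wired into cron or your existing monitoring agent would do (a sketch; the 80% threshold here is an arbitrary example, well below FDB’s safeguard):

  #!/bin/sh
  # warn when the FDB data volume crosses an 80% usage threshold
  THRESHOLD=80
  USED=$(df --output=pcent /var/lib/foundationdb | tail -n 1 | tr -dc '0-9')
  if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "WARNING: /var/lib/foundationdb is ${USED}% full" >&2
  fi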


Another option is for the data distributor to take available disk space into account and move shards away from any node that is fuller than some accepted threshold. Systems like HDFS have this behavior.

Good point.

However, the data distributor should be doing that (balancing across all nodes) all the time, not just when nodes are very full. If we get to this point, then something is wrong with the data distributor and it didn’t make correct decisions. Fixes in 6.2.15-6.2.20 should help with that behavior.

I think one of the fixes was along the lines of what you suggested. IIRC, the fix was a scheme where, once a server crosses a threshold, the data distributor stops using teams containing that server as a destination. Evan would know the details better.

Yes, this is one of the fixes introduced since 6.2.11 that I was talking about; it avoids putting data on processes that are getting full and are also fuller than other processes.

I’m glad doing this worked. I think it’s a lighter weight variant of dropping the whole region, so it was a good choice. I think what happens in this case is that removing all of the processes may have (but not necessarily) eliminated all copies of some data in the region, in which case it should re-replicate from the other region.

If you were only running one region, a possible outcome of this procedure could have been that all copies of some data had been taken offline. In that case, status should at least report that fact to you, so you would know that you need to find a way to get at least some of the processes back.