Data Distribution Stopped - How to Restart?

I am doing resiliency tests. Two days ago I brought down 3 nodes and then added 3 nodes back in a batch. Both the removal and the addition ops worked fine. However, FDB got stuck when I repeated similar ops later.

Yesterday I brought down 4 nodes, and data were redistributed. The system was fine. Then today I added 4 nodes back in a batch using a script. Data distribution started. Over about 25 minutes, the 4 new nodes gained about 10GB of disk usage. Then FDB suddenly stopped moving data to the 4 new nodes. See chart below.

The other storage nodes have 210GB of disk usage on average. The 4 new nodes have only 10GB, and DD has not moved data for the last 4 hours.

What can cause this? How can I debug this situation? How do I restart DD?

Thank you.

Leo

What version are you running?

There is one known issue where DD rebalancing stops making progress until it gets restarted. See https://github.com/apple/foundationdb/issues/1884.

There are also various other fixes in 6.1 and the to-be-released 6.2 that could affect data balance.

My suggestion would be to first test whether restarting data distribution helps, by restarting the process running the data distributor role or, if you are on an older version, the process running the master role. If you want, you can also just restart every process in the cluster by executing something like kill; kill all; status in fdbcli.

Hopefully the result of restarting data distribution is that the cluster will make more progress towards being balanced, but if not then there may be another issue involved.
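To find which process to restart, one option is to read the machine-readable status (for example from fdbcli --exec "status json") and look for the process that reports the relevant role. The following is only a minimal sketch: it assumes the status document lists processes under cluster.processes, each with an address and a roles list, and that the roles are named "data_distributor" and "master". The function name and the sample input are made up for illustration.

```python
import json

# Role names as they are assumed to appear in `status json`. Older versions
# may not report a separate "data_distributor" role, in which case only
# "master" will be found.
DD_ROLES = ("data_distributor", "master")


def find_role_addresses(status, roles=DD_ROLES):
    """Return {role: [address, ...]} for each requested role."""
    found = {role: [] for role in roles}
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        for entry in proc.get("roles", []):
            name = entry.get("role")
            if name in found:
                found[name].append(proc.get("address"))
    return found


# Hand-written sample in the assumed shape, not real cluster output.
sample = json.loads("""
{"cluster": {"processes": {
  "a1": {"address": "10.0.0.1:4500", "roles": [{"role": "master"}]},
  "b2": {"address": "10.0.0.2:4500", "roles": [{"role": "storage"}]}
}}}
""")
print(find_role_addresses(sample))
```

The address it reports can then be given to kill in fdbcli (after running kill once with no arguments so that fdbcli loads the process list) to restart only that process instead of the whole cluster.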

Hi AJ, we are using v6.0.15. I located the master process and restarted it. I am happy to report it’s moving again!

I appreciate your prompt assistance. It’s a big help!

Leo

Oops, we encountered another issue with data distribution: unevenness among nodes.

I restarted DD at 15:45 yesterday, and it went on to move data for nearly 3 hours. At 18:45, Moving-data-in-flight was 0.

At 22:00 I saw DD stopped, but found the data distribution was still not even. I tried to restart DD multiple times, each time DD only moved a tiny bit and then stopped.
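One way to tell "DD has finished" apart from "DD is stuck" is to check the queued and in-flight byte counts rather than disk usage alone. A rough sketch follows, assuming the status json document exposes in_flight_bytes and in_queue_bytes under cluster.data.moving_data. The helper name and the sample values are hypothetical.

```python
def moving_data_summary(status):
    """Summarize data movement from a parsed `status json` document."""
    moving = status.get("cluster", {}).get("data", {}).get("moving_data", {})
    in_flight = moving.get("in_flight_bytes", 0)
    in_queue = moving.get("in_queue_bytes", 0)
    return {
        "in_flight_bytes": in_flight,
        "in_queue_bytes": in_queue,
        # "idle" only means nothing is queued or moving right now;
        # it does not mean the cluster is balanced.
        "idle": in_flight == 0 and in_queue == 0,
    }


# Hand-written sample in the assumed shape, not real cluster output.
sample_status = {
    "cluster": {"data": {"moving_data": {"in_flight_bytes": 0,
                                         "in_queue_bytes": 0}}}
}
print(moving_data_summary(sample_status))
```

If this reports idle while per-node disk usage is still far apart, DD has no pending moves at all, which is a different symptom from moves that are queued but not progressing.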

Jun and I noticed yesterday afternoon that the data distribution was not even among nodes, by calculating the disk usage percentage of each node. Last night he was able to construct a graph showing the historical evenness of data distribution among nodes from our Prometheus metrics data.

He described the graph as follows:
“You can clearly see that after data loading, we have very uniform distribution. Then in the middle of the week, rebalancing is also pretty good. After that, it starts getting worse and worse.”

It seems that data distribution unevenness worsened and fanned out as we removed and added nodes with our resilience tests.
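The evenness calculation can be sketched roughly as below: compute each node's disk usage percentage, then summarize the spread across nodes. This is not the exact query behind the graph. The node names, the 500GB capacity, and the usage figures are made-up illustration values, and the coefficient of variation is just one possible evenness measure.

```python
from statistics import mean, pstdev


def usage_evenness(nodes):
    """nodes maps node name -> (used_gb, capacity_gb)."""
    pct = {name: 100.0 * used / cap for name, (used, cap) in nodes.items()}
    values = list(pct.values())
    avg = mean(values)
    return {
        "percent_by_node": pct,
        # Gap between the fullest and emptiest node, in percentage points.
        "spread_points": max(values) - min(values),
        # Coefficient of variation: 0 means perfectly even.
        "cv": pstdev(values) / avg if avg else 0.0,
    }


# Hypothetical numbers, loosely shaped like "old nodes full, new nodes empty".
example_nodes = {
    "old-1": (210, 500),
    "old-2": (210, 500),
    "new-1": (10, 500),
}
report = usage_evenness(example_nodes)
print(report["spread_points"], round(report["cv"], 3))
```

Tracking the spread or the coefficient of variation over time gives a single number per sample, which is easier to alert on than comparing per-node curves by eye.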

Is this an issue with FDB v6.0.15, the version we are using? Has it been addressed or improved in a later version? Thanks.

Leo