I am running resiliency tests. Two days ago I brought down 3 nodes and then added 3 nodes back in a batch. Both the removal and the addition worked fine. However, FDB got stuck when I repeated similar operations later.
Yesterday I brought down 4 nodes, and the data was redistributed. The system was fine. Then today I added 4 nodes back in a batch using a script. Data distribution started, and over about 25 minutes the disk usage on each of the 4 new nodes grew by about 10GB. Then FDB suddenly stopped moving data to the 4 new nodes. See the chart below.
The other storage nodes have 210GB of disk usage on average. The 4 new nodes have only 10GB each, and data distribution (DD) has not moved any data to them for the last 4 hours.
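To make the imbalance concrete, here is a minimal sketch of how the lagging nodes could be flagged from per-node disk usage. Only the 210GB and 10GB figures come from my cluster; the node names, the count of 8 existing nodes, and the 50% threshold are assumptions made up for illustration.

```python
import statistics


def lagging_nodes(usage_gb, threshold=0.5):
    """Return the nodes whose disk usage is below threshold * median usage."""
    median = statistics.median(usage_gb.values())
    return sorted(name for name, gb in usage_gb.items() if gb < threshold * median)


# Hypothetical layout: 8 existing nodes at ~210GB, 4 new nodes at ~10GB.
usage_gb = {f"old-{i}": 210.0 for i in range(1, 9)}
usage_gb.update({f"new-{i}": 10.0 for i in range(1, 5)})

print(lagging_nodes(usage_gb))
```

The median is used instead of the mean so that the nearly empty new nodes do not pull the baseline down.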
What could cause this? How can I debug this situation? How do I restart DD?