Data Distribution Stopped - How to Restart?

I am doing resiliency tests. Two days ago I brought down 3 nodes and then added 3 nodes back in a batch. Both the removal and the addition ops worked fine. However, FDB got stuck when I repeated similar ops later.

Yesterday I brought down 4 nodes, and data were redistributed. The system was fine. Then today I added 4 nodes back in a batch using a script. Data distribution started. Over about 25 minutes, the 4 new nodes gained about 10GB of disk usage. Then FDB suddenly stopped moving data to the 4 new nodes. See the chart below.

Other storage nodes have 210GB of disk usage on average. The 4 new nodes only have about 10GB, and DD has not moved any data for the last 4 hours.

What can cause this? How can I debug the situation, and how do I restart DD?

Thank you.

Leo

What version are you running?

There is one known issue where DD rebalancing stops making progress until it gets restarted. See https://github.com/apple/foundationdb/issues/1884.

There are also various other fixes in 6.1 and the to-be-released 6.2 that could affect data balance.

My suggestion would be to first test whether restarting data distribution helps by restarting the process running the data distributor role or, if you are on an old version, the process running the master role. If you wanted, you could also just restart every process in the cluster by executing something like kill; kill all; status in fdbcli.
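As a rough sketch of what that could look like from the command line (this assumes jq is installed and that fdbcli can find your cluster file in its default location; the 10.0.0.1:4500 address below is only a placeholder):

    # Find the address of the process running the data distributor role
    # (or the master, if you are on an older version where data
    # distribution still runs there) from JSON status.
    fdbcli --exec 'status json' | \
      jq -r '.cluster.processes[]
             | select(any(.roles[]; .role == "data_distributor" or .role == "master"))
             | .address'

    # Restart just that process, substituting the address printed above.
    # ("kill" with no arguments populates the list of known addresses.)
    fdbcli --exec 'kill; kill 10.0.0.1:4500; status'

    # ...or simply bounce every process in the cluster.
    fdbcli --exec 'kill; kill all; status'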

Hopefully the result of restarting data distribution is that the cluster will make more progress towards being balanced, but if not then there may be another issue involved.

Hi AJ, we are using V6.0.15. I located the Master process and restarted it. I am happy to report it’s moving again!

I appreciate your prompt assistance. It’s a big help!

Leo

Oops, we encountered another issue with data distribution: unevenness among nodes.

I restarted DD at 15:45 yesterday, and it went on to move data for nearly 3 hours. At 18:45, Moving-data-in-flight was 0.

At 22:00 I saw that DD had stopped, but the data distribution was still not even. I tried restarting DD multiple times; each time DD only moved a tiny bit of data and then stopped.

Jun and I noticed yesterday afternoon, by calculating the disk usage percentage of each node, that the data distribution was not even among nodes. Last night he was able to construct a graph of the historical evenness of data distribution among nodes from our Prometheus metrics repository.

He described the graph as follows:
“You can clearly see that after data loading, we have a very uniform distribution. Then in the middle of the week, rebalancing is also pretty good. After that, it starts getting worse and worse.”

It seems that the data distribution unevenness worsened as we removed and added nodes during our resilience tests.

Is this an issue with FDB v6.0.15, which we are using? Has it been addressed in a later version? Thanks.

Leo

Over the weekend, FDB kept working and the cluster’s data distribution converged and became quite even, without any action from us. Then today, after we removed and added some nodes, the distribution diverged again. Here is the chart for the last 3 days.

This behavior is a big concern for us as we plan to go to production soon. Please advise on the following:

  • We don’t know how long it takes for DD to converge. Do you have prior data regarding the timing?
  • In the issue ticket #1884 mentioned above, “hgray1 added this to the 7.0 milestone on Jul 29”. That’s too far out for us. Do you have other suggestions for dealing with the problem?
  • What are the relevant DD changes in v6.1 and v6.2? When will v6.2 be out? Should we upgrade, and what difference would an upgrade make?

Thank you.

Leo

From the Y-axis, I’m guessing that this is a chart of disk fullness?

Part of data distribution working means that it’s going to shuffle data around; some of that will cause sqlite files to grow on disk and then be trimmed down over time. The amount of data each storage server is responsible for will converge much more quickly than that.

My vague memory is that we had a similar conversation about this before, but I can’t seem to find it on the forums. The changes that I recall discussing back then made it into 6.1. To quote from the 6.1 release notes:

  • Increase the rate that deleted pages are made available for reuse in the SQLite storage engine. Rename and add knobs to provide more control over this process. [6.1.3] (PR #1485)
  • SQLite page files now grow and shrink in chunks based on a knob which defaults to an effective chunk size of 100MB. [6.1.4] (PR #1482) (PR #1499)

So upgrading to 6.1 would also get your storage servers to clean up their disk space 10x faster. Alternatively, you could set --knob_cleaning_interval=0.1 to see if this helps now on 6.0, but 6.1 splits what this knob controls into two values and sets the default lazy cleanup rate higher, so running 6.1 would be preferable. Either should probably make this particular graph look better.

The graph you’ve shown doesn’t look like data distribution being stuck. If you see this stop making progress, and then immediately start to make progress if you restart the data distribution process, then that would be what this issue is about.

This issue also isn’t about trying to minimize how much the file sizes on existing hosts might increase when you add new servers. It’s known behavior that this happens, and it is done somewhat intentionally, as we prioritize hosting data on the most diverse set of servers over minimizing the total amount of data movement. (But work could probably be done in the future to do less movement, or to do it more carefully.)

There are release notes published for each release summarizing most changes. The release notes for 6.1 are published in the documentation, whereas for 6.2 they’re only in the code right now, as we haven’t marked a 6.2 patch release as stable and updated the website yet.

The value that data distribution is trying to balance, and which should converge more quickly, is the logical bytes stored on each storage server. That can be obtained from the trace logs by looking at the BytesStored field in the StorageMetrics event. It can also be obtained from JSON status by looking at cluster.processes[<id>].roles[<i>].stored_bytes for any role where cluster.processes[<id>].roles[<i>].role is storage.
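As a rough sketch of pulling those numbers (assuming jq is installed and that your trace logs are the default XML files under /var/log/foundationdb; adjust to your setup):

    # Logical bytes stored by each storage role, taken from JSON status.
    fdbcli --exec 'status json' | \
      jq -r '.cluster.processes[]
             | .address as $addr
             | .roles[]
             | select(.role == "storage")
             | "\($addr)\t\(.stored_bytes)"'

    # The same metric from the trace logs (BytesStored in StorageMetrics events).
    grep -h 'Type="StorageMetrics"' /var/log/foundationdb/trace.*.xml | \
      grep -o 'BytesStored="[0-9]*"'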


Hi Alex,
Thank you for the insight. Yes, the Y-axis is disk fullness in percentage.

How do I set knob_cleaning_interval in v6.0?

We may consider upgrading. Since v6.2 is now in pre-release, we may look at upgrading to v6.2 as well if the timing is right. Do you have an estimated time for its general release? If the release date changes in the future, I won’t hold you accountable. :slight_smile: Just for planning.

AJ,
We will try to pull the logical bytes into our dashboard. Thanks for the info.

Leo

In the FIRST data distribution chart above, the disk fullness was expressed as a decimal from 0 to 1. For example, 0.7 is 70%.

In the SECOND data distribution chart, we converted it to a true percentage, where 70 is 70%.

You pass --knob_cleaning_interval=0.1 on the command line to your fdbserver processes.
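If you would rather keep it in /etc/foundationdb/foundationdb.conf with the rest of your configuration, fdbmonitor passes any name = value entry in the [fdbserver] section to the fdbserver processes as --name=value, so adding a line like the one below to your existing section should have the same effect (a sketch assuming the standard Linux packages):

    [fdbserver]
    command = /usr/sbin/fdbserver
    ## ... your existing options (addresses, datadir, logdir, etc.) ...
    knob_cleaning_interval = 0.1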

O(weeks), and probably a small number of weeks. The last phases of internal validation of the release are being done before offering it for public download.

Alex, got it. Nice to know v6.2 is coming out soon. Thank you. – Leo

Hi Folks,

Just wanted to share our experience with --knob_cleaning_interval=0.1 over here at Segment. We’re currently running 6.0.18 and ran into similar issues earlier this year. We saw some cases of uneven distribution as well, but the biggest problem we were facing was the relatively long period (generally 6 days) it took to reclaim space on storage nodes after adding capacity to the cluster. I’m pleased to report the --knob_cleaning_interval change was very helpful.

Below is an example of adding 3 storage nodes on April 27th, with space reclamation taking well into May 3rd/4th.

[chart]

After adding --knob_cleaning_interval=0.1 we observed much faster reclamation of disk space.

Below is an example of 2 separate events where we added several storage nodes; reclaiming disk space across the storage tier now completes in hours vs. days.

[chart]

We add the flag directly in the /etc/foundationdb/foundationdb.conf file as follows:

command = /usr/sbin/fdbserver --knob_cleaning_interval=0.1

-Ray


Hi Ray, thanks for sharing your experience. It’s nice to know your solution worked! – Leo


Hiya Folks,

Following up on this, it appears we are also still hitting https://github.com/apple/foundationdb/issues/1884. Based on our observations from our previous cluster expansion (see above), we thought our latest configuration and knob changes had resolved our data distribution and cleanup issues. However, last week we expanded our cluster again, and we are observing slow data distribution (relative to previous cluster expansions). To recap: on the previous expansion we added --knob_cleaning_interval=0.1 and observed much quicker storage cleanup, which is great. For whatever reason, data distribution on that expansion also progressed and reached equilibrium very quickly. On that expansion we added 6 new storage instances in 2 batches of 3 nodes.

The following is a chart of BytesStored from the trace logs as we expanded the cluster; data distribution reached equilibrium across all storage nodes quite quickly.

[chart]

We can also observe the data in-flight from the trace logs during this period.

[chart]

As expected with the new cleaning interval setting, we recovered disk space very quickly.

[chart]

However, on this latest cluster expansion, reaching BytesStored equilibrium is taking weeks rather than hours. On 11/04/2019 we added 3 new storage nodes. From the 4th to the 5th, only about 2GB of keyspace was allocated to the new nodes, as visible in the BytesStored metric.

[chart]

On 11/5/2019 we restarted the fdbserver processes on the master node (per @ajbeamon’s suggestion above), and data distribution started moving more aggressively again. However, after 2 hours distribution slowed down again.

[chart]

[chart]

Today we tried restarting the master process again to see if we could trigger an increase in BytesStored across the new nodes, but it has had no effect.

We have noticed that during periods of increased write volume from our applications, data distribution rates increase, which is good. However, the unpredictability of distribution is somewhat problematic for us. Specifically, there are times when we would prefer to scale out and reach cluster storage equilibrium before we on-board new customers, as these customers can add significant read/write load to the system.

During periods of data balancing we observe an increase in latency, which can cause lag in processing, so we would prefer to scale out and reach equilibrium before we add more load to the cluster. Finally, we have observed cluster-wide performance drop drastically when a storage node becomes full, and we’d like to avoid that scenario. We could, of course, simply trust that the system will distribute data more aggressively when load is added, but that could cause back pressure and additional latency, resulting in degradation of our services, and we’d like to avoid that if possible.

It’s not clear to us from the above whether upgrading to 6.2 would provide any improvement, or whether those changes only relate to reclaiming disk space, which we’ve addressed with the knob_cleaning_interval change.

Is there anything we can do to help gather data for https://github.com/apple/foundationdb/issues/1884? cc: @rbranson