Following up on this, it appears we are also still hitting https://github.com/apple/foundationdb/issues/1884. Based on our observations from our previous cluster expansion (^^see above), we thought our latest configuration and knob changes had resolved our data distribution and cleanup issues. However, last week we expanded our cluster again, and we are observing slow data distribution (relative to previous cluster expansions).
To recap, on our previous cluster expansion we added the --knob_cleaning_interval=0.1 knob and observed much quicker storage cleanup, which is great! For whatever reason, on that expansion data distribution also progressed and reached equilibrium very quickly. On that previous expansion we added 6 new storage instances in 2 batches of 3 nodes.
The following is a chart of BytesStored from the trace logs as we expanded the cluster. Data distribution reached equilibrium across all storage nodes quite quickly.
The trace logs also show the data in flight during this period.
As expected with the new cleaning interval setting, we recovered disk space very quickly.
However, on this latest cluster expansion, reaching BytesStored equilibrium is taking weeks rather than hours. On 11/04/2019 we added 3 new storage nodes. From the 4th to the 5th, only about 2GB of keyspace was allocated to the new nodes, as shown by the BytesStored metric.
On 11/5/2019 we restarted the fdbserver processes on the master node (per @ajbeamon’s suggestion above) and data distribution started moving more aggressively again. However, after 2 hours distribution slowed down again.
Today we tried restarting the master process again to see if we could trigger an increase in BytesStored across the new nodes, but it has had no effect.
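For anyone who wants to compare numbers with us, the following is a rough sketch of how the BytesStored growth rate per storage process could be pulled from XML trace files. It is not the exact tooling behind our charts. It assumes the metric appears as a BytesStored attribute on StorageMetrics events alongside Time and Machine attributes; the sample lines, addresses, and function names are made up for illustration, and field names may differ between versions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample events; real trace files carry many more attributes.
SAMPLE_TRACE = """\
<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="0.0" Type="StorageMetrics" Machine="10.0.0.1:4500" BytesStored="500000000000" />
<Event Severity="10" Time="0.0" Type="StorageMetrics" Machine="10.0.0.4:4500" BytesStored="0" />
<Event Severity="10" Time="86400.0" Type="StorageMetrics" Machine="10.0.0.1:4500" BytesStored="499000000000" />
<Event Severity="10" Time="86400.0" Type="StorageMetrics" Machine="10.0.0.4:4500" BytesStored="2000000000" />
</Trace>
"""


def bytes_stored_series(lines, event_type="StorageMetrics"):
    """Return {machine: [(time, bytes_stored), ...]}, each list sorted by time."""
    series = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue  # skip the XML header and the <Trace> wrapper lines
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # e.g. a partially written final line
        if attrs.get("Type") != event_type or "BytesStored" not in attrs:
            continue
        # Assumption: if the value holds several space-separated fields,
        # the last one is the absolute number of bytes.
        stored = int(attrs["BytesStored"].split()[-1])
        machine = attrs.get("Machine", "unknown")
        series[machine].append((float(attrs["Time"]), stored))
    for points in series.values():
        points.sort()
    return dict(series)


def growth_per_hour(points):
    """Average change in BytesStored per hour between first and last sample."""
    (t0, b0), (t1, b1) = points[0], points[-1]
    if t1 <= t0:
        return 0.0
    return (b1 - b0) / ((t1 - t0) / 3600.0)


if __name__ == "__main__":
    for machine, points in sorted(bytes_stored_series(SAMPLE_TRACE.splitlines()).items()):
        print(f"{machine}: {growth_per_hour(points) / 1e6:.1f} MB/hour")
```

Running this over the trace files of the new storage processes before and after a restart should make it easy to see whether the rate of change actually moved.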
We have noticed that during periods of increased write volume from our applications, data distribution rates increase, which is good. However, the unpredictability of distribution is somewhat problematic for us. Specifically, there are times we would prefer to scale out and achieve cluster storage equilibrium before we on-board new customers, as these customers can add significant read/write load to the system.
During periods of data balancing we observe an increase in latency, which can cause lag in processing. Therefore we would prefer to scale and reach equilibrium before we add more load to the cluster. Finally, we have observed cluster-wide performance drop drastically when a storage node becomes full, so we’d like to avoid that scenario. We could of course simply trust that the system will distribute data more aggressively when load is added, but that could cause back pressure and additional latency, degrading our services, and we’d like to avoid that if possible.
It’s not clear to us from the above if upgrading to 6.2 would provide any improvement or if those changes only relate to reclaiming disk space, which we’ve addressed with the
Is there anything we can do to help gather data for https://github.com/apple/foundationdb/issues/1884 ? cc: @rbranson