Batch performance limited due to long fetchkeys?

amanda · March 18, 2022, 1:16am

Hi, I’m on 6.3.15. I’ve noticed that during heavy data distribution activity, my cluster becomes batch performance limited due to high durability lag. The data distribution is triggered by adding and removing some storage fdbservers. The fdbserver with high durability lag does not seem resource-bound in CPU, I/O, etc. I looked through the logs and I think it’s caused by long-running FetchKeys. Some fetches are taking very long:

<Event Severity="30" Time="1647551568.457695" DateTime="2022-03-17T21:12:48Z" Type="FetchKeyCurrentStatus" ID="0000000000000000" Timestamp="1.64755e+09" LongestRunningTime="362.306" StartKey="<key>" EndKey="<key2>" NumRunning="83" Machine="10.67.246.40:4510" LogGroup="default" Roles="SS" />

The StorageMetrics say that we’re always maxing out the FetchKeys parallelism (FetchKeysFetchActive="4000000").
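For anyone wanting to check their own trace logs for the same pattern, here is a minimal Python sketch. It assumes one `<Event ... />` per line, as in the sample above, and uses only the attribute names quoted in this post (`Type`, `LongestRunningTime`, `NumRunning`, `Machine`, `FetchKeysFetchActive`). The 60-second threshold, the function names, and the choice to compare against the 4e6 default of FETCH_KEYS_PARALLELISM_BYTES are illustrative assumptions, not anything defined by FoundationDB.

```python
import re

# Pull attribute="value" pairs out of one trace event line. A regex is used
# instead of an XML parser so that redacted values such as "<key>" do not
# break parsing.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumption: the limit to compare against is the default
# FETCH_KEYS_PARALLELISM_BYTES quoted in this thread (4e6).
PARALLELISM_BYTES = 4_000_000


def parse_event(line):
    """Return the attributes of one trace event line as a dict of strings."""
    return dict(ATTR_RE.findall(line))


def slow_fetches(lines, threshold_s=60.0):
    """Yield (DateTime, Machine, LongestRunningTime, NumRunning) for each
    FetchKeyCurrentStatus event at or above threshold_s (hypothetical cutoff)."""
    for line in lines:
        if 'Type="FetchKeyCurrentStatus"' not in line:
            continue
        ev = parse_event(line)
        longest = float(ev.get("LongestRunningTime", 0))
        if longest >= threshold_s:
            yield (ev.get("DateTime"), ev.get("Machine"),
                   longest, int(ev.get("NumRunning", 0)))


def saturated_samples(lines, limit=PARALLELISM_BYTES):
    """Return (saturated, total) over events carrying FetchKeysFetchActive."""
    saturated = total = 0
    for line in lines:
        ev = parse_event(line)
        if "FetchKeysFetchActive" not in ev:
            continue
        total += 1
        if float(ev["FetchKeysFetchActive"]) >= limit:
            saturated += 1
    return saturated, total
```

To use it, open a trace file and pass the file object to either function, for example `for row in slow_fetches(open(path)): print(row)`, where `path` is whatever trace file you want to inspect.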

Looking at Storage Server Shard Boundary Change, it seems like this could happen if the shard being moved is receiving a high rate of new updates. The shard that was being moved is a very hot write range. Could this be why the cluster was batch performance limited? If so, are there knobs anyone recommends tweaking? I found these:

init( FETCH_BLOCK_BYTES,                                     2e6 );
init( FETCH_KEYS_PARALLELISM_BYTES,                          4e6 ); if( randomize && BUGGIFY ) FETCH_KEYS_PARALLELISM_BYTES = 3e6;
init( FETCH_KEYS_LOWER_PRIORITY,                               0 );

My understanding is that FETCH_KEYS_LOWER_PRIORITY would make it worse. For FETCH_BLOCK_BYTES and FETCH_KEYS_PARALLELISM_BYTES, which would help my situation? I’m not sure whether it would be better to increase throughput so that FetchKeys doesn’t take as long, or to decrease the fetch batch size so there aren’t as many updates within each fetch.

Note that we’ve also had the following knobs set for a long time. They were meant to slow data distribution. Some of the FDB clients write in a sliding window pattern, so the frequent writing of new keys and deleting of old keys resulted in constant high data distribution load, adding I/O pressure.

knob_dd_move_keys_parallelism = 7
knob_dd_rebalance_parallelism = 25
knob_relocation_parallelism_per_source_server = 1

jzhou · March 18, 2022, 3:07am

I wonder if this is the problem fixed by https://github.com/apple/foundationdb/pull/6484. Unfortunately, the fix is in the main branch and has not been cherry-picked to 6.3 yet.

amanda · March 18, 2022, 6:55pm

Ah, thanks! The description in that PR sounds exactly like what we hit. The durability lag didn’t seem to improve at all on its own; we finally got the cluster to recover by manually killing the fdbserver (to stop the data distribution job). It also sounds like this could explain why we’ve seen I/O timeouts on fairly healthy, well-resourced clusters during upgrades.

Will this be backported to 6.3? Is there anything we can do in the meantime to avoid it? And is killing the affected fdbserver the right way to resolve it?
