Batch performance limited due to long FetchKeys?

Hi, I’m on 6.3.15. I’ve noticed that during heavy data distribution activity, my cluster goes batch performance limited due to high durability lag. The data distribution is from adding and removing some storage fdbservers. The fdbserver with high durability lag does not seem resource-bound on CPU/IO/etc. Looking through the logs, I think it’s happening due to long FetchKeys; some fetches are running for a very long time:

<Event Severity="30" Time="1647551568.457695" DateTime="2022-03-17T21:12:48Z" Type="FetchKeyCurrentStatus" ID="0000000000000000" Timestamp="1.64755e+09" LongestRunningTime="362.306" StartKey="<key>" EndKey="<key2>" NumRunning="83" Machine="" LogGroup="default" Roles="SS" />

The StorageMetrics say that we’re always maxing out the FetchKeys parallelism budget (FetchKeysFetchActive="4000000").
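In case it helps anyone else, here’s a rough Python sketch (not FDB tooling, just XML matching against the trace format above) for flagging FetchKeyCurrentStatus events over a threshold. Note that real trace lines include StartKey/EndKey values, which the sample below omits; make sure those are properly escaped before XML parsing.

```python
# Sketch: scan fdbserver XML trace lines for long-running FetchKeys.
# Event and attribute names are taken from the trace event quoted above.
import xml.etree.ElementTree as ET

def long_fetchkeys(lines, threshold_seconds=300.0):
    """Yield (time, longest_running, num_running) for each
    FetchKeyCurrentStatus event whose LongestRunningTime exceeds
    the threshold."""
    for line in lines:
        line = line.strip()
        if 'Type="FetchKeyCurrentStatus"' not in line:
            continue
        ev = ET.fromstring(line)
        longest = float(ev.get("LongestRunningTime", "0"))
        if longest > threshold_seconds:
            yield ev.get("Time"), longest, int(ev.get("NumRunning", "0"))

# Example against (a trimmed version of) the event above:
sample = ('<Event Severity="30" Time="1647551568.457695" '
          'Type="FetchKeyCurrentStatus" LongestRunningTime="362.306" '
          'NumRunning="83" />')
for t, longest, running in long_fetchkeys([sample]):
    print(t, longest, running)
```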

Looking at Storage Server Shard Boundary Change, it seems like this could happen if the shard being moved is getting a high rate of new updates, and the shard being moved here is a very hot write range. Could this be why the cluster was batch performance limited? If so, are there any knobs people would recommend tweaking? I found these:

init( FETCH_BLOCK_BYTES,                                     2e6 );
init( FETCH_KEYS_PARALLELISM_BYTES,                          4e6 ); if( randomize && BUGGIFY ) FETCH_KEYS_PARALLELISM_BYTES = 3e6;
init( FETCH_KEYS_LOWER_PRIORITY,                               0 );

My understanding is that FETCH_KEYS_LOWER_PRIORITY would make it worse. Between FETCH_BLOCK_BYTES and FETCH_KEYS_PARALLELISM_BYTES, which would help my situation? I’m not sure whether it’s better to increase throughput so that FetchKeys doesn’t take as long, or to decrease the fetch batch size so there aren’t as many updates to apply within each fetch.
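For anyone else experimenting with this: these can be overridden the same way as the knobs below, with a lowercase `knob_` prefix in the fdbserver config. A sketch (the values are purely illustrative, not recommendations; I haven’t verified which direction helps):

```
# Illustrative only: raise the concurrent-fetch byte budget (default 4e6)
knob_fetch_keys_parallelism_bytes = 8000000
# Or instead: shrink each fetch block (default 2e6)
knob_fetch_block_bytes = 1000000
```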

Note that we’ve also had the following knobs set for a long time; these were to slow down data distribution. Some of the FDB clients write in a sliding-window pattern, so the frequent writing of new keys plus deleting of old keys resulted in constant high data distribution load, adding IO pressure.

knob_dd_move_keys_parallelism = 7
knob_dd_rebalance_parallelism = 25
knob_relocation_parallelism_per_source_server = 1

I wonder if this is the problem fixed by [PR]. Unfortunately, the fix is in the main branch and hasn’t been cherry-picked to 6.3 yet.

Ah, thanks! The description in that PR sounds exactly like what we hit. The durability lag didn’t seem to improve at all on its own; we finally got the cluster to recover by manually killing the fdbserver (to stop the data distribution job). It also sounds a lot like this could explain the IO timeouts we’ve seen on fairly healthy, well-resourced clusters during upgrades.

Will this be backported to 6.3? Is there anything we can do in the meantime to avoid it? And is killing the affected fdbserver the right way to resolve it?