Seeing 'FinishMoveKeysTooLong' and 'RelocateShardTooLong' during data rebalancing

Hello,

We added a couple of new nodes with extra storage processes to an existing 5-node cluster.
We occasionally see this error in the logs and are not sure what it means. Could someone help me understand it better? Is this something I need to be worried about?

FDB version - FoundationDB 6.2 (v6.2.15)

Event Severity="30" Time="1595560530.339613" Type="RelocateShardTooLong" ID="0000000000000000" Duration="621.046" Dest="150305ac25ba33d73affbcc775623b27,8c591274b23a33346823f3d74e354919" Src="6262d0fd036dad4ffd2a9d09006e7716,6e22f6457d66f4acd9c6606649a85410" Machine="172.41.21.21:4510" LogGroup="default" Roles="DD"
Event Severity="30" Time="1595560544.320335" Type="FinishMoveKeysTooLong" ID="0000000000000000" Duration="600" Servers="8c591274b23a33346823f3d74e354919,c5d02c8049f05de123f951ed57ae255e" Machine="172.41.21.21:4510" LogGroup="default" Roles="DD"

Data distribution rebalances data across nodes by relocating shards.
The message means that relocating a shard is taking too long (more than 10 minutes, which matches the ~600-second durations in your events).

This may be caused by:

  1. The source storage servers are busy and don't have spare bandwidth to handle the shard-relocation workload; in that case you need to reduce client traffic.
  2. The storage servers may not have enough CPU to handle the traffic.
  3. The network between the nodes may have low bandwidth.
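
If it helps, here is a rough way to check those three things from the machine-readable status. This is just a sketch using the Python bindings and the \xff\xff/status/json special key; the field names are the ones I remember from the 6.2 status JSON, so treat them as assumptions and adjust if your output differs:

```python
import json
import fdb

fdb.api_version(620)
db = fdb.open()  # uses the default cluster file

# The \xff\xff/status/json special key returns the same JSON document
# that `status json` prints in fdbcli.
status = json.loads(bytes(db[b'\xff\xff/status/json']))
cluster = status['cluster']

# How much data is currently queued / in flight for relocation.
moving = cluster['data']['moving_data']
print('in_queue_bytes :', moving['in_queue_bytes'])
print('in_flight_bytes:', moving['in_flight_bytes'])

# Per-process CPU and disk busyness, to spot overloaded storage servers.
for pid, proc in cluster['processes'].items():
    roles = ','.join(r['role'] for r in proc.get('roles', []))
    if 'storage' in roles:
        print(proc['address'], roles,
              'cpu_cores=%.2f' % proc['cpu']['usage_cores'],
              'disk_busy=%.2f' % proc['disk']['busy'])
```

If in_queue_bytes stays high while the storage processes show high CPU or disk busyness, that points to causes 1 or 2 above.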

Thank you @mengxu for the quick response. There was a CPU spike and the disks were under heavy utilization around the time of the errors, but eventually those settled down.
From what you explained, I assume it is a warning that shard relocation is taking too long, but that the shard will eventually be relocated. Forgive my ignorance here, but does this have any other bad effects on the system apart from a longer data rebalance time?

Severity=30 messages are warnings instead of errors.

Data rebalancing affects:

  1. The disk utilization on each node, and
  2. Load balancing of read and write traffic. If a shard is not moved to the destination storage servers quickly, it may become a hot shard (i.e., too much traffic concentrated on a single shard). A hot shard is bad for the cluster because Ratekeeper will eventually kick in and throttle the cluster's traffic.

Back to your question: the behavior you saw is a hazard to the cluster's health because it takes longer to balance the load.
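
If you want to see whether Ratekeeper is already limiting the cluster, the qos section of the same status JSON shows it. Again just a sketch; the field names are assumed from what I remember of the 6.x status output:

```python
import json
import fdb

fdb.api_version(620)
db = fdb.open()

qos = json.loads(bytes(db[b'\xff\xff/status/json']))['cluster']['qos']

# What Ratekeeper is currently limiting the transaction rate on, and how
# the imposed limit compares to the rate clients are actually releasing.
print('limited by   :', qos['performance_limited_by']['name'])
print('tps limit    :', qos['transactions_per_second_limit'])
print('released tps :', qos['released_transactions_per_second'])
```

If I remember correctly, a name of "workload" means no resource limit is being hit; anything else (e.g., a storage server queue reason) means Ratekeeper is throttling.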

Understood and thank you @mengxu.

We had a 5-node cluster (2 x 1.9 TB NVMe per node) with 28 storage processes in total (running on 7 of the NVMe drives) and 3 Tx processes (running on the remaining 3 NVMe drives). The cluster had ~2.9 TB of KV data and ~6 TB of disk used (RF=2).
We added 2 more similar nodes, each with 8 additional storage processes (4 per NVMe) and 2 stateless processes.

We added the 2 nodes in one go; is that okay, or is it recommended to add one node at a time? Are there any recommendations regarding tuning we should do?

It is better to add multiple storage server nodes at once, so what you did is preferred.

Say the replication factor is k (2 in your case); it is better to add >= k storage servers at once when expanding the cluster. Otherwise, every replica team that includes a new storage server must also include existing servers, so some existing storage servers may see a temporary increase in storage usage while the new SS is being filled.

For tLog servers, it does not matter that much, but adding them all at once helps reduce the number of cluster recoveries, so IMO it is still preferred.
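
If you want to sanity-check the expansion afterwards, the same status JSON exposes the recovery state, a generation counter that (as far as I know) increments with each recovery, and the remaining data movement, so you can confirm things have settled down. Another rough sketch under those assumptions:

```python
import json
import time
import fdb

fdb.api_version(620)
db = fdb.open()

# Poll until data distribution has drained its relocation queue after
# the new nodes were added.
while True:
    cluster = json.loads(bytes(db[b'\xff\xff/status/json']))['cluster']
    moving = cluster['data']['moving_data']
    print('recovery_state:', cluster['recovery_state']['name'],
          'generation:', cluster['generation'],
          'in_queue_bytes:', moving['in_queue_bytes'])
    if moving['in_queue_bytes'] == 0 and moving['in_flight_bytes'] == 0:
        break
    time.sleep(60)
```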
