Hi,
We have a weird situation where some of our FoundationDB clusters get stuck while excluding a node. Here are the symptoms after excluding a node:
- `moving_data.in_flight_bytes` plateaus for hours
- while `moving_data.in_flight_bytes` was constant, we looked at the traces related to `*shard*` events. The outlier was clearly `GetShardStateReadyDD`
- killing the `DataDistributor` fixed the issue
Note: re-including and re-excluding the nodes does not help.
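For anyone who wants to check for the same symptom, here is a rough sketch of how such a plateau could be detected from periodic `fdbcli --exec "status json"` samples, reading `cluster.data.moving_data.in_flight_bytes`. The function names, the 1% tolerance and the window of 6 samples are illustrative assumptions, not what our monitoring actually runs.

```python
import json
import subprocess
from typing import Sequence


def fetch_status() -> dict:
    """Run `fdbcli --exec "status json"` and parse its output.

    Assumes fdbcli is on PATH and the default cluster file is usable.
    """
    out = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def in_flight_bytes(status: dict) -> int:
    """Extract cluster.data.moving_data.in_flight_bytes from a parsed status document."""
    return status["cluster"]["data"]["moving_data"]["in_flight_bytes"]


def is_plateau(samples: Sequence[int], min_samples: int = 6, tolerance: float = 0.01) -> bool:
    """Return True if the last `min_samples` values are non-zero and flat.

    "Flat" means the spread (max - min) of the window is at most
    `tolerance` times the window maximum. Both thresholds are arbitrary
    choices for illustration.
    """
    if len(samples) < min_samples:
        return False
    window = list(samples)[-min_samples:]
    low, high = min(window), max(window)
    if high == 0:
        return False  # nothing in flight, so data movement is not stuck
    return (high - low) <= tolerance * high
```

Calling `in_flight_bytes(fetch_status())` at a fixed interval (say once a minute), appending the result to a list and passing that list to `is_plateau` would flag the situation described above, where the value stays non-zero and flat.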
Some context about the cluster we are running:
- the FoundationDB version is 7.2.9
- the cluster is running in `three_data_hall` mode on 9 nodes
- the cluster is running the Redwood storage engine
Do you have any tips on how to troubleshoot this kind of issue? Do you have any thoughts about what the root cause could be?