FoundationDB stuck excluding a node


We have a weird situation where some of our FoundationDB clusters get stuck while excluding a node. Here are the symptoms after excluding a node:

  • moving_data.in_flight_bytes plateaus for hours
  • while moving_data.in_flight_bytes stayed constant, we looked at the trace events whose type matches *shard*. The clear outlier was GetShardStateReadyDD
  • killing the DataDistributor fixed the issue
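
For reference, here is a minimal Python sketch of how the first two symptoms can be checked. It is an illustration under assumptions, not the exact tooling we used: the function names, the sample window, and the 1% tolerance are hypothetical; the trace files are assumed to be in the default XML format with a Type="..." attribute per event; and "outlier" is taken here to mean the *shard* event type with the highest count. The metric is read from cluster.data.moving_data.in_flight_bytes in the output of fdbcli's "status json".

```python
import json
import re
import subprocess
from collections import Counter


def read_status(cluster_file=None):
    """Run `fdbcli --exec "status json"` and return the parsed JSON document."""
    cmd = ["fdbcli", "--exec", "status json"]
    if cluster_file:
        # -C selects the cluster file to connect with.
        cmd[1:1] = ["-C", cluster_file]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def in_flight_bytes(status):
    """Extract moving_data.in_flight_bytes from a parsed status document."""
    return status["cluster"]["data"]["moving_data"]["in_flight_bytes"]


def is_plateau(samples, min_samples=6, tolerance=0.01):
    """Return True when the last `min_samples` readings are all non-zero
    and differ by at most `tolerance` times their mean (assumed thresholds)."""
    recent = samples[-min_samples:]
    if len(recent) < min_samples or min(recent) == 0:
        return False
    mean = sum(recent) / len(recent)
    return max(recent) - min(recent) <= tolerance * mean


# Matches the Type attribute of an XML trace event, but not DateTime="...".
EVENT_TYPE = re.compile(r'\bType="([^"]*)"')


def count_shard_events(lines):
    """Count trace events whose Type contains "shard" (case-insensitive)."""
    counts = Counter()
    for line in lines:
        match = EVENT_TYPE.search(line)
        if match and "shard" in match.group(1).lower():
            counts[match.group(1)] += 1
    return counts
```

To use it, call in_flight_bytes(read_status()) at a fixed interval, append each reading to a list, and pass that list to is_plateau. For the traces, pass the lines of a trace file to count_shard_events and look at most_common() on the result.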

Note: re-including and re-excluding the nodes does not help.

Some context about the cluster we are running:

  • the FoundationDB version is 7.2.9
  • the cluster runs in three_data_hall redundancy mode on 9 nodes
  • the cluster uses the Redwood storage engine

Do you have any tips on how to troubleshoot this kind of issue? Do you have any thoughts on what the root cause could be?