FoundationDB stuck excluding a node

COran · February 28, 2024, 3:38pm

Hi,

We have a weird situation where some of our FoundationDB clusters get stuck while excluding a node. Here are the symptoms we see after excluding a node:

moving_data.in_flight_bytes plateaus for hours
while moving_data.in_flight_bytes stayed constant, we looked at the trace events matching *shard*; the clear outlier was GetShardStateReadyDD
killing the DataDistributor fixed the issue

Note: re-including and re-excluding the node does not help.
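
For reference, here is a minimal sketch of how the plateau can be detected and the DataDistributor process located from `status json`. It assumes `fdbcli` is on the PATH and that `fdbcli --exec "status json"` prints only the JSON document. The function names, the window size, and the tolerance are illustrative choices, not part of FoundationDB.

```python
import json
import subprocess


def read_status(cluster_file=None):
    """Run `fdbcli --exec "status json"` and return the parsed document.

    Assumes fdbcli is on the PATH and prints only JSON for this command.
    """
    cmd = ["fdbcli"]
    if cluster_file:
        cmd += ["-C", cluster_file]
    cmd += ["--exec", "status json"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def in_flight_bytes(status):
    """Extract cluster.data.moving_data.in_flight_bytes from a status document."""
    return status["cluster"]["data"]["moving_data"]["in_flight_bytes"]


def is_plateau(samples, min_samples=6, tolerance=0.01):
    """Return True if the last `min_samples` values are non-zero and nearly constant.

    `min_samples` and `tolerance` are arbitrary illustrative thresholds.
    """
    if len(samples) < min_samples:
        return False
    window = samples[-min_samples:]
    lo, hi = min(window), max(window)
    if lo == 0:
        # Nothing in flight means data movement has finished, not stalled.
        return False
    return (hi - lo) <= tolerance * hi


def data_distributor_addresses(status):
    """List the addresses of processes that report the data_distributor role."""
    addresses = []
    for proc in status["cluster"].get("processes", {}).values():
        for role in proc.get("roles", []):
            if role.get("role") == "data_distributor":
                addresses.append(proc.get("address"))
    return addresses
```

Calling `read_status()` on a fixed interval and appending `in_flight_bytes(status)` to a list gives the samples for `is_plateau`. When it reports a plateau, `data_distributor_addresses(status)` gives the address of the process we ended up killing.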

Some context about the cluster we are running:

Do you have any tips on how to troubleshoot this kind of issue? Do you have any thoughts about what the root cause could be?
