We're running 5.2, and on one 4-node cluster there is constant data movement that saturates the disks at 100% and never seems to finish (there is always a residual 4G-12G of data left to move, even while the cluster reports healthy). Looking at the “SendRelocateToDDQx100” messages, there seem to be shards with teams of 3 to 4 processes even though the cluster is only doubly redundant. Adding 4 new nodes and excluding the older 4 never completes either, and the constant disk activity continues. It seems the system can't tell that a data move has completed and keeps telling the storage servers to move data.
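For anyone who wants to check the team sizes the same way, here is a minimal sketch of tallying them from XML-style trace log lines. It assumes each event sits on one line with `Name="value"` attributes including `Type`. The attribute that lists the team members (`Team` here) is a hypothetical name, as is the comma-separated format; substitute whatever your trace events actually contain.

```python
import re
from collections import Counter

# Captures Name="value" attribute pairs from one trace event line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def team_size_histogram(lines, event_type="SendRelocateToDDQx100", team_attr="Team"):
    """Count events of `event_type` by the number of members in `team_attr`.

    `team_attr` is a hypothetical attribute name, and the comma-separated
    member list is an assumed format; adjust both to match the real logs.
    """
    sizes = Counter()
    for line in lines:
        attrs = dict(ATTR_RE.findall(line))
        # Skip other event types and events without the team attribute.
        if attrs.get("Type") != event_type or team_attr not in attrs:
            continue
        members = [m for m in attrs[team_attr].split(",") if m.strip()]
        sizes[len(members)] += 1
    return sizes
```

On a doubly redundant cluster you would expect nearly all events to land in the size-2 bucket, so any counts under 3 or 4 are the ones worth looking at.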
Do you know if the master died when you tried the exclude? If not, what happens if you kill it?
(also sent you an email)
We did reboot the cluster a number of times, and it goes into a weird state showing 40T of data movement (the cluster has 8T of KVs). That number drops quickly, but the cluster just keeps constantly moving data afterwards. It’s almost like it’s trying to move data to all nodes.