I misconfigured some of our nodes to use the wrong disk, and when these nodes became full, our cluster came to a full stop. It still reports status in fdbcli, but the transaction commit rate dropped to only about 1 Hz, which makes the cluster basically dead.
Also, since these disks are, unfortunately, HDDs, it seems it would take a while to exclude them from the cluster.
This behaviour is extremely surprising. Does this mean that we must have a lot of extra capacity and that all nodes have to use disks of the same size?
UPD: Eventually the cluster became “Healthy (Repartitioning)”, but it still can’t answer queries.
UPD2: Also, I have to exclude them forcibly, since I am getting “ERROR: This exclude may cause the total free space in the cluster to drop below 10%.”, but that’s not true at all.
UPD3: Removing the one that had zero available space AND restarting the cluster (kill all) worked. Adding this node back to the cluster immediately kills the cluster.
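
For anyone who wants to catch this before it happens, here is a rough sketch of how I would look for nearly full processes in the output of status json. It assumes each entry under cluster.processes has a disk object with free_bytes and total_bytes, plus an address field, which is how I understand the status document; please check against your own output. The function name and the sample data are made up for illustration, and the 10% threshold only mirrors the number in the error message above, not necessarily how FoundationDB computes it.

```python
def low_space_processes(status, min_free_ratio=0.10):
    """Return (address, free_ratio) pairs for processes below the free-space threshold.

    `status` is the parsed `status json` document (a dict).
    Field names are an assumption about the status schema.
    """
    flagged = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        disk = proc.get("disk", {})
        total = disk.get("total_bytes")
        free = disk.get("free_bytes")
        if not total or free is None:
            continue  # skip processes that do not report disk usage
        ratio = free / total
        if ratio < min_free_ratio:
            flagged.append((proc.get("address", "unknown"), ratio))
    # Fullest disks first
    return sorted(flagged, key=lambda item: item[1])


# Made-up sample shaped like the assumed schema: one healthy process, one full one.
sample = {
    "cluster": {
        "processes": {
            "a": {"address": "10.0.0.1:4500",
                  "disk": {"free_bytes": 500, "total_bytes": 1000}},
            "b": {"address": "10.0.0.2:4500",
                  "disk": {"free_bytes": 0, "total_bytes": 1000}},
        }
    }
}

for address, ratio in low_space_processes(sample):
    print(f"{address}: {ratio:.1%} free")
```

In practice the dict would come from json.loads applied to the saved output of status json in fdbcli.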