The DC version difference remains the same this morning.
Accessing “status json” takes longer than before and often times out.
The status changed slightly: from “Healthy (Repartitioning)” to “HEALING”.
Last night the cluster was in this state:
Fault Tolerance - 1 zone (2 without data loss) …
Replication health - Healthy (Repartitioning)
Moving data - 10.087 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.838 TB
This morning:
Fault Tolerance - 1 zone (2 without data loss) …
Replication health - HEALING: Only two replicas remain of some data
Moving data - 19.634 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.745 TB
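For anyone who wants to track these numbers over time instead of comparing them by eye, here is a minimal Python sketch that pulls the same fields out of “status json”. The field names (`cluster.data.state`, `cluster.data.moving_data`, `total_kv_size_bytes`, `total_disk_used_bytes`) are my reading of the status schema and should be checked against your FDB version. The helper names and the 30-second timeout are made up for illustration.

```python
import json
import subprocess


def fetch_status(timeout_s: float = 30.0) -> dict:
    """Run `fdbcli --exec "status json"` and parse the output.

    Raises subprocess.TimeoutExpired if fdbcli does not answer in time,
    which is itself a useful signal when the cluster is struggling.
    """
    result = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return json.loads(result.stdout)


def summarize_data_status(status: dict) -> dict:
    """Extract the replication/data figures from a parsed status document.

    Field names are assumptions based on the status json schema;
    missing fields come back as None rather than raising.
    """
    data = status.get("cluster", {}).get("data", {})
    state = data.get("state", {})
    moving = data.get("moving_data", {})
    return {
        "state": state.get("name"),
        "min_replicas_remaining": state.get("min_replicas_remaining"),
        "in_flight_bytes": moving.get("in_flight_bytes"),
        "in_queue_bytes": moving.get("in_queue_bytes"),
        "total_kv_size_bytes": data.get("total_kv_size_bytes"),
        "total_disk_used_bytes": data.get("total_disk_used_bytes"),
    }
```

Calling `summarize_data_status(fetch_status())` on a schedule and logging the result would give a timeline of the state transitions and the amount of data being moved.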
We decided to restart the pod. After we shut down the misbehaving pod/node, the fdb cluster recovered quickly: it went from HEALING to Repartitioning to Rebalancing, and then to fully Healthy, in about 5 minutes.
The DC version difference returned to normal (<5M). The query in PrimaryDatacenter was fine. So the cluster is fine now.
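The “normal” check on the DC version difference can be automated along these lines. This is only a sketch: I am assuming the lag is reported under `cluster.datacenter_lag.versions` (with `cluster.datacenter_version_difference` as a fallback for older releases), so verify the field name against your own “status json” output. The 5M threshold is the figure mentioned above, and the function names are hypothetical.

```python
from typing import Optional

# Threshold taken from the "<5M" figure in this thread.
DC_LAG_NORMAL_VERSIONS = 5_000_000


def dc_lag_versions(status: dict) -> Optional[int]:
    """Return the DC version difference from a parsed status document.

    Both field names are assumptions about the status schema;
    returns None if neither is present.
    """
    cluster = status.get("cluster", {})
    lag = cluster.get("datacenter_lag")
    if isinstance(lag, dict) and "versions" in lag:
        return lag["versions"]
    return cluster.get("datacenter_version_difference")


def dc_lag_is_normal(status: dict,
                     threshold: int = DC_LAG_NORMAL_VERSIONS) -> bool:
    """True only when a lag value is present and below the threshold."""
    versions = dc_lag_versions(status)
    return versions is not None and versions < threshold
```

A missing value is treated as “not normal” on purpose, so that a timed-out or incomplete status response does not look like a healthy cluster.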