DC1 and DC3 got out of sync

URGENT HELP NEEDED, Please!

We have a 2-region, 3-DC fdb cluster that seems to be getting out of sync between DC1 and DC3. When we query the primary_datacenter, we get an error.

fdb> get \xff/primaryDatacenter # (edited)
ERROR: Request for future version (1009)

The Datacenter Version Difference is a huge number, something like 15079903111970.
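
For perspective: if versions advance at roughly 1,000,000 per second (our reading of the docs, so treat that rate as an assumption), the difference corresponds to an enormous lag:

python3 -c 'v = 15079903111970; print(v / 1e6 / 86400, "days behind at ~1M versions/sec")'
# prints roughly 174 days, which is why the number looks so alarming to us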

How can we make it sync up?

We were upgrading the host OS of the Kubernetes nodes on which the fdb pods are deployed. That might have affected the db.

Your prompt help is highly appreciated.

primary_datacenter: you have a typo there.

Good catch, Sher.
I copied an old, wrong cmd. It should be like this:
fdb> get \xff/primaryDatacenter
ERROR: Request for future version (1009)

I edited my original post to eliminate the confusion. Our problem remains, though. Thank you.

More details:

The DC version difference remains the same this morning.
Running “status json” takes a long time and often times out.
A small change in status: replication health went from “Healthy (Repartitioning)” to “HEALING”.

Last night the cluster was in this state:

Fault Tolerance - 1 zone (2 without data loss) …
Replication health - Healthy (Repartitioning)
Moving data - 10.087 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.838 TB

This morning:

Fault Tolerance - 1 zone (2 without data loss) …
Replication health - HEALING: Only two replicas remain of some data
Moving data - 19.634 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.745 TB
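
These numbers are from fdbcli status; when “status json” does respond, the same fields can be pulled out with something like the following (the field paths are our reading of the status json schema, so take them as an assumption):

fdbcli --exec 'status json' > status.json
jq '{state: .cluster.data.state.name,
     moving_bytes: .cluster.data.moving_data.in_flight_bytes,
     kv_bytes: .cluster.data.total_kv_size_bytes,
     disk_bytes: .cluster.data.total_disk_used_bytes}' status.json
# state should show something like "healing" or "healthy_repartitioning"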

We found that a TX pod has a very high number (around 7000) of severity-30 error messages of Type=“TooManyConnectionsClosed”. Here is an example:

trace.10.104.218.193.4000.1603173783.uuesG5.0.1.xml:<Event Severity="30" Time="1603173806.861476" Type="TooManyConnectionsClosed" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.175.151.130:4500:tls" Machine="10.104.218.193:4000" LogGroup="default" />
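
For reference, the count came from grepping the trace logs on that pod, roughly like this:

# run from the pod's trace-log directory (the path depends on your deployment)
grep -c 'TooManyConnectionsClosed' trace.10.104.218.193.4000.*.xml        # per-file counts
cat trace.10.104.218.193.4000.*.xml | grep -c 'TooManyConnectionsClosed'  # total across all files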

What does it mean?
What’s the proper action to take on the pod?
Thanks.

We decided to restart the pod. After we shut down the misbehaving pod/node, the fdb cluster recovered pretty quickly, going from HEALING to Repartitioning to Rebalancing to fully Healthy in about 5 minutes.
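
For anyone hitting the same thing, the restart itself was nothing special; the pod name and namespace below are placeholders, not our real ones:

kubectl delete pod <tx-pod-name> -n <fdb-namespace>   # let Kubernetes recreate the pod
watch -n 30 fdbcli --exec status                      # watch replication health return to Healthy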

The DC version difference returned to normal (<5M). The query for the primary datacenter key works fine again. So the cluster is fine now.
