DC1 and DC3 got out of sync

URGENT HELP NEEDED, Please!

We have a 2-region, 3-DC fdb cluster that seems to be getting out of sync between DC1 and DC3. When we query the primary_datacenter, we get an error.

fdb> get \xff/primaryDatacenter # (edited)
ERROR: Request for future version (1009)

The Datacenter Version Difference is a huge number, something like 15079903111970.
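
For perspective: if versions advance at roughly 1,000,000 per second (our reading of the docs, so treat that rate as an assumption), the difference corresponds to an enormous lag:

python3 -c 'v = 15079903111970; print(v / 1e6 / 86400, "days behind at ~1M versions/sec")'
# prints roughly 174 days, which is why the number looks so alarming to us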

How can we make it sync up?

We were upgrading the host OS of the Kubernetes nodes on which the fdb pods are deployed. That might have affected the db.

Your prompt help is highly appreciated.

primary_datacenter: you have a typo there.

Good catch, Sher.
I copied an old, wrong cmd. It should be like this:
fdb> get \xff/primaryDatacenter
ERROR: Request for future version (1009)

I edited my original post to eliminate the confusion. Our problem remains, though. Thank you.

More details:

The DC version difference remains the same this morning.
Running “status json” takes a long time and often times out.
A small change in status: replication health went from “Healthy (Repartitioning)” to “HEALING”.

Last night the cluster was in this state:

Fault Tolerance - 1 zone (2 without data loss) …
Replication health - Healthy (Repartitioning)
Moving data - 10.087 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.838 TB

This morning:

Fault Tolerance - 1 zone (2 without data loss) …
Replication health - HEALING: Only two replicas remain of some data
Moving data - 19.634 GB
Sum of key-value sizes - 3.988 TB
Disk space used - 31.745 TB
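
These numbers are from fdbcli status; when “status json” does respond, the same fields can be pulled out with something like the following (the field paths are our reading of the status json schema, so take them as an assumption):

fdbcli --exec 'status json' > status.json
jq '{state: .cluster.data.state.name,
     moving_bytes: .cluster.data.moving_data.in_flight_bytes,
     kv_bytes: .cluster.data.total_kv_size_bytes,
     disk_bytes: .cluster.data.total_disk_used_bytes}' status.json
# state should show something like "healing" or "healthy_repartitioning"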

We found that a TX pod has a very high number (around 7000) of severity-30 error messages of Type=“TooManyConnectionsClosed”. Here is an example:

trace.10.104.218.193.4000.1603173783.uuesG5.0.1.xml:<Event Severity="30" Time="1603173806.861476" Type="TooManyConnectionsClosed" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.175.151.130:4500:tls" Machine="10.104.218.193:4000" LogGroup="default" />
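
For reference, the count came from grepping the trace logs on that pod, roughly like this:

# run from the pod's trace-log directory (the path depends on your deployment)
grep -c 'TooManyConnectionsClosed' trace.10.104.218.193.4000.*.xml        # per-file counts
cat trace.10.104.218.193.4000.*.xml | grep -c 'TooManyConnectionsClosed'  # total across all files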

What does it mean?
What’s the proper action to take on the pod?
Thanks.

We decided to restart the pod. After we shut down the misbehaving pod/node, the fdb cluster recovered pretty quickly, going from HEALING to Repartitioning to Rebalancing to fully Healthy in about 5 minutes.
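
For anyone hitting the same thing, the restart itself was nothing special; the pod name and namespace below are placeholders, not our real ones:

kubectl delete pod <tx-pod-name> -n <fdb-namespace>   # let Kubernetes recreate the pod
watch -n 30 fdbcli --exec status                      # watch replication health return to Healthy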

The DC version difference returned to normal (<5M). The query for the primary datacenter key works fine again. So the cluster is fine now.
