Locking coordination state with DR

I set up DR between two FDB clusters. The initial sync of the 400 GB database was slow, and after a day or two things got stuck.

At the destination cluster, I got this:

Locking coordination state. Verify that a majority of coordination server
processes are active.

  10.X.Y.126:4500:tls  (reachable)
  10.X.Y.150:4500:tls  (reachable)
  10.X.Y.119:4000:tls  (reachable)
  10.X.Y.23:4500:tls  (reachable)
  10.X.Y.20:4000:tls  (reachable)
  10.X.Y.173:4000:tls  (reachable)
  10.X.Y.111:4500:tls  (reachable)
  10.X.Y.160:4500:tls  (reachable)
  10.X.Y.145:4500:tls  (reachable)

All the coordinators are reachable. How can I get out of the locking coordination state? Thanks.

Leo

We are running FDB v6.2.27.

Both the source and destination clusters are 2-region/3-DC clusters. I am not sure whether DR works between 2-region clusters.

So our current 3-DC clusters already have redundancy. Our purpose in using DR is to migrate FDB clusters to another platform with minimal downtime and to set up an additional location for applications to read from (using the READ_LOCK_AWARE option).
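
For context, reading from a locked DR destination with the Python bindings would look roughly like the sketch below. This is only a minimal illustration; the cluster-file path and key are placeholders, not our actual setup:

    import fdb

    # Hypothetical sketch: read from the locked DR destination cluster
    # using the FDB Python bindings. Path and key are placeholders.
    fdb.api_version(620)
    db = fdb.open('/etc/foundationdb/dr-destination.cluster')

    @fdb.transactional
    def read_key(tr, key):
        # READ_LOCK_AWARE lets a read-only transaction run against a
        # locked database, which is the state of a DR destination.
        tr.options.set_read_lock_aware()
        val = tr[key]
        return val if val.present() else None

    print(read_key(db, b'some-application-key'))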

Locking coordination state usually means that not all tlog servers are available.

Oleg, good hint. I looked into the topology reports of the destination cluster from before and after the DR, and found that in the after report many roles had disappeared, affecting all 3 DCs.

The BEFORE report:

FDB PROCESS Breakdown (Count) by DC and ROLE

 CNT   DC-Role
----  -------------
   1  dc1  cluster_controller
   1  dc1  data_distributor
  10  dc1  log
   1  dc1  master
   2  dc1  proxy
   1  dc1  ratekeeper
   1  dc1  resolver
  48  dc1  storage

  10  dc2  log

  10  dc3  log
   9  dc3  router
  48  dc3  storage

The AFTER report:

FDB PROCESS Breakdown (Count) by DC and ROLE

 CNT   DC-Role
----  -------------
   1  dc1  cluster_controller
   1  dc1  data_distributor
   1  dc1  master
   1  dc1  ratekeeper

   3  dc2  coordinator

   3  dc3  coordinator

The above reports were produced by a script from the status json outputs.
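
For reference, a minimal sketch of what that kind of script does, assuming the status json output was saved to a file first (the file name is just an example); it counts processes per DC and role from cluster.processes:

    import json
    import sys
    from collections import Counter

    # Minimal sketch of the breakdown script: count processes per (DC, role)
    # from a saved `fdbcli --exec "status json"` output, e.g.
    #   python breakdown.py status-before.json
    with open(sys.argv[1]) as f:
        status = json.load(f)

    counts = Counter()
    for proc in status["cluster"]["processes"].values():
        dc = proc.get("locality", {}).get("dcid", "unknown")
        for role in proc.get("roles", []):
            counts[(dc, role["role"])] += 1

    print(" CNT   DC-Role")
    print("----  -------------")
    for (dc, role), cnt in sorted(counts.items()):
        print(f"{cnt:4}  {dc}  {role}")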

When I looked into the status json files, the before-json had a configuration section:

        "configuration" : {
            "coordinators_count" : 9,
            "excluded_servers" : [
            ],
            "log_spill" : 2,
            "logs" : 30,
            "proxies" : 2,
            "redundancy_mode" : "triple",
            "regions" : [
                {
                    "datacenters" : [
                        {
                            "id" : "dc1",
                            "priority" : 2
                        },
                        {
                            "id" : "dc2",
                            "priority" : 0,
                            "satellite" : 1
                        }
                    ],
                    "satellite_logs" : 10,
                    "satellite_redundancy_mode" : "one_satellite_double"
                },
                {
                    "datacenters" : [
                        {
                            "id" : "dc3",
                            "priority" : 1
                        }
                    ]
                }
            ],
            "storage_engine" : "ssd-2",
            "usable_regions" : 2
        },

However, the entire configuration section is gone in the after-json.
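
A quick way to confirm this, assuming the two outputs were saved as status-before.json and status-after.json (the names are mine):

    import json

    # Check whether the "configuration" section exists in each saved
    # status json file (file names are placeholders).
    for name in ("status-before.json", "status-after.json"):
        with open(name) as f:
            cluster = json.load(f)["cluster"]
        print(name, "has configuration:", "configuration" in cluster)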

This is unexpected. It seems to me that DR conflicts with the destination cluster’s 2-region/3-DC architecture.

@osamarin Do you have any idea? Thanks.