Locking coordination state with DR

I set up DR between two FDB clusters. The initial sync of the 400 GB database was slow, and after a day or two things got stuck.

At the destination cluster, I got this:

Locking coordination state. Verify that a majority of coordination server
processes are active.

  10.X.Y.126:4500:tls  (reachable)
  10.X.Y.150:4500:tls  (reachable)
  10.X.Y.119:4000:tls  (reachable)
  10.X.Y.23:4500:tls  (reachable)
  10.X.Y.20:4000:tls  (reachable)
  10.X.Y.173:4000:tls  (reachable)
  10.X.Y.111:4500:tls  (reachable)
  10.X.Y.160:4500:tls  (reachable)
  10.X.Y.145:4500:tls  (reachable)

All the coordinators are reachable. How can I get out of the locking coordination state? Thanks.

Leo

We are running FDB v6.2.27.

Both the source and destination clusters are 2-region/3-DC clusters. I am not sure whether DR works between 2-region clusters.

So our current 3-DC clusters already have redundancy. Our purpose in using DR is to migrate FDB clusters to another platform with minimal downtime and to set up an additional location for applications to read from (using the READ_LOCK_AWARE option).
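
For context, reading from a locked DR destination with the Python bindings would look roughly like the sketch below. This is only a minimal illustration; the cluster-file path and key are placeholders, not our actual setup:

    import fdb

    # Hypothetical sketch: read from the locked DR destination cluster
    # using the FDB Python bindings. Path and key are placeholders.
    fdb.api_version(620)
    db = fdb.open('/etc/foundationdb/dr-destination.cluster')

    @fdb.transactional
    def read_key(tr, key):
        # READ_LOCK_AWARE lets a read-only transaction run against a
        # locked database, which is the state of a DR destination.
        tr.options.set_read_lock_aware()
        val = tr[key]
        return val if val.present() else None

    print(read_key(db, b'some-application-key'))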

Locking coordination state usually means that not all tlog servers are available.

Oleg, good hint. I looked into the topology reports of the destination cluster from before and after the DR, and found that in the after report many roles had disappeared, affecting all 3 DCs.

The BEFORE report:

FDB PROCESS Breakdown (Count) by DC and ROLE

 CNT   DC-Role
----  -------------
   1  dc1  cluster_controller
   1  dc1  data_distributor
  10  dc1  log
   1  dc1  master
   2  dc1  proxy
   1  dc1  ratekeeper
   1  dc1  resolver
  48  dc1  storage

  10  dc2  log

  10  dc3  log
   9  dc3  router
  48  dc3  storage

The AFTER report:

FDB PROCESS Breakdown (Count) by DC and ROLE

 CNT   DC-Role
----  -------------
   1  dc1  cluster_controller
   1  dc1  data_distributor
   1  dc1  master
   1  dc1  ratekeeper

   3  dc2  coordinator

   3  dc3  coordinator

The above reports were produced by a script from the status json outputs.
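
For reference, a minimal sketch of what that kind of script does, assuming the status json output was saved to a file first (the file name is just an example); it counts processes per DC and role from cluster.processes:

    import json
    import sys
    from collections import Counter

    # Minimal sketch of the breakdown script: count processes per (DC, role)
    # from a saved `fdbcli --exec "status json"` output, e.g.
    #   python breakdown.py status-before.json
    with open(sys.argv[1]) as f:
        status = json.load(f)

    counts = Counter()
    for proc in status["cluster"]["processes"].values():
        dc = proc.get("locality", {}).get("dcid", "unknown")
        for role in proc.get("roles", []):
            counts[(dc, role["role"])] += 1

    print(" CNT   DC-Role")
    print("----  -------------")
    for (dc, role), cnt in sorted(counts.items()):
        print(f"{cnt:4}  {dc}  {role}")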

When I looked into the status json files, the before-json had a configuration section:

        "configuration" : {
            "coordinators_count" : 9,
            "excluded_servers" : [
            ],
            "log_spill" : 2,
            "logs" : 30,
            "proxies" : 2,
            "redundancy_mode" : "triple",
            "regions" : [
                {
                    "datacenters" : [
                        {
                            "id" : "dc1",
                            "priority" : 2
                        },
                        {
                            "id" : "dc2",
                            "priority" : 0,
                            "satellite" : 1
                        }
                    ],
                    "satellite_logs" : 10,
                    "satellite_redundancy_mode" : "one_satellite_double"
                },
                {
                    "datacenters" : [
                        {
                            "id" : "dc3",
                            "priority" : 1
                        }
                    ]
                }
            ],
            "storage_engine" : "ssd-2",
            "usable_regions" : 2
        },

However, the entire configuration section is gone in the after-json.
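
A quick way to confirm this, assuming the two outputs were saved as status-before.json and status-after.json (the names are mine):

    import json

    # Check whether the "configuration" section exists in each saved
    # status json file (file names are placeholders).
    for name in ("status-before.json", "status-after.json"):
        with open(name) as f:
            cluster = json.load(f)["cluster"]
        print(name, "has configuration:", "configuration" in cluster)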

This is unexpected. It seems to me that DR conflicts with the destination cluster’s 2-region/3-DC architecture.

@osamarin Do you have any idea? Thanks.