Unable to change coordinators

We are slowly migrating our fleet from 3.x to 5.x, and I have run into an issue when trying to assign new coordinators. The error message is:

ERROR: One of the specified coordinators is unreachable (which isn’t very helpful, as it does not say which one)

However, the coordinators I am assigning are reachable; in fact, the cluster recognizes them as healthy members of the current cluster.

Has anyone run into this before, or does anyone know which trace files I should start with to find out why the coordination state is not changing?

I have checked the master, cluster_controller, and coordinators during the requested change, and nothing stands out that hints at why the coordinators will not update.

Here is some output from fdbcli:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-1
  Coordinators           - 3
  Exclusions             - 7 (type `exclude' for details)

Cluster:
  FoundationDB processes - 13 (less 7 excluded; 0 with errors)
  Machines               - 10 (less 7 excluded)
  Memory availability    - 16.4 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 1 machine
  Server time            - 01/15/19 22:56:56

Data:
  Replication health     - Healthy (Removing storage server)
  Moving data            - 201.377 GB
  Sum of key-value sizes - 245.354 GB
  Disk space used        - 852.228 GB

Operating space:
  Storage server         - 382.4 GB free on most full server
  Log server             - 933.0 GB free on most full server

Workload:
  Read rate              - 781 Hz
  Write rate             - 312 Hz
  Transactions started   - 97 Hz
  Transactions committed - 18 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.6.0.6:4500          ( 13% cpu; 11% machine; 0.035 Gbps; 20% disk IO; 2.9 GB / 17.1 GB RAM  )
  10.6.0.7:4500          ( 39% cpu; 15% machine; 0.124 Gbps; 21% disk IO; 2.8 GB / 21.4 GB RAM  )
  10.6.0.7:4501          ( 39% cpu; 15% machine; 0.124 Gbps; 21% disk IO; 2.8 GB / 21.4 GB RAM  )
  10.6.0.22:4500         ( 15% cpu;  8% machine; 0.034 Gbps; 29% disk IO; 3.0 GB / 16.4 GB RAM  )
  10.6.0.26:4500         ( 39% cpu; 18% machine; 0.152 Gbps; 25% disk IO; 2.8 GB / 21.5 GB RAM  )
  10.6.0.26:4501         ( 53% cpu; 18% machine; 0.152 Gbps; 25% disk IO; 2.8 GB / 21.5 GB RAM  )
  10.6.0.30:4500         ( 13% cpu; 10% machine; 0.044 Gbps; 16% disk IO; 3.0 GB / 16.8 GB RAM  )
  10.6.0.37:4500         ( 17% cpu; 11% machine; 0.042 Gbps; 27% disk IO; 3.0 GB / 17.0 GB RAM  )
  10.6.0.40:4500         ( 15% cpu; 10% machine; 0.029 Gbps; 41% disk IO; 3.0 GB / 16.6 GB RAM  )
  10.6.0.41:4500         ( 44% cpu; 19% machine; 0.152 Gbps; 26% disk IO; 2.8 GB / 21.2 GB RAM  )
  10.6.0.41:4501         ( 55% cpu; 19% machine; 0.152 Gbps; 26% disk IO; 2.8 GB / 21.2 GB RAM  )
  10.6.0.46:4500         ( 40% cpu; 17% machine; 0.142 Gbps; 64% disk IO; 2.9 GB / 16.7 GB RAM  )
  10.6.0.58:4500         ( 16% cpu;  9% machine; 0.038 Gbps; 45% disk IO; 2.8 GB / 16.6 GB RAM  )

Coordination servers:
  10.6.0.22:4500  (reachable)
  10.6.0.37:4500  (reachable)
  10.6.0.40:4500  (reachable)

Client time: 01/15/19 22:56:56

fdb> coordinators 10.6.0.7:4500 10.6.0.26:4500 10.6.0.41:4500

WARNING: Long delay (Ctrl-C to interrupt)
ERROR: One of the specified coordinators is unreachable
fdb>
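Since the error does not name the failing address, one way to rule out plain network problems is to probe each candidate separately. Below is a minimal sketch in Python, assuming a bare TCP connect from the host running fdbcli is a useful first check. The helper name `check_coordinators` is hypothetical and is not part of FoundationDB. A successful connect only shows that the port accepts connections; it does not prove the process would be accepted as a coordinator.

```python
import socket


def check_coordinators(addresses, timeout=2.0):
    """Try a plain TCP connect to each "ip:port" string.

    Returns a dict mapping each address to True (connect succeeded)
    or False (refused, timed out, or otherwise failed).
    Hypothetical helper: this does not speak the FoundationDB protocol.
    """
    results = {}
    for addr in addresses:
        # Split on the last colon so the port is always the final field.
        host, _, port = addr.rpartition(":")
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                results[addr] = True
        except OSError:
            # Covers connection refused, unreachable network, and timeouts.
            results[addr] = False
    return results
```

Calling `check_coordinators(["10.6.0.7:4500", "10.6.0.26:4500", "10.6.0.41:4500"])` with the addresses from the failing command returns one result per address, which at least narrows down whether any of them fails at the network level.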

Internal issue to us (meaning my code =p). Nothing to see here.


You’re welcome to file an issue for this if you’d like to see it change. :)

@alexmiller

I did some digging into our custom 5.x branch and the current release-5.2 branch on GitHub, and found that with a fresh checkout of release-5.2 I am unable to change my coordinators.

I tracked it down to this commit: https://github.com/apple/foundationdb/commit/b8486d4a2ae1fbffac58f11dbaec272a5d24d92f

Once I remove that commit's changes to the coordinator election process, I am again able to change my coordinators.

Another note: the release-5.2 branch is running an unreleased 5.2.8-PRERELEASE version, whereas the releases page lists 5.2.6.


Ah, found it: the issue was fixed in the 6.x branch (and never ported to the 5.x branch). We should be able to proceed now.