killertypo
(Mike McMahon)
January 15, 2019, 10:56pm
1
We are slowly migrating our fleet from 3.x to 5.x and I have run into an issue when trying to assign new coordinators. The error message is as follows:
ERROR: One of the specified coordinators is unreachable
(which isn’t incredibly helpful as it does not specify which one it is)
However, the coordinators I am assigning are totally reachable; in fact the cluster recognizes them as healthy members of the current acting cluster.
Has anyone run into this before, or does anyone know which trace files I should begin looking at to determine what is preventing the coordination state from changing?
I have checked the master, the cluster_controller, and the coordinators during the requested change, and nothing stands out that hints at why it will not update the coordinators.
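A minimal sketch of the kind of basic reachability check one can run from the client host (plain TCP connect only, nothing FoundationDB-specific; the addresses are the candidates tried in the fdbcli session below and the 3-second timeout is arbitrary):

import socket

# Candidate coordinator addresses (the same ones tried in the fdbcli session below).
candidates = ["10.6.0.7:4500", "10.6.0.26:4500", "10.6.0.41:4500"]

for addr in candidates:
    host, port = addr.rsplit(":", 1)
    try:
        # Plain TCP connect; this only proves the port is open from this host,
        # not that the coordinator handshake itself will succeed.
        with socket.create_connection((host, int(port)), timeout=3):
            print(addr, "reachable (TCP connect ok)")
    except OSError as exc:
        print(addr, "NOT reachable:", exc)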
killertypo
(Mike McMahon)
January 15, 2019, 10:57pm
2
Here is some output from fdbcli:
Configuration:
Redundancy mode - double
Storage engine - ssd-1
Coordinators - 3
Exclusions - 7 (type `exclude' for details)
Cluster:
FoundationDB processes - 13 (less 7 excluded; 0 with errors)
Machines - 10 (less 7 excluded)
Memory availability - 16.4 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - 1 machine
Server time - 01/15/19 22:56:56
Data:
Replication health - Healthy (Removing storage server)
Moving data - 201.377 GB
Sum of key-value sizes - 245.354 GB
Disk space used - 852.228 GB
Operating space:
Storage server - 382.4 GB free on most full server
Log server - 933.0 GB free on most full server
Workload:
Read rate - 781 Hz
Write rate - 312 Hz
Transactions started - 97 Hz
Transactions committed - 18 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.6.0.6:4500 ( 13% cpu; 11% machine; 0.035 Gbps; 20% disk IO; 2.9 GB / 17.1 GB RAM )
10.6.0.7:4500 ( 39% cpu; 15% machine; 0.124 Gbps; 21% disk IO; 2.8 GB / 21.4 GB RAM )
10.6.0.7:4501 ( 39% cpu; 15% machine; 0.124 Gbps; 21% disk IO; 2.8 GB / 21.4 GB RAM )
10.6.0.22:4500 ( 15% cpu; 8% machine; 0.034 Gbps; 29% disk IO; 3.0 GB / 16.4 GB RAM )
10.6.0.26:4500 ( 39% cpu; 18% machine; 0.152 Gbps; 25% disk IO; 2.8 GB / 21.5 GB RAM )
10.6.0.26:4501 ( 53% cpu; 18% machine; 0.152 Gbps; 25% disk IO; 2.8 GB / 21.5 GB RAM )
10.6.0.30:4500 ( 13% cpu; 10% machine; 0.044 Gbps; 16% disk IO; 3.0 GB / 16.8 GB RAM )
10.6.0.37:4500 ( 17% cpu; 11% machine; 0.042 Gbps; 27% disk IO; 3.0 GB / 17.0 GB RAM )
10.6.0.40:4500 ( 15% cpu; 10% machine; 0.029 Gbps; 41% disk IO; 3.0 GB / 16.6 GB RAM )
10.6.0.41:4500 ( 44% cpu; 19% machine; 0.152 Gbps; 26% disk IO; 2.8 GB / 21.2 GB RAM )
10.6.0.41:4501 ( 55% cpu; 19% machine; 0.152 Gbps; 26% disk IO; 2.8 GB / 21.2 GB RAM )
10.6.0.46:4500 ( 40% cpu; 17% machine; 0.142 Gbps; 64% disk IO; 2.9 GB / 16.7 GB RAM )
10.6.0.58:4500 ( 16% cpu; 9% machine; 0.038 Gbps; 45% disk IO; 2.8 GB / 16.6 GB RAM )
Coordination servers:
10.6.0.22:4500 (reachable)
10.6.0.37:4500 (reachable)
10.6.0.40:4500 (reachable)
Client time: 01/15/19 22:56:56
fdb> coordinators 10.6.0.7:4500 10.6.0.26:4500 10.6.0.41:4500
WARNING: Long delay (Ctrl-C to interrupt)
ERROR: One of the specified coordinators is unreachable
fdb>
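A rough sketch of how the same coordinators command can be issued non-interactively while debugging this (assumes fdbcli is on the PATH and the default cluster file location; the error string checked for is the one shown above):

import subprocess

new_coordinators = "10.6.0.7:4500 10.6.0.26:4500 10.6.0.41:4500"

# --exec runs a single fdbcli command and exits; add "-C /path/to/fdb.cluster"
# if the cluster file is not in the default location.
result = subprocess.run(
    ["fdbcli", "--exec", "coordinators " + new_coordinators],
    capture_output=True, text=True, timeout=120,
)

output = result.stdout + result.stderr
print(output)
if "ERROR: One of the specified coordinators is unreachable" in output:
    print("coordinator change was rejected; see output above")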
panghy
(Clement Pang)
January 16, 2019, 1:05am
3
Internal issue to us (meaning my code =p). Nothing to see here.
alexmiller
(Alex Miller)
January 16, 2019, 1:13am
4
You’re welcome to file an issue for this if you’d like to see it change.
killertypo
(Mike McMahon)
January 16, 2019, 9:42pm
5
@alexmiller
I did some digging into our custom 5.x branch and what is on the most current 5.2 release on GitHub, and found that with a fresh checkout of release-5.2 I am unable to change my coordinators.
I tracked it down to this commit: https://github.com/apple/foundationdb/commit/b8486d4a2ae1fbffac58f11dbaec272a5d24d92f
Once I remove the changes to the coordinator election process, I am again able to change my coordinators.
Another note: the release-5.2 branch is running an unreleased 5.2.8-PRERELEASE version, vs. 5.2.6 on the releases page.
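A rough sketch for confirming what version a running cluster actually reports, without depending on the exact layout of the status document (assumes fdbcli is on the PATH; "status json" is a standard fdbcli command, and the brace-trimming is there in case any warnings precede the JSON):

import json
import subprocess

# Pull machine-readable status and trim to the outermost braces before parsing.
raw = subprocess.run(
    ["fdbcli", "--exec", "status json"],
    capture_output=True, text=True, timeout=60,
).stdout
status = json.loads(raw[raw.find("{"):raw.rfind("}") + 1])

def find_versions(node, path=""):
    # Walk the whole document and report any field literally named "version",
    # rather than hard-coding JSON paths that may differ between releases.
    if isinstance(node, dict):
        for key, value in node.items():
            child = path + "." + key if path else key
            if key == "version" and isinstance(value, str):
                print(child, "=", value)
            else:
                find_versions(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            find_versions(item, "{}[{}]".format(path, i))

find_versions(status)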
killertypo
(Mike McMahon)
January 16, 2019, 10:11pm
6
Ah, found it: the issue was fixed in the 6.x branch (never ported back to the 5.x branch), so we should be able to proceed now.