Hey FDB community:
Here is an issue I run into when I try to configure or change the coordinators in an FDB (5.2) cluster.
Environment
I have two physical machines running 8 processes in total, deployed as below:
ip            port    cpu%  mem%  iops  net  class        roles
------------  ------  ----  ----  ----  ---  -----------  --------------------
10.0.1.101    4500    1     9     -     0    transaction
              4501    0     9     -     0    transaction
              4502    0     6     -     0    stateless
              4503    1     6     -     0    stateless    proxy,resolver
------------  ------  ----  ----  ----  ---  -----------  --------------------
10.0.1.102    4500    1     8     25    0    transaction  log
              4501    1     9     25    0    transaction  log,storage
              4502    1     5     -     0    stateless    master
              4503    2     5     -     1    stateless    cluster_controller
------------  ------  ----  ----  ----  ---  -----------  --------------------
When I try to change the cluster coordinators with the command
fdbcli --log --exec 'coordinators 10.0.1.101:4500 10.0.1.102:4501'
there is always an EndpointNotFound event in the trace:
<Event Severity="10" Time="1542789506.217702" Type="AttemptingQuorumChange" Machine="10.0.1.101:16725" ID="0000000000000000" FromCS="r8qDWv3pEGC4Fu:ErmTJaj5WecEpU1dNuLMrRlvXnK2ft6t@10.0.1.101:4501,10.0.1.101:4502" ToCS="r8qDWv3pEGC4Fu:lnBGOLJ1DD6ydcnPEACtF0lArLcBv1w2@10.0.1.101:4500,10.0.1.101:4501" logGroup="default"/>
<Event Severity="10" Time="1542789506.217702" Type="CodeCoverage" Machine="10.0.1.101:16725" ID="0000000000000000" File="fdbclient/ManagementAPI.actor.cpp" Line="686" Condition="old.clusterKeyName() == conn.clusterKeyName()" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4501" Nominee="0000000000000000" Generation="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000000" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderForwarding" Machine="10.0.1.101:16725" ID="0000000000000000" NewConnStr="r8qDWv3pEGC4Fu:lnBGOLJ1DD6ydcnPEACtF0lArLcBv1w2@10.0.1.101:4500,10.0.1.101:4501" OldConnStr="r8qDWv3pEGC4Fu:ErmTJaj5WecEpU1dNuLMrRlvXnK2ft6t@10.0.1.101:4501,10.0.1.101:4502" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="EndpointNotFound" Machine="10.0.1.101:16725" ID="0000000000000000" Address="10.0.1.101:4503" Token="0197121895a9e885" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="CodeCoverage" Machine="10.0.1.101:16725" ID="0000000000000000" File="fdbclient/NativeAPI.actor.cpp" Line="2441" Condition="true" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="CommitDummyTransaction" Machine="10.0.1.101:16725" ID="0000000000000000" Key="\xff/coordinators" Retries="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.229313" Type="AllAlternativesFailed" Machine="10.0.1.101:16725" ID="0000000000000000" Interval="1" ServersValidTime="1e+99" Alternatives="2 0197121895a9e885" Delay="1e+99" logGroup="default"/>
<Event Severity="10" Time="1542789506.268223" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4500" Nominee="1a8613a19fa6ba4d" Generation="1" logGroup="default"/>
<Event Severity="10" Time="1542789506.268223" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4501" Nominee="1a8613a19fa6ba4d" Generation="1" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="1a8613a19fa6ba4d" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="ClientInfo_CCInterfaceChange" Machine="10.0.1.101:16725" ID="0000000000000000" CCID="692856b61a6df70d" logGroup="default"/>
<Event Severity="10" Time="1542789506.272561" Type="ClientInfoChange" Machine="10.0.1.101:16725" ID="0000000000000000" ChangeID="0000000000000000" logGroup="default"/>
<Event Severity="10" Time="1542789506.408692" Type="ClientInfoChange" Machine="10.0.1.101:16725" ID="0000000000000000" ChangeID="95f09dd6ac1d4e2d" logGroup="default"/>
<Event Severity="10" Time="1542789506.413940" Type="RetryQuorumChange" Machine="10.0.1.101:16725" ID="0000000000000000" Error="commit_unknown_result" ErrorDescription="Transaction may or may not have committed" ErrorCode="1021" Retries="0" logGroup="default"/>
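In case it helps anyone reproduce this, here is a minimal sketch of how the interesting events can be filtered out of the fdbcli trace. It assumes the XML trace format shown above, with one `<Event .../>` element per line. The helper name `events_of_type` and the inline `SAMPLE` list (three lines copied from the trace above) are only for illustration; in practice the lines would come from the trace file that `--log` writes.

```python
import xml.etree.ElementTree as ET


def events_of_type(lines, wanted):
    """Yield the attributes of trace events whose Type is in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event "):
            continue  # skip wrapper tags and blank lines
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # skip truncated or malformed lines
        if attrs.get("Type") in wanted:
            yield dict(attrs)


# Three events copied from the trace above (hypothetical inline sample).
SAMPLE = [
    '<Event Severity="10" Time="1542789506.223479" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4501" Nominee="0000000000000000" Generation="0" logGroup="default"/>',
    '<Event Severity="10" Time="1542789506.229048" Type="EndpointNotFound" Machine="10.0.1.101:16725" ID="0000000000000000" Address="10.0.1.101:4503" Token="0197121895a9e885" SuppressedEventCount="0" logGroup="default"/>',
    '<Event Severity="10" Time="1542789506.413940" Type="RetryQuorumChange" Machine="10.0.1.101:16725" ID="0000000000000000" Error="commit_unknown_result" ErrorDescription="Transaction may or may not have committed" ErrorCode="1021" Retries="0" logGroup="default"/>',
]

if __name__ == "__main__":
    for ev in events_of_type(SAMPLE, {"EndpointNotFound", "RetryQuorumChange"}):
        # Address is present on EndpointNotFound, Error on RetryQuorumChange.
        print(ev["Time"], ev["Type"], ev.get("Address") or ev.get("Error"))
```

To run it against a real trace, replace `SAMPLE` with an open file object; the function only iterates over lines.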
It happens every time, yet the operation still succeeds. Is this a bug? BTW, how does an FDB cluster broadcast the new coordinator information to every node in the cluster? Is there any technical documentation of this process, or of the coordinators in general? The architecture docs say little beyond the fact that the coordinators use Paxos; for example, what exactly do they store? Thanks!