FoundationDB

EndpointNotFound in trace when configuring coordinators


(Hu Sheng) #1

Hey FDB community:
Here is an issue I hit when I try to configure or change the coordinators in an FDB (5.2) cluster.

Environment
I have two physical machines running 8 processes in total, deployed as below:

ip            port    cpu%  mem%  iops  net  class          roles
------------  ------  ----  ----  ----  ---  -------------  --------------------
 10.0.1.101    4500    1     9     -     0    transaction
               4501    0     9     -     0    transaction
               4502    0     6     -     0    stateless
               4503    1     6     -     0    stateless      proxy,resolver
------------  ------  ----  ----  ----  ---  -------------  --------------------
 10.0.1.102    4500    1     8     25    0    transaction    log
               4501    1     9     25    0    transaction    log,storage
               4502    1     5     -     0    stateless      master
               4503    2     5     -     1    stateless      cluster_controller
------------  ------  ----  ----  ----  ---  -------------  --------------------

When I try to change the cluster coordinators via the command

 fdbcli --log --exec 'coordinators 10.0.1.101:4500 10.0.1.102:4501'

there is always an EndpointNotFound event in the trace:

<Event Severity="10" Time="1542789506.217702" Type="AttemptingQuorumChange" Machine="10.0.1.101:16725" ID="0000000000000000" FromCS="r8qDWv3pEGC4Fu:ErmTJaj5WecEpU1dNuLMrRlvXnK2ft6t@10.0.1.101:4501,10.0.1.101:4502" ToCS="r8qDWv3pEGC4Fu:lnBGOLJ1DD6ydcnPEACtF0lArLcBv1w2@10.0.1.101:4500,10.0.1.101:4501" logGroup="default"/>
<Event Severity="10" Time="1542789506.217702" Type="CodeCoverage" Machine="10.0.1.101:16725" ID="0000000000000000" File="fdbclient/ManagementAPI.actor.cpp" Line="686" Condition="old.clusterKeyName() == conn.clusterKeyName()" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4501" Nominee="0000000000000000" Generation="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000000" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderForwarding" Machine="10.0.1.101:16725" ID="0000000000000000" NewConnStr="r8qDWv3pEGC4Fu:lnBGOLJ1DD6ydcnPEACtF0lArLcBv1w2@10.0.1.101:4500,10.0.1.101:4501" OldConnStr="r8qDWv3pEGC4Fu:ErmTJaj5WecEpU1dNuLMrRlvXnK2ft6t@10.0.1.101:4501,10.0.1.101:4502" logGroup="default"/>
<Event Severity="10" Time="1542789506.223479" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="EndpointNotFound" Machine="10.0.1.101:16725" ID="0000000000000000" Address="10.0.1.101:4503" Token="0197121895a9e885" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="CodeCoverage" Machine="10.0.1.101:16725" ID="0000000000000000" File="fdbclient/NativeAPI.actor.cpp" Line="2441" Condition="true" logGroup="default"/>
<Event Severity="10" Time="1542789506.229048" Type="CommitDummyTransaction" Machine="10.0.1.101:16725" ID="0000000000000000" Key="\xff/coordinators" Retries="0" logGroup="default"/>
<Event Severity="10" Time="1542789506.229313" Type="AllAlternativesFailed" Machine="10.0.1.101:16725" ID="0000000000000000" Interval="1" ServersValidTime="1e+99" Alternatives="2 0197121895a9e885" Delay="1e+99" logGroup="default"/>
<Event Severity="10" Time="1542789506.268223" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4500" Nominee="1a8613a19fa6ba4d" Generation="1" logGroup="default"/>
<Event Severity="10" Time="1542789506.268223" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="0000000000000001" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="GetLeaderReply" Machine="10.0.1.101:16725" ID="0000000000000000" Coordinator="10.0.1.101:4501" Nominee="1a8613a19fa6ba4d" Generation="1" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="MonitorLeaderChange" Machine="10.0.1.101:16725" ID="0000000000000000" NewLeader="1a8613a19fa6ba4d" logGroup="default"/>
<Event Severity="10" Time="1542789506.268598" Type="ClientInfo_CCInterfaceChange" Machine="10.0.1.101:16725" ID="0000000000000000" CCID="692856b61a6df70d" logGroup="default"/>
<Event Severity="10" Time="1542789506.272561" Type="ClientInfoChange" Machine="10.0.1.101:16725" ID="0000000000000000" ChangeID="0000000000000000" logGroup="default"/>
<Event Severity="10" Time="1542789506.408692" Type="ClientInfoChange" Machine="10.0.1.101:16725" ID="0000000000000000" ChangeID="95f09dd6ac1d4e2d" logGroup="default"/>
<Event Severity="10" Time="1542789506.413940" Type="RetryQuorumChange" Machine="10.0.1.101:16725" ID="0000000000000000" Error="commit_unknown_result" ErrorDescription="Transaction may or may not have committed" ErrorCode="1021" Retries="0" logGroup="default"/>

It happens every time, yet the operation itself succeeds. Is this a bug? BTW, how does an FDB cluster broadcast the new coordinator information to every node of the cluster? Are there technical details about this process, or about the coordinators in general? The architecture description shows little beyond Paxos on the coordinators; for example, what exactly do they store? Thanks!
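As a side note for reading the trace above: the FromCS, ToCS, NewConnStr and OldConnStr values look like cluster-file connection strings of the form description:id@ip:port,ip:port. Below is a minimal sketch for splitting one into its parts so two of them can be compared. It assumes that layout only; ParsedConnStr and parseConnStr are hypothetical names of mine and not part of the FDB source.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type for illustration; this is NOT the FDB
// ClusterConnectionString class.
struct ParsedConnStr {
    std::string description;                // part before the first ':'
    std::string id;                         // part between ':' and '@'
    std::vector<std::string> coordinators;  // "ip:port" entries after '@'
};

// Splits "description:id@addr,addr,..." into its parts.
// Returns false if the string does not match that assumed layout.
bool parseConnStr(const std::string& s, ParsedConnStr& out) {
    const std::string::size_type at = s.find('@');
    if (at == std::string::npos) return false;
    const std::string::size_type colon = s.find(':');
    if (colon == std::string::npos || colon > at) return false;

    out.description = s.substr(0, colon);
    out.id = s.substr(colon + 1, at - colon - 1);
    out.coordinators.clear();

    // Everything after '@' is a comma-separated list of addresses.
    std::istringstream rest(s.substr(at + 1));
    std::string addr;
    while (std::getline(rest, addr, ',')) {
        if (!addr.empty()) out.coordinators.push_back(addr);
    }
    return !out.coordinators.empty();
}
```

Applied to the FromCS and ToCS values in the AttemptingQuorumChange event, this gives the same description (r8qDWv3pEGC4Fu) on both sides, while the id and the coordinator list differ.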


(Hu Sheng) #2

I tried to find the answer myself by reading the Flow-based source code. I found some of the pieces, but still can’t connect them together. Here is what I have so far:

  1. When fdbcli starts, a loop function called monitorLeaderOneGeneration asks each of the coordinators for the current cluster leader (cluster_controller); if the leader info has changed, it saves the changed connection string into the local cluster file.

  2. When the command coordinators <ip>:<port> <ip>:<port> is executed, the process eventually invokes the method ‘changeQuorum’, which assembles a new connection string and then asks those coordinators for the leader info:

    ClientCoordinators coord( Reference<ClusterConnectionFile>( new ClusterConnectionFile( conn ) ) );
    for( int i = 0; i < coord.clientLeaderServers.size(); i++ )
        leaderServers.push_back(
            retryBrokenPromise(
                coord.clientLeaderServers[i].getLeader,
                GetLeaderRequest( coord.clusterKey, UID() ),
                TaskCoordinationReply ) );

    choose {
        when( wait( waitForAll( leaderServers ) ) ) {}
        when( wait( delay(5.0) ) ) {
            return CoordinatorsResult::COORDINATOR_UNREACHABLE;
        }
    }

If all of them reply successfully, it commits the new value into the system record.
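To check my understanding of the reachability test in step 2, here is a rough sketch of the same pattern in standard C++ rather than Flow: send one request per coordinator, then either collect every reply or give up at a deadline. This is only an illustration under my own assumptions. fakeGetLeader is a hypothetical stand-in that sleeps instead of doing network I/O, checkCoordinatorsReachable and the SUCCESS value are names I made up, and std::async stands in for Flow actors.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>
#include <vector>

enum class CoordinatorsResult { SUCCESS, COORDINATOR_UNREACHABLE };

// Hypothetical stand-in for one getLeader request: it just sleeps for the
// given latency and then "replies". No real network I/O happens here.
bool fakeGetLeader(std::chrono::milliseconds latency) {
    std::this_thread::sleep_for(latency);
    return true;
}

// Sends one simulated request per coordinator and waits until either all
// of them have replied or the shared deadline passes (the snippet above
// uses 5 seconds for that deadline).
CoordinatorsResult checkCoordinatorsReachable(
    const std::vector<std::chrono::milliseconds>& latencies,
    std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);

    std::vector<std::future<bool>> replies;
    for (const auto& latency : latencies) {
        replies.push_back(
            std::async(std::launch::async, fakeGetLeader, latency));
    }

    // One deadline for the whole batch, not one timeout per request.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (auto& reply : replies) {
        if (reply.wait_until(deadline) != std::future_status::ready) {
            // Note: futures returned by std::async block in their destructor,
            // so this return still waits for the slow tasks to finish.
            return CoordinatorsResult::COORDINATOR_UNREACHABLE;
        }
    }
    return CoordinatorsResult::SUCCESS;
}
```

With latencies of 10 ms and 20 ms and a generous timeout, the sketch reports SUCCESS; if any simulated coordinator is slower than the timeout, it reports COORDINATOR_UNREACHABLE. Unlike the Flow version, the pending requests are not cancelled on timeout here.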

So from the code above, one key point is still missing: how does fdbcli inform the cluster to trigger a new leader election and deliver the new connection string to the cluster?