I'm using FoundationDB 7.1.26. I found that if one hostname in the cluster file cannot be resolved, every other hostname in the cluster file also fails to resolve, even though those hostnames are correct.
For example, if the cluster file contains correct_hostname:4500,wrong_hostname:4500, where correct_hostname can be resolved by DNS and wrong_hostname cannot, fdb reports that correct_hostname:4500 is unreachable. But if we replace wrong_hostname with an arbitrary IP address that does not exist, such as 1.2.3.4, correct_hostname:4500 becomes reachable.
Digging into the code, I found that correct_hostname can be resolved correctly on its own, but as soon as resolving wrong_hostname throws an error, resolving correct_hostname also throws an error, namely ‘Asynchronous operation cancelled’.
It seems that the PR
Check in the new Hostname logic. by RenxuanW · Pull Request #6926 · apple/foundationdb (github.com)
has solved the availability problems caused by hostnames, but I still hit this strange problem in 7.1.26.
I tried compiling from source and printing some debug information to see what happens, first with the nonexistent IP 1.2.3.4 in the cluster file:
[root@a3325e0c60c0 bin]# ./fdbcli --exec "status details"
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
DatabaseContext* db;
return database;
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
DatabaseContext* db;
return database;
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
monitorLeaderOneGeneration monitorNominee
monitorLeaderOneGeneration monitorNominee
StatusClient::statusFetcher
statusFetcherImpl
clientStatusFetcher
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
Using cluster file `/etc/foundationdb/fdb.cluster'.
Could not communicate with a quorum of coordination servers:
172.17.0.2:4500 (reachable)
1.2.3.4:4500 (unreachable)
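A side note on reading the output above: even in the IP case the client says it cannot reach a quorum, because with two coordinators a majority means both of them. The difference that matters here is that 172.17.0.2:4500 is still marked reachable. A minimal sketch of the majority arithmetic, assuming a plain strict-majority rule; majorityQuorum and hasQuorum are illustrative names, not FoundationDB functions:

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of coordinators that forms a strict majority.
// Hypothetical helper for illustration; not a FoundationDB function.
std::size_t majorityQuorum(std::size_t coordinators) {
    assert(coordinators > 0);
    return coordinators / 2 + 1;
}

// True if the reachable coordinators are enough to form a majority.
bool hasQuorum(std::size_t reachable, std::size_t coordinators) {
    return reachable >= majorityQuorum(coordinators);
}
```

So one reachable coordinator out of two is not a quorum, while two out of three would be.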
With hostnames, the wrong hostname makes the correct hostname unresolvable:
[root@a3325e0c60c0 bin]# ./fdbcli --exec "status details"
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
tryGetReplyFromHostname 2,hostname:a3325e0c60c0:4500
resolveTCPEndpoint_impl hostname:a3325e0c60c0
DatabaseContext* db;
return database;
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
host a3325e0c60c0 start resolve
host a3325e0c60c0 start iter
resolveTCPEndpoint_impl addr:172.17.0.2
host a3325e0c60c0 finish iter empty:0
resolve hostname:a3325e0c60c0:4500->172.17.0.2:4500(fromHostname)
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
tryGetReplyFromHostname 2,hostname:test:4500
resolveTCPEndpoint_impl hostname:test
DatabaseContext* db;
return database;
monitorLeaderOneGeneration monitorNominee
monitorNominee hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:a3325e0c60c0
monitorLeaderOneGeneration monitorNominee
monitorNominee hostname:test:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:test
StatusClient::statusFetcher
statusFetcherImpl
clientStatusFetcher
clientCoordinatorsStatusFetcher 1 hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:a3325e0c60c0
clientCoordinatorsStatusFetcher 1 hostname:test:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:test
clientCoordinatorsStatusFetcher 2 hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 1
resolveTCPEndpoint_impl hostname:a3325e0c60c0
clientCoordinatorsStatusFetcher 2 hostname:test:4500
retryGetReplyFromHostname 1
resolveTCPEndpoint_impl hostname:test
wait result for hostname a3325e0c60c0 error,Asynchronous operation cancelled
cannot resolve hostname:a3325e0c60c0:4500
wait result for hostname test error,Asynchronous operation cancelled
cannot resolve hostname:test:4500
wait result for hostname a3325e0c60c0 error,Asynchronous operation cancelled
cannot resolve hostname:a3325e0c60c0:4500
wait result for hostname test error,Asynchronous operation cancelled
cannot resolve hostname:test:4500
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
Using cluster file `/etc/foundationdb/fdb.cluster'.
Could not communicate with a quorum of coordination servers:
a3325e0c60c0:4500 (unreachable)
test:4500 (unreachable)
Update:
The TCP resolver reports the error “Host not found (authoritative)” for the wrong hostname and “Operation canceled” for the correct hostname.
Maybe the cause is that every actor uses the same io service, so the lookups run concurrently and the thread resolving the wrong hostname cancels the thread resolving the correct hostname?
state tcp::resolver tcpResolver(self->reactor.ios);
...
tcpResolver.cancel();
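To make the hypothesis concrete, here is a toy model in plain C++. It assumes, without proof, that a failed lookup cancels every other lookup still pending in the same scope; Lookup, cancelScope and simulate are made-up names, and none of this is Boost.Asio or Net2 code:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of the hypothesis above. It is NOT Boost.Asio or Net2 code;
// every name here is hypothetical.
struct Lookup {
    std::string host;
    bool resolvable;  // whether DNS would return an address for this host
    int cancelScope;  // lookups with the same value are cancelled together
};

// Completes the lookups in order. When one fails, its caller cancels
// everything still pending in the same cancel scope.
std::map<std::string, std::string> simulate(const std::vector<Lookup>& lookups) {
    std::map<std::string, std::string> outcome;
    std::vector<bool> cancelled(lookups.size(), false);
    for (std::size_t i = 0; i < lookups.size(); ++i) {
        const Lookup& current = lookups[i];
        if (cancelled[i]) {
            outcome[current.host] = "Operation canceled";
        } else if (current.resolvable) {
            outcome[current.host] = "resolved";
        } else {
            outcome[current.host] = "Host not found";
            // The failing caller cancels later lookups in its own scope.
            for (std::size_t j = i + 1; j < lookups.size(); ++j) {
                if (lookups[j].cancelScope == current.cancelScope) {
                    cancelled[j] = true;
                }
            }
        }
    }
    return outcome;
}
```

With both lookups in the same scope, correct_hostname ends up as "Operation canceled"; with separate scopes it resolves. Whether the real resolver behaves like the shared case is exactly what I have not confirmed.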
Using resolveBlocking instead can solve the problem.
ACTOR Future<NetworkAddress> resolveWithRetryImpl(Hostname* self) {
    state double resolveInterval = FLOW_KNOBS->HOSTNAME_RESOLVE_INIT_INTERVAL;
    loop {
        try {
            Optional<NetworkAddress> address = self->resolveBlocking();
            ...

ACTOR template <class Req>
Future<ErrorOr<REPLY_TYPE(Req)>> tryGetReplyFromHostname(Req request,
                                                         Hostname hostname,
                                                         WellKnownEndpoints token,
                                                         TaskPriority taskID) {
    // A wrapper of tryGetReply(request), except that the request is sent to an address resolved from a hostname.
    // If resolving fails, return lookup_failed().
    // Otherwise, return tryGetReply(request).
    Optional<NetworkAddress> address = hostname.resolveBlocking();
    ...
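For comparison, a blocking lookup returns its own result before the next one starts, so a failing name has nothing pending to cancel. A minimal sketch using POSIX getaddrinfo rather than FDB's own resolveBlocking; resolveBlockingSketch is a hypothetical helper and only handles IPv4:

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Blocking IPv4 lookup via POSIX getaddrinfo. Returns true and fills `ip`
// with the first address on success; returns false on any resolver error.
// Hypothetical helper: it stands in for the idea behind resolveBlocking(),
// not for its actual implementation.
bool resolveBlockingSketch(const std::string& host, const std::string& port, std::string& ip) {
    addrinfo hints{};
    hints.ai_family = AF_INET;        // IPv4 only, to keep the sketch short
    hints.ai_socktype = SOCK_STREAM;  // TCP

    addrinfo* results = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &results) != 0 || results == nullptr) {
        return false;  // this failure affects only this call
    }

    char buf[INET_ADDRSTRLEN] = {0};
    const sockaddr_in* addr = reinterpret_cast<const sockaddr_in*>(results->ai_addr);
    const bool ok = inet_ntop(AF_INET, &addr->sin_addr, buf, sizeof(buf)) != nullptr;
    freeaddrinfo(results);
    if (ok) {
        ip = buf;
    }
    return ok;
}
```

Calling it once for correct_hostname and once for wrong_hostname gives two independent results, which matches what I see after switching to resolveBlocking.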