Strange error of resolving hostnames

szy0127 · February 13, 2023, 5:14am

I'm using foundationdb-7.1.26. I found that if one hostname in the cluster file cannot be resolved, every other hostname in the cluster file also fails to resolve, even though those hostnames are correct.
For example, if we write correct_hostname:4500,wrong_hostname:4500 in the cluster file, where correct_hostname can be resolved by DNS and wrong_hostname cannot, fdb reports that correct_hostname:4500 is unreachable. But if we replace wrong_hostname with an arbitrary IP address that does not exist, such as 1.2.3.4, correct_hostname:4500 becomes reachable.
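
To rule out the name service itself, each coordinator name can be checked on its own with the blocking POSIX resolver, outside of fdb. This is a minimal sketch, assuming a POSIX system; resolveError is a hypothetical helper and not part of FoundationDB.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <cstring>
#include <string>

// Hypothetical helper (not FoundationDB code): resolve one host:port pair
// with the blocking POSIX resolver, independently of any other lookup.
// Returns an empty string on success, otherwise the resolver's error text.
std::string resolveError(const std::string& host, const std::string& port) {
    addrinfo hints;
    std::memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     // accept IPv4 or IPv6 results
    hints.ai_socktype = SOCK_STREAM; // coordinators are reached over TCP

    addrinfo* results = nullptr;
    const int rc = getaddrinfo(host.c_str(), port.c_str(), &hints, &results);
    if (rc != 0) {
        return gai_strerror(rc); // human-readable reason for the failure
    }
    freeaddrinfo(results); // only release the list when the call succeeded
    return std::string();
}
```

Calling resolveError("correct_hostname", "4500") and resolveError("wrong_hostname", "4500") one after the other (both names are the placeholders used above) should show whether the first name resolves when nothing else is being looked up at the same time.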
Digging into the code, I found that correct_hostname can be resolved correctly on its own, but as soon as resolving wrong_hostname throws an error, resolving correct_hostname also throws an error, which is ‘Asynchronous operation cancelled’.
It seems that
Check in the new Hostname logic. by RenxuanW · Pull Request #6926 · apple/foundationdb (github.com)
has solved the availability problem caused by hostnames, but I still hit this strange problem in 7.1.26.
I've tried compiling from source and printing some information to see what happens.

[root@a3325e0c60c0 bin]# ./fdbcli --exec "status details"
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
DatabaseContext* db;
return database;
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
DatabaseContext* db;
return database;
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
monitorLeaderOneGeneration monitorNominee
monitorLeaderOneGeneration monitorNominee
StatusClient::statusFetcher
statusFetcherImpl
clientStatusFetcher
ClusterConnectionString
coord:172.17.0.2:4500
coord:1.2.3.4:4500
Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  172.17.0.2:4500  (reachable)
  1.2.3.4:4500  (unreachable)

A wrong hostname makes the correct hostname unresolvable:

[root@a3325e0c60c0 bin]# ./fdbcli --exec "status details"
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
tryGetReplyFromHostname 2,hostname:a3325e0c60c0:4500
resolveTCPEndpoint_impl hostname:a3325e0c60c0
DatabaseContext* db;
return database;
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
host a3325e0c60c0 start resolve
host a3325e0c60c0 start iter
resolveTCPEndpoint_impl addr:172.17.0.2
host a3325e0c60c0 finish iter empty:0
resolve hostname:a3325e0c60c0:4500->172.17.0.2:4500(fromHostname)
createDatabase 1
createDatabase 2
createDatabase 3
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
tryGetReplyFromHostname 2,hostname:test:4500
resolveTCPEndpoint_impl hostname:test
DatabaseContext* db;
return database;
monitorLeaderOneGeneration monitorNominee
monitorNominee hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:a3325e0c60c0
monitorLeaderOneGeneration monitorNominee
monitorNominee hostname:test:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:test
StatusClient::statusFetcher
statusFetcherImpl
clientStatusFetcher
clientCoordinatorsStatusFetcher 1 hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:a3325e0c60c0
clientCoordinatorsStatusFetcher 1 hostname:test:4500
retryGetReplyFromHostname 2
resolveTCPEndpoint_impl hostname:test
clientCoordinatorsStatusFetcher 2 hostname:a3325e0c60c0:4500
retryGetReplyFromHostname 1
resolveTCPEndpoint_impl hostname:a3325e0c60c0
clientCoordinatorsStatusFetcher 2 hostname:test:4500
retryGetReplyFromHostname 1
resolveTCPEndpoint_impl hostname:test
wait result for hostname a3325e0c60c0 error,Asynchronous operation cancelled
cannot resolve hostname:a3325e0c60c0:4500
wait result for hostname test error,Asynchronous operation cancelled
cannot resolve hostname:test:4500
wait result for hostname a3325e0c60c0 error,Asynchronous operation cancelled
cannot resolve hostname:a3325e0c60c0:4500
wait result for hostname test error,Asynchronous operation cancelled
cannot resolve hostname:test:4500
ClusterConnectionString
hostname:a3325e0c60c0:4500
hostname:test:4500
Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  a3325e0c60c0:4500  (unreachable)
  test:4500  (unreachable)

Update:
The TCP resolver reports the error “Host not found (authoritative)” for the wrong hostname and "Operation canceled" for the correct hostname.
Maybe the cause is that every actor uses the same io service and resolves concurrently, so the actor resolving the wrong hostname cancels the one resolving the correct hostname?

state tcp::resolver tcpResolver(self->reactor.ios);
...
tcpResolver.cancel();
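
The hypothesis above can be illustrated with a toy model. This is only a sketch of the suspected mechanism, written with the standard library: ToyResolveService, cancelAll and runScenario are hypothetical names, and nothing here is Boost.Asio or FoundationDB code or a statement about how either actually behaves. The model simply assumes that a cancel issued after one failed lookup reaches every lookup still pending on the shared service.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: NOT Boost.Asio and NOT FoundationDB code.
struct ToyResolveService {
    using Callback = std::function<void(const std::string& host, const std::string& error)>;
    std::vector<std::pair<std::string, Callback>> pending;

    // Queue a lookup; the callback later receives an empty error on success.
    void asyncResolve(const std::string& host, Callback cb) {
        pending.emplace_back(host, std::move(cb));
    }

    // Simulate the reply for a single lookup. Returns false if it is not pending.
    bool complete(const std::string& host, const std::string& error) {
        for (auto it = pending.begin(); it != pending.end(); ++it) {
            if (it->first == host) {
                Callback cb = std::move(it->second);
                pending.erase(it);
                cb(host, error);
                return true;
            }
        }
        return false;
    }

    // ASSUMPTION under test: cancel hits every lookup pending on the service.
    void cancelAll() {
        auto ops = std::move(pending);
        pending.clear();
        for (auto& op : ops) {
            op.second(op.first, "Operation canceled");
        }
    }
};

// Two concurrent lookups; the wrong one fails first and then cancels.
std::map<std::string, std::string> runScenario() {
    ToyResolveService service;
    std::map<std::string, std::string> outcome;
    auto record = [&outcome](const std::string& host, const std::string& error) {
        outcome[host] = error.empty() ? "resolved" : error;
    };
    service.asyncResolve("correct_hostname", record);
    service.asyncResolve("wrong_hostname", record);
    service.complete("wrong_hostname", "Host not found (authoritative)");
    service.cancelAll(); // cleanup after the failure also reaches correct_hostname
    return outcome;
}
```

In this model runScenario() records "Host not found (authoritative)" for wrong_hostname and "Operation canceled" for correct_hostname, which is the same pair of errors as in the update. That only shows the hypothesis is consistent with the symptoms, not that it is the real cause.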

Using resolveBlocking instead solves the problem:

ACTOR Future<NetworkAddress> resolveWithRetryImpl(Hostname* self) {
	state double resolveInterval = FLOW_KNOBS->HOSTNAME_RESOLVE_INIT_INTERVAL;
	loop {
		try {
			Optional<NetworkAddress> address = self->resolveBlocking();

ACTOR template <class Req>
Future<ErrorOr<REPLY_TYPE(Req)>> tryGetReplyFromHostname(Req request,
                                                         Hostname hostname,
                                                         WellKnownEndpoints token,
                                                         TaskPriority taskID) {
	// A wrapper of tryGetReply(request), except that the request is sent to an address resolved from a hostname.
	// If resolving fails, return lookup_failed().
	// Otherwise, return tryGetReply(request).
	Optional<NetworkAddress> address = hostname.resolveBlocking();

szy0127 · February 15, 2023, 5:35am

Using resolveBlocking avoids the problem:

diff --git a/foundationdb-7.1.26/fdbrpc/genericactors.actor.h b/foundationdb-7.1.26/fdbrpc/genericactors.actor.h
index 80f51d3..f2ccbbf 100644
--- a/foundationdb-7.1.26/fdbrpc/genericactors.actor.h
+++ b/foundationdb-7.1.26/fdbrpc/genericactors.actor.h
@@ -112,7 +112,7 @@ Future<ErrorOr<REPLY_TYPE(Req)>> tryGetReplyFromHostname(Req request,
        // A wrapper of tryGetReply(request), except that the request is sent to an address resolved from a hostname.
        // If resolving fails, return lookup_failed().
        // Otherwise, return tryGetReply(request).
-       Optional<NetworkAddress> address = wait(hostname.resolve());
+       Optional<NetworkAddress> address = hostname.resolveBlocking();
        if (!address.present()) {
                return ErrorOr<REPLY_TYPE(Req)>(lookup_failed());
        }
diff --git a/foundationdb-7.1.26/flow/Hostname.actor.cpp b/foundationdb-7.1.26/flow/Hostname.actor.cpp
index 84a3cc7..fa1e420 100644
--- a/foundationdb-7.1.26/flow/Hostname.actor.cpp
+++ b/foundationdb-7.1.26/flow/Hostname.actor.cpp
@@ -59,7 +59,7 @@ ACTOR Future<NetworkAddress> resolveWithRetryImpl(Hostname* self) {
        state double resolveInterval = FLOW_KNOBS->HOSTNAME_RESOLVE_INIT_INTERVAL;
        loop {
                try {
-                       Optional<NetworkAddress> address = wait(resolveImpl(self));
+                       Optional<NetworkAddress> address = self->resolveBlocking();
                        if (address.present()) {
                                return address.get();
                        }

Szy0127/foundationdb: FoundationDB - the open source, distributed, transactional key-value store (github.com)

szy0127 · February 16, 2023, 3:08pm

I found that even when running the same code in the same environment, provided by the same Docker image, some machines can reproduce this bug while others cannot.
