When one machine becomes inaccessible from the cluster, what impact does it have on the cluster?

hyejeong · December 13, 2023, 7:52am

Hi.

I have an FDB cluster of 6 machines in total, configured with three_data_hall. Each region has 2 machines.
I am aware that an FDB cluster can keep working even if one machine is down, and indeed it does.

But I found that the FDB cluster became unhealthy, with the following messages, after a sudden reboot of one machine.

$ fdbcli --exec "status json"
.....
      "database_status" : {
            "available" : false,
            "healthy" : false
        },
......
        "messages" : [
            {
                "description" : "Unable to locate the data distributor worker.",
                "name" : "unreachable_dataDistributor_worker"
            },
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],
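
For anyone who wants to check the same fields programmatically, here is a minimal Python sketch that pulls the availability flags and message names out of parsed `status json` output. The helper names (`iter_values`, `summarize_health`) are made up for illustration, and the `client` / `cluster` nesting in the sample is an assumption, since the excerpt above is truncated. To avoid depending on that nesting, the helper searches the whole document by key name.

```python
import json


def iter_values(node, key):
    """Yield every value stored under `key`, at any depth of the parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from iter_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)


def summarize_health(status):
    """Collect the availability flags and the names of all reported messages."""
    db = next(iter_values(status, "database_status"), {})
    names = [
        m["name"]
        for msgs in iter_values(status, "messages")
        if isinstance(msgs, list)
        for m in msgs
        if isinstance(m, dict) and "name" in m
    ]
    return {
        "available": db.get("available"),
        "healthy": db.get("healthy"),
        "message_names": names,
    }


# Sample built from the excerpt above; the outer nesting is an assumption.
sample = json.loads("""
{
  "client": {"database_status": {"available": false, "healthy": false}},
  "cluster": {
    "messages": [
      {"description": "Unable to locate the data distributor worker.",
       "name": "unreachable_dataDistributor_worker"},
      {"description": "Unable to read database configuration.",
       "name": "unreadable_configuration"}
    ]
  }
}
""")

print(summarize_health(sample))
```

In practice the input could come from `fdbcli --exec "status json"` piped into `json.load(sys.stdin)` instead of the inline sample.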

Later I found out that firewalld on that machine was blocking all incoming packets, while outgoing packets seemed fine. So I stopped the FDB processes on the machine, and then the cluster status became okay.

So it seems that shutting down a machine or stopping its FDB processes is fine, but being only half connected (incoming packets blocked, outgoing packets passed) has an impact on cluster health.

When I stop the FDB processes on the host, only 5 machines are visible in the output of fdbcli --exec "status json".
But while incoming packets were blocked by firewalld, I could see 6 machines in the output of fdbcli --exec "status json". Some roles were even assigned to the unreachable machine, and I found that "No route to host" errors occurred on the other 5 machines.
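
To spot this half-connected state from another host, one option is to list which addresses hold roles according to `status json` and then try a plain TCP connection to each of them. The sketch below is only illustrative: the function names are hypothetical, the sample addresses are invented, and it assumes the status document exposes processes under `cluster.processes` with `address` and `roles` fields and that addresses are IPv4 `host:port`, optionally with a `:tls` suffix. A successful TCP connect does not prove the process is healthy; a failed one only shows that this host cannot reach it.

```python
import socket


def roles_by_address(status):
    """Map each process address to the sorted list of roles it reports.

    Assumes a `cluster.processes` map whose entries carry `address` and
    `roles` (a list of objects with a `role` field).
    """
    processes = status.get("cluster", {}).get("processes", {})
    result = {}
    for proc in processes.values():
        address = proc.get("address", "unknown")
        result[address] = sorted(r.get("role", "?") for r in proc.get("roles", []))
    return result


def can_connect(address, timeout=2.0):
    """Return True if a TCP connection to `host:port` succeeds within `timeout`."""
    if address.endswith(":tls"):
        address = address[: -len(":tls")]
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and "No route to host";
        # ValueError covers an address without a numeric port.
        return False


# Invented sample data, for illustration only.
sample = {
    "cluster": {
        "processes": {
            "p1": {"address": "10.0.0.1:4500", "roles": [{"role": "storage"}, {"role": "log"}]},
            "p2": {"address": "10.0.0.2:4500", "roles": []},
        }
    }
}

print(roles_by_address(sample))
```

Running `can_connect(addr)` for every address returned by `roles_by_address` from one of the healthy machines would flag a host that still appears in the status output but rejects incoming connections.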

So I wonder

Is there anything else I can do, besides stopping the FDB on the unreachable host?
How does the FDB cluster determine that a machine is alive before assigning it a role? And why does the cluster not detach the inaccessible machine even when "No route to host" errors occur?

Thank you.
