Hi.
I have an FDB cluster of 6 machines in total, configured with three_data_hall. Each region has 2 machines.
I am aware that an FDB cluster can keep working even if one machine is down, and indeed it does.
However, I found that the cluster became unhealthy, with the following messages, after a sudden reboot of one machine.
$ fdbcli --exec "status json"
.....
"database_status" : {
"available" : false,
"healthy" : false
},
......
"messages" : [
{
"description" : "Unable to locate the data distributor worker.",
"name" : "unreachable_dataDistributor_worker"
},
{
"description" : "Unable to read database configuration.",
"name" : "unreadable_configuration"
}
],
Later I found out that firewalld on that machine was blocking all incoming packets, while outgoing packets seemed to pass. I stopped the fdb processes on the machine, and the cluster status went back to okay.
So shutting down a machine or stopping its fdb processes appears to be fine, but a machine that is only half connected (incoming packets blocked, outgoing packets allowed) affects the health of the cluster.
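To make the "half connected" state concrete, here is a minimal sketch of an inbound reachability probe that could be run from one of the other machines. It is not part of FDB; `can_connect` is a hypothetical helper, and the demo uses a local listener so that it runs anywhere. Against a real host you would pass its address and the port fdbserver listens on (4500 is assumed here as the usual default).

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and "No route to host".
        return False

# Demo against a throwaway local listener instead of a real FDB host.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

print("listener up:", can_connect("127.0.0.1", port))
listener.close()
print("listener closed:", can_connect("127.0.0.1", port))
```

A host whose firewall drops or rejects inbound traffic would make a probe like `can_connect("<host>", 4500)` return False from the other machines, even though the host itself can still send packets out.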
When I stop the fdb processes on the host, only 5 machines are visible in the output of fdbcli --exec "status json".
But while incoming packets were blocked by firewalld, I could still see all 6 machines in the output of fdbcli --exec "status json". Some roles were even assigned to the unreachable machine, and I found “No route to host” errors on the other 5 machines.
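For reference, this is roughly how the machine and role lists can be pulled out of the status output. It is a minimal sketch that assumes the status document has `cluster.machines` entries with an `address` field and `cluster.processes` entries with `address` and `roles` fields, and that addresses are IPv4. The helper name `machines_and_roles` and the sample data are made up for illustration, not real cluster output.

```python
import json

def machines_and_roles(status):
    """Map each machine address to the sorted list of roles running on it."""
    cluster = status.get("cluster", {})
    # Start with every machine the cluster reports, even ones with no roles.
    result = {m.get("address"): set() for m in cluster.get("machines", {}).values()}
    for proc in cluster.get("processes", {}).values():
        # Assumes IPv4 "ip:port" (optionally "ip:port:tls"); keep only the ip.
        ip = proc.get("address", "").split(":")[0]
        for role in proc.get("roles", []):
            result.setdefault(ip, set()).add(role.get("role"))
    return {ip: sorted(roles) for ip, roles in result.items()}

# Hypothetical, trimmed-down sample in the assumed layout.
sample = json.loads("""
{
  "cluster": {
    "machines": {
      "m1": {"address": "10.0.0.1"},
      "m2": {"address": "10.0.0.2"}
    },
    "processes": {
      "p1": {"address": "10.0.0.1:4500",
             "roles": [{"role": "storage"}, {"role": "log"}]},
      "p2": {"address": "10.0.0.2:4500", "roles": []}
    }
  }
}
""")

for ip, roles in sorted(machines_and_roles(sample).items()):
    print(ip, roles)
```

With a real cluster, the output of fdbcli --exec "status json" can be saved to a file and loaded with `json.load` in place of the sample.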
So I have two questions:
- Is there anything else I can do, besides stopping FDB on the unreachable host?
- How does the FDB cluster decide that a machine is alive before assigning it a role? And why does the cluster not detach the inaccessible machine even when “No route to host” errors occur?
Thank you.