Hi. I have an FDB cluster (version 6.2.26) with 6 machines in total, configured with three_data_hall. There are 2 machines in each region.
I found that cluster recovery gets stuck when one machine (with wrong firewall settings) denies all incoming packets but allows all outgoing packets.
....
"messages" : [
    {
        "description" : "Unable to read database configuration.",
        "name" : "unreadable_configuration"
    }
],
....
"recovery_state" : {
    "active_generations" : 1,
    "description" : "Initializing new transaction servers and recovering transaction logs.",
    "name" : "initializing_transaction_servers"
}
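To watch for this state from a script, here is a minimal sketch that pulls the recovery state name and the message names out of a status document. It assumes both fragments sit under a top-level `cluster` key, as I believe they do in `status json` output; `summarize_recovery` is a hypothetical helper name, not something FDB provides.

```python
import json


def summarize_recovery(status):
    """Return (recovery state name, list of message names) from a status dict."""
    # Assumption: "recovery_state" and "messages" live under a top-level "cluster" key.
    cluster = status.get("cluster", {})
    state = cluster.get("recovery_state", {}).get("name")
    messages = [m.get("name") for m in cluster.get("messages", [])]
    return state, messages


# Sample built from the status fragments shown above, wrapped in the assumed "cluster" key.
sample = json.loads("""
{
  "cluster": {
    "messages": [
      {
        "description": "Unable to read database configuration.",
        "name": "unreadable_configuration"
      }
    ],
    "recovery_state": {
      "active_generations": 1,
      "description": "Initializing new transaction servers and recovering transaction logs.",
      "name": "initializing_transaction_servers"
    }
  }
}
""")

state, messages = summarize_recovery(sample)
print(state, messages)
```

Polling this periodically would show whether the recovery state ever moves past `initializing_transaction_servers`.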
From the master process logs, I found that connectionKeeper keeps attempting to connect to the unreachable machine even after TooManyConnectionsClosed. Because the machine allows all outgoing packets, it still reports itself as healthy to the CC. I believe this is why the master process keeps trying to connect to that machine.
Is there really no limit on the number of connection retries, or is there a workaround for this issue? Unless I stop that machine manually, the master keeps trying to connect and failing, which leaves the entire cluster stuck.
<Event Severity="10" Time="16:32:49:817357000" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="1" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
<Event Severity="10" Time="16:32:50:830088000" Type="ConnectionTimedOut" ID="0000000000000000" SuppressedEventCount="2" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
<Event Severity="10" Time="16:32:50:830088000" Type="ConnectionClosed" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="2" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
<Event Severity="30" Time="16:32:50:830088000" Type="TooManyConnectionsClosed" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
<Event Severity="10" Time="16:32:50:841474000" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="2" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
<Event Severity="10" Time="16:32:51:851511000" Type="ConnectionTimedOut" ID="0000000000000000" SuppressedEventCount="1" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" LogGroup="default" Roles="MS" />
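To see how often each peer fails, the trace events can be tallied per `PeerAddr`. This is only a rough sketch: it assumes each event is a self-contained XML element on one line, as in the excerpt above, and `count_events_by_peer` is a hypothetical helper, not an FDB tool. The sample lines are trimmed copies of the events above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_events_by_peer(lines, event_types=("ConnectionTimedOut", "ConnectionClosed")):
    """Count matching trace events, keyed by (PeerAddr, Type)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip anything that is not a single-line <Event ... /> element.
        if not line.startswith("<Event"):
            continue
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # ignore truncated or malformed lines
        if event.get("Type") in event_types:
            counts[(event.get("PeerAddr"), event.get("Type"))] += 1
    return counts


# Trimmed copies of the events from the master process log.
sample_lines = [
    '<Event Severity="10" Time="16:32:49:817357000" Type="ConnectingTo" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" />',
    '<Event Severity="10" Time="16:32:50:830088000" Type="ConnectionTimedOut" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" />',
    '<Event Severity="10" Time="16:32:50:830088000" Type="ConnectionClosed" Error="connection_failed" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" />',
    '<Event Severity="10" Time="16:32:51:851511000" Type="ConnectionTimedOut" PeerAddr="172.16.1.17:20000" Machine="172.18.1.16:20002" />',
]

counts = count_events_by_peer(sample_lines)
for (peer, event_type), n in sorted(counts.items()):
    print(peer, event_type, n)
```

A peer whose counts keep growing while the recovery state stays unchanged would be a candidate for the machine that is blocking recovery.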
Thank you.