Suspected bug in config broadcaster

Hi there,

I believe there could be a bug in the config broadcaster. Some context: I am running clusters with FoundationDB 7.3.27 and operator v1.33, with k8s services enabled for routing, and I find that the tailing-sidecar contains a lot of N2_ConnectError log lines (see copy/paste below).

I produced these log lines by grepping for the IP address which a stateless process is trying to reach.

I think that this is an IP of the k8s service of an older coordinator which does not exist anymore. In some cases I can find this IP in the exclude list, sometimes not; also, it seems that only stateless processes are generating this traffic towards old coordinators.

Restarting the fdbserver process is ineffective. However, if I instead delete all the stateless pods which generate this traffic, one by one, the issue goes away for some time and I see no more traffic towards the old, non-existent service IPs. This suggests that some kind of state in the stateless pods is responsible for reaching out to these ghost coordinators.
After ~30 min I see the IPs again, which leads me to believe that the processes are passing information about these older coordinator IPs back and forth, and it’s never really forgotten.

I would also be tempted to use --no-config-db on the fdbserver processes’ command line, but I am concerned about possible safety issues, and it’s not supported by the operator anyway.

I do not want to open a bug without clear proof of it. I could not find specific commits about this possible issue in the git log, and I am available for further questions or test cases, as I can reliably reproduce this problem on our clusters.

Thanks for any insight you can provide!

{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0007", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0007", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0006", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0007", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0008", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "75.0006", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "25", "ConnectIncomingCount": "0", "ConnectFailedCount": "25", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "PingLatency", "ID": "0000000000000000", "Elapsed": "81.0007", "PeerAddr": "172.20.31.95:4501", "MinLatency": "1.79769e+308", "MaxLatency": "-1.79769e+308", "MeanLatency": "0", "MedianLatency": "0", "P90Latency": "0", "Count": "0", "BytesReceived": "0", "BytesSent": "0", "TimeoutCount": "0", "ConnectOutgoingCount": "27", "ConnectIncomingCount": "0", "ConnectFailedCount": "27", "ConnectMinLatency": "1.79769e+308", "ConnectMaxLatency": "-1.79769e+308", "ConnectMeanLatency": "0", "ConnectMedianLatency": "0", "ConnectP90Latency": "0", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "PeerReferences": "5", "FailureStatus": "FAILED", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
{  "Severity": "10", "Type": "ConnectionTimedOut", "ID": "0000000000000000", "SuppressedEventCount": "1", "PeerAddr": "172.20.31.95:4501", "ThreadID": "5345288753128754575", "Machine": "172.20.200.234:4501", "LogGroup": "my-cluster-c", "Roles": "CS" }
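For anyone who wants to reproduce the filtering above without plain grep, here is a minimal sketch in Python that counts trace events per Type for a single PeerAddr. It assumes one JSON object per line, as in the paste above; the function name summarize_peer_events, the trimmed sample events, and the 10.0.0.1 address are made up for illustration.

```python
import json
from collections import Counter


def summarize_peer_events(lines, peer_addr):
    """Count JSON trace events by Type for a single PeerAddr.

    Lines that are empty or are not valid JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        if event.get("PeerAddr") == peer_addr:
            counts[event.get("Type", "unknown")] += 1
    return counts


# Trimmed events modelled on the paste above; 10.0.0.1 is a hypothetical
# healthy peer that the filter should ignore.
sample = [
    '{"Type": "ConnectionClosed", "PeerAddr": "172.20.31.95:4501", "Roles": "CS"}',
    '{"Type": "ConnectingTo", "PeerAddr": "172.20.31.95:4501", "Roles": "CS"}',
    '{"Type": "ConnectionTimedOut", "PeerAddr": "172.20.31.95:4501", "Roles": "CS"}',
    '{"Type": "ConnectingTo", "PeerAddr": "10.0.0.1:4501", "Roles": "CS"}',
    "not a json line",
]

counts = summarize_peer_events(sample, "172.20.31.95:4501")
for event_type, n in sorted(counts.items()):
    print(event_type, n)
```

In practice the lines would come from the sidecar output or a trace file instead of the inline sample, for example by iterating over an open file handle.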

I think that this is an IP of the k8s service of an older coordinator which does not exist anymore. In some cases I can find this IP in the exclude list, sometimes not; also, it seems that only stateless processes are generating this traffic towards old coordinators.

Restarting the fdbserver process is ineffective. However, if I instead delete all the stateless pods which generate this traffic, one by one, the issue goes away for some time and I see no more traffic towards the old, non-existent service IPs. This suggests that some kind of state in the stateless pods is responsible for reaching out to these ghost coordinators.
After ~30 min I see the IPs again, which leads me to believe that the processes are passing information about these older coordinator IPs back and forth, and it’s never really forgotten.

This is a known issue that fdbserver processes try to reach old addresses. I’m not sure if we have a GitHub issue for that (CC @jzhou). Restarting the cluster with fdbcli should also reset this state.

I would also be tempted to use --no-config-db on the fdbserver processes’ command line, but I am concerned about possible safety issues, and it’s not supported by the operator anyway.

That should be supported through customParameters, as this flag must be set on the fdbserver process. Have you tried setting it, and did it not work?

Thanks for the reply!

Forgive my ignorance, but how do you restart a cluster via fdbcli?

I haven’t tried it, because customParameters would cover only part of the functionality. For example, the operator never specifies that option when running the coordinators command via fdbcli, so I concluded it’s not supported.

Forgive my ignorance, but how do you restart a cluster via fdbcli?

I wouldn’t call that ignorance. The command would be kill; kill all. Obviously that will cause a minimal downtime. Docs for fdbcli kill: Command Line Interface, FoundationDB 7.1
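As a rough illustration of how that could be scripted, the sketch below builds a non-interactive fdbcli invocation using its --exec option, with -C pointing at a cluster file. The helper name build_kill_all_command and the cluster file path are placeholders, not something taken from the operator; the snippet only prints the command and does not run it.

```python
import shlex


def build_kill_all_command(cluster_file=None):
    """Return the argv for an fdbcli call that restarts all processes.

    The first "kill" populates the list of known processes; "kill all"
    then restarts them, which causes a short period of unavailability.
    """
    argv = ["fdbcli"]
    if cluster_file:
        argv += ["-C", cluster_file]
    argv += ["--exec", "kill; kill all"]
    return argv


# Placeholder path; substitute the cluster file used in your environment.
command = build_kill_all_command("/path/to/fdb.cluster")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run once you are ready to accept the brief downtime.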

I haven’t tried it, because customParameters would cover only part of the functionality. For example, the operator never specifies that option when running the coordinators command via fdbcli, so I concluded it’s not supported.

Good point. Could you open a GitHub issue for that in the operator repo? I’m not sure what will happen when the fdbserver processes are started with that flag but fdbcli is not setting it.

According to that commit message, it should make the command hang:

Failing to specify this option
when the configuration database is not active will not affect the
correctness of the command, but it will hang instead of returning.

Created a new issue here

Requiring extra flags from clients when the configuration database is disabled on the server is not a great experience. I submitted a PR to fix this by always serving the relevant interfaces even if the config DB is disabled: Serve no-op configuration database interface when disabled by sfc-gh-ljoswiak · Pull Request #11491 · apple/foundationdb · GitHub. This allows the coordinator change command to succeed regardless of whether the config DB is running, without any extra client-side logic.