Pods trying to connect to IPs no longer assigned to FDB pods

I’m starting to see some odd behavior while investigating other issues:
lots of pods are trying to connect to IPs that are no longer assigned to any current FDB pod.
For instance, on the pod currently holding the cluster controller (CC) role:

{  "Severity": "10", "Time": "1745441774.112936", "DateTime": "2025-04-23T20:56:14Z", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "14", "PeerAddr": "10.193.1.48:4500:tls(fromHostname)", "PeerAddress": "10.193.1.48:4500:              tls(fromHostname)", "PeerReferences": "14", "FailureStatus": "FAILED", "ThreadID": "xxx", "Machine": "10.193.0.244:4500", "LogGroup": "fdb-cluster", "Roles": "CC" }
...
{  "Severity": "10", "Time": "1745449091.502429", "DateTime": "2025-04-23T22:58:11Z", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026",                        "SuppressedEventCount": "7", "PeerAddr": "10.193.1.48:4500:tls(fromHostname)", "PeerAddress": "10.193.1.48:4500:tls(fromHostname)", "ThreadID": "xxx", "Machine": "10.193.0.244:4500", "LogGroup": "fdb-cluster", "Roles": "CC" }

The IP 10.193.1.48 is currently allocated in my Kubernetes cluster, but to a non-FDB pod, and you can see from the timestamps that the issue has been present for at least two hours. I suspect that at some point an FDB pod had this IP, but it has been gone for a while. Why are processes still trying to connect to IPs that are (long) gone?
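To get a sense of how widespread this is, I’ve been tallying the failing peer addresses straight from the JSON trace files. A quick sketch (the trace log directory is an assumption; point it at wherever your fdbserver pods write their traces):

```python
import json
import glob
from collections import Counter

# Assumption: directory where the fdbserver JSON trace logs are written.
TRACE_GLOB = "/var/log/fdb-trace-logs/*.json"

stale_peers = Counter()

for path in glob.glob(TRACE_GLOB):
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip partially written lines
            # Count failed connection attempts and failed connections per peer address.
            if event.get("Type") == "ConnectingTo" and event.get("FailureStatus") == "FAILED":
                stale_peers[event.get("PeerAddr", "unknown")] += 1
            elif event.get("Type") == "ConnectionClosed" and event.get("Error") == "connection_failed":
                stale_peers[event.get("PeerAddr", "unknown")] += 1

for peer, count in stale_peers.most_common():
    print(f"{peer}\t{count} failed connection events")
```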

That’s a bug in the networking layer of FDB. Every fdbserver process keeps an in-memory map of connected peers, and in theory old peers should be removed when they disappear. We are already investigating this issue, but it’s a bit more complex to solve. The same issue can exist on the client side. If those additional connection attempts cause problems, restarting the fdbserver processes will clean up the in-memory map.
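If you want to confirm which addresses the cluster currently considers part of its topology before deciding whether to bounce anything, one option is to pull `status json` and compare its process addresses against the PeerAddr values from the failing ConnectingTo events. A rough sketch in Python that shells out to fdbcli (the cluster file path is an assumption for your deployment):

```python
import json
import subprocess

# Assumptions: fdbcli is available in the pod and the cluster file lives here.
CLUSTER_FILE = "/var/fdb/fdb.cluster"

raw = subprocess.run(
    ["fdbcli", "-C", CLUSTER_FILE, "--exec", "status json"],
    check=True, capture_output=True, text=True,
).stdout

# fdbcli may print warnings before the JSON document, so start at the first brace.
status = json.loads(raw[raw.index("{"):])

# cluster.processes.<id>.address is the address each fdbserver process listens on.
live_addresses = {p["address"] for p in status["cluster"]["processes"].values()}

print("Addresses currently known to the cluster:")
for addr in sorted(live_addresses):
    print(" ", addr)

# Any PeerAddr showing up in ConnectingTo/ConnectionClosed events that is not in
# this set is a stale entry in some process's in-memory peer map.
```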


cc. @jzhou
It turns out this is actually quite a problem:

  • Our logs are littered with scary error messages such as PeerUnavailableForLongTime.
  • Clients seem to have trouble realizing that the IPs of the storage pods have changed, so they keep trying to connect to the same IPs, fail, and then fail over to the remote region (in a multi-region setup) with the message AllLocalAlternativesFailed. See the sketch below for one way to capture these events on the client side.
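For the client side, one way to see these failures directly is to enable client trace logging in JSON format, so events like AllLocalAlternativesFailed and PeerUnavailableForLongTime can be inspected the same way as the server traces. A minimal sketch with the Python bindings (the API version, trace directory, and cluster file path are assumptions for your setup; the network options must be set before the first open):

```python
import fdb

fdb.api_version(710)  # assumption: match the API version your cluster supports

# Write client trace logs in JSON so events such as AllLocalAlternativesFailed
# and PeerUnavailableForLongTime can be grepped like the server logs.
fdb.options.set_trace_enable("/var/log/fdb-client-traces")  # assumed output directory
fdb.options.set_trace_format("json")

db = fdb.open("/var/fdb/fdb.cluster")  # assumed cluster file path

# ... normal application reads/writes; connection attempts to stale storage
# server IPs will show up in the trace files written above.
```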