Pods trying to connect to IPs no longer assigned to FDB pods

I’m starting to see some odd behavior while investigating other issues:
lots of pods are trying to connect to IPs that are no longer assigned to any current FDB pod.
For instance, on the pod currently holding the cluster controller (CC) role:

{  "Severity": "10", "Time": "1745441774.112936", "DateTime": "2025-04-23T20:56:14Z", "Type": "ConnectingTo", "ID": "0000000000000000", "SuppressedEventCount": "14", "PeerAddr": "10.193.1.48:4500:tls(fromHostname)", "PeerAddress": "10.193.1.48:4500:              tls(fromHostname)", "PeerReferences": "14", "FailureStatus": "FAILED", "ThreadID": "xxx", "Machine": "10.193.0.244:4500", "LogGroup": "fdb-cluster", "Roles": "CC" }
...
{  "Severity": "10", "Time": "1745449091.502429", "DateTime": "2025-04-23T22:58:11Z", "Type": "ConnectionClosed", "ID": "0000000000000000", "Error": "connection_failed", "ErrorDescription": "Network connection failed", "ErrorCode": "1026",                        "SuppressedEventCount": "7", "PeerAddr": "10.193.1.48:4500:tls(fromHostname)", "PeerAddress": "10.193.1.48:4500:tls(fromHostname)", "ThreadID": "xxx", "Machine": "10.193.0.244:4500", "LogGroup": "fdb-cluster", "Roles": "CC" }

The IP 10.193.1.48 is currently allocated in my Kubernetes cluster, but to a non-FDB pod, and you can see from the timestamps that the issue has been present for at least two hours. I suspect that at some point an FDB pod had this IP, but it has been gone for a while. Why are processes still trying to connect to IPs that are (long) gone?
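To get a sense of how widespread this is, I’ve been tallying the failing peer addresses straight from the JSON trace files. A quick sketch (the trace log directory is an assumption; point it at wherever your fdbserver pods write their traces):

```python
import json
import glob
from collections import Counter

# Assumption: directory where the fdbserver JSON trace logs are written.
TRACE_GLOB = "/var/log/fdb-trace-logs/*.json"

stale_peers = Counter()

for path in glob.glob(TRACE_GLOB):
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # skip partially written lines
            # Count failed connection attempts and failed connections per peer address.
            if event.get("Type") == "ConnectingTo" and event.get("FailureStatus") == "FAILED":
                stale_peers[event.get("PeerAddr", "unknown")] += 1
            elif event.get("Type") == "ConnectionClosed" and event.get("Error") == "connection_failed":
                stale_peers[event.get("PeerAddr", "unknown")] += 1

for peer, count in stale_peers.most_common():
    print(f"{peer}\t{count} failed connection events")
```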

That’s a bug in the networking layer of FDB. Every fdbserver process keeps an in-memory map of connected peers, and in theory old peers should be removed when they disappear. We are already investigating this issue, but it’s a bit more complex to solve. The same issue can exist on the client side. If those additional connection attempts cause problems, restarting the fdbserver processes will clean up the in-memory map.
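If you want to confirm which addresses the cluster currently considers part of its topology before deciding whether to bounce anything, one option is to pull `status json` and compare its process addresses against the PeerAddr values from the failing ConnectingTo events. A rough sketch in Python that shells out to fdbcli (the cluster file path is an assumption for your deployment):

```python
import json
import subprocess

# Assumptions: fdbcli is available in the pod and the cluster file lives here.
CLUSTER_FILE = "/var/fdb/fdb.cluster"

raw = subprocess.run(
    ["fdbcli", "-C", CLUSTER_FILE, "--exec", "status json"],
    check=True, capture_output=True, text=True,
).stdout

# fdbcli may print warnings before the JSON document, so start at the first brace.
status = json.loads(raw[raw.index("{"):])

# cluster.processes.<id>.address is the address each fdbserver process listens on.
live_addresses = {p["address"] for p in status["cluster"]["processes"].values()}

print("Addresses currently known to the cluster:")
for addr in sorted(live_addresses):
    print(" ", addr)

# Any PeerAddr showing up in ConnectingTo/ConnectionClosed events that is not in
# this set is a stale entry in some process's in-memory peer map.
```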


cc. @jzhou
It turns out this is actually quite a problem:

  • Our logs are littered with scary error messages such as PeerUnavailableForLongTime.
  • Clients seem to have trouble realizing that the IPs of the storage pods have changed, so they keep trying to connect to the same IPs, fail, and then fail over to the remote region (in a multi-region setup) with the message AllLocalAlternativesFailed. See the sketch below for one way to capture these events on the client side.
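For the client side, one way to see these failures directly is to enable client trace logging in JSON format, so events like AllLocalAlternativesFailed and PeerUnavailableForLongTime can be inspected the same way as the server traces. A minimal sketch with the Python bindings (the API version, trace directory, and cluster file path are assumptions for your setup; the network options must be set before the first open):

```python
import fdb

fdb.api_version(710)  # assumption: match the API version your cluster supports

# Write client trace logs in JSON so events such as AllLocalAlternativesFailed
# and PeerUnavailableForLongTime can be grepped like the server logs.
fdb.options.set_trace_enable("/var/log/fdb-client-traces")  # assumed output directory
fdb.options.set_trace_format("json")

db = fdb.open("/var/fdb/fdb.cluster")  # assumed cluster file path

# ... normal application reads/writes; connection attempts to stale storage
# server IPs will show up in the trace files written above.
```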