Detecting network partitions from client

What strategies are people using to detect the scenario where a client can no longer connect to any process in an FDB cluster? We sometimes run into a situation where a client fails to communicate with FDB, and the cause could be:

  1. A problem with the application code
  2. A problem with the FDB cluster
  3. The client was unable to establish a TCP connection with any process in the cluster

We would like to find an easier way to identify when #3 is the case, in order to reach out to cloud providers with concrete evidence of a networking issue. Do others have good ways of detecting this issue, or is there some tracing that we could add to help?


One way to get some insight into what’s going on from a client is to read the status key \xff\xff/status/json. It should be able to tell you whether the coordinators are reachable, and if they are, that rules out an inability to physically connect; a minimal example of reading it from a client is sketched after the list below. If the coordinators are unreachable, there are at least a few possible reasons:

  1. Incompatible protocol version
  2. Incorrect TLS settings
  3. You are talking to the wrong coordinators
  4. The coordinators are really busy and can’t respond
  5. The coordinators can’t be reached on the network
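
For reference, here is a minimal sketch of reading that key from the Python bindings and checking coordinator reachability. The API version, cluster file location, timeout value, and the exact JSON field names (client.coordinators.*) are assumptions based on recent FDB releases, not something confirmed in this thread:

```python
import json
import fdb

fdb.api_version(620)   # assumption: use whatever API version your client supports
db = fdb.open()        # assumption: default cluster file location

tr = db.create_transaction()
tr.options.set_timeout(5000)  # fail fast rather than hanging if nothing is reachable

try:
    # \xff\xff/status/json is a special key served by the client
    raw = tr.get(b'\xff\xff/status/json').wait()
    status = json.loads(raw)
    coords = status.get('client', {}).get('coordinators', {})
    for c in coords.get('coordinators', []):
        print(c.get('address'), 'reachable' if c.get('reachable') else 'UNREACHABLE')
    print('quorum reachable:', coords.get('quorum_reachable'))
except fdb.FDBError as e:
    # e.g. 1031 = transaction_timed_out; failing to fetch status at all
    # is itself a useful signal that something is wrong on the client side
    print('could not read status:', e)
```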

There might be some subtle differences in error messages in the trace logs for some of these, though off the top of my head I’m not sure. It’s also a bit difficult to inspect the trace logs from within the client, and if that’s where you want the info then this isn’t a great answer.

I know fdbcli can tell you if you are trying to talk to processes that you are incompatible with, and there is some desire to surface a similar message on client processes too (and/or put it in the status message for unreachable coordinators). I’m not sure what the current and planned state of understanding TLS failures on the client is, but @alexmiller might be able to answer that. To the extent that we can distinguish the other coordinator connection failures, it seems like it would be useful to surface that information the same way we end up doing it for compatibility problems.

Thanks @ajbeamon. In our use case, we are trying to diagnose past issues for which we don’t have the results of status json. We will look into adding some extra tracing to the client, to help identify when a TCP connection can’t be established at all.

This is a good reason to flatten \xff\xff/status/json. Getting the full json has many drawbacks:

  1. It is slow.
  2. It adds load to the coordinators - so if all clients start calling it, that could kill a coordinator (or you would see false negatives on the clients).
  3. It might fail even if the coordinators are reachable. That means that it isn’t a good tool to automatically detect what kind of problem the client is facing.

Instead, we should probably allow clients to do some failure analysis by explicitly checking only the parts of status they need.
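
Until something like that exists, about the best a client can do is wrap the full status read in a tight timeout and classify the outcome itself. A rough sketch of what that might look like (the helper name, timeout, and classification buckets are made up for illustration, and it still has all the drawbacks listed above):

```python
import json
import fdb

fdb.api_version(620)


def classify_cluster_state(db, timeout_ms=2000):
    """Hypothetical helper: return a coarse bucket describing what the client can see."""
    tr = db.create_transaction()
    tr.options.set_timeout(timeout_ms)
    try:
        status = json.loads(tr.get(b'\xff\xff/status/json').wait())
    except fdb.FDBError:
        # Couldn't even fetch status within the timeout.
        return 'status_unavailable'

    client = status.get('client', {})
    if not client.get('coordinators', {}).get('quorum_reachable', False):
        return 'coordinators_unreachable'
    if not client.get('database_status', {}).get('available', False):
        return 'cluster_unavailable'
    return 'ok'


if __name__ == '__main__':
    print(classify_cluster_state(fdb.open()))
```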

I think you’d already be seeing errors that look like:

<Event Severity="20" Time="1587414990.756122" Type="N2_ConnectError" ID="___" SuppressedEventCount="3" ErrorCode="111" Message="Connection refused" Machine="___" LogGroup="___" Roles="CD,SS" />

They’re spammy, though, and they don’t give the PeerAddr itself; you’d need to look for the corresponding PeerDestroy lines. But from a log analysis perspective, the information should already be there.
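
If you’re doing this analysis offline, a small script can tally those events across trace files. A sketch (the log directory, the one-Event-per-line format, and the PeerAddr attribute on PeerDestroy are assumptions based on the sample line above and typical trace files):

```python
import glob
import xml.etree.ElementTree as ET
from collections import Counter

connect_errors = Counter()   # (ErrorCode, Message) -> count
destroyed_peers = Counter()  # PeerAddr -> count

for path in glob.glob('/var/log/foundationdb/trace.*.xml'):
    with open(path, errors='replace') as f:
        for line in f:
            line = line.strip()
            if not line.startswith('<Event'):
                continue
            try:
                ev = ET.fromstring(line)
            except ET.ParseError:
                continue
            event_type = ev.get('Type', '')
            if event_type == 'N2_ConnectError':
                connect_errors[(ev.get('ErrorCode'), ev.get('Message'))] += 1
            elif event_type == 'PeerDestroy':
                destroyed_peers[ev.get('PeerAddr')] += 1

print('connect errors:', connect_errors.most_common())
print('peers destroyed:', destroyed_peers.most_common(10))
```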

For TLS, you’d also need to look out for FDBLibTLSPolicyFailure or TLSPolicyFailure events on the client or server, which would indicate TLS handshake rejections. EOF errors during connection attempts also tend to indicate TLS problems, but aren’t as definitive.
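
A similarly rough filter for the TLS case, just string-matching the event types mentioned above against the trace files (the paths are again an assumption):

```python
import glob

# Event types mentioned above that indicate TLS handshake rejections
TLS_MARKERS = ('Type="FDBLibTLSPolicyFailure"', 'Type="TLSPolicyFailure"')

hits = 0
for path in glob.glob('/var/log/foundationdb/trace.*.xml'):
    with open(path, errors='replace') as f:
        for line in f:
            if any(marker in line for marker in TLS_MARKERS):
                hits += 1
                print(path, line.strip()[:200])
print('TLS policy failure events found:', hits)
```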
