Detecting network partitions from client

What strategies are people using to detect the scenario where a client can no longer connect to any process in an FDB cluster? We sometimes run into a situation where a client fails to communicate with FDB, and the cause could be:

  1. A problem with the application code
  2. A problem with the FDB cluster
  3. The client was unable to establish a TCP connection with any process in the cluster

We would like to find an easier way to identify when #3 is the case, in order to reach out to cloud providers with concrete evidence of a networking issue. Do others have good ways of detecting this issue, or is there some tracing that we could add to help?


One way to get some insight into what’s going on from a client is to read the status key \xff\xff/status/json. It should be able to tell you whether the coordinators are reachable, and if they are, that rules out an inability to physically connect; a minimal example of reading it from a client is sketched after the list below. If the coordinators are unreachable, there are at least a few possible reasons:

  1. Incompatible protocol version
  2. Incorrect TLS settings
  3. You are talking to the wrong coordinators
  4. The coordinators are really busy and can’t respond
  5. The coordinators can’t be reached on the network
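
For reference, here is a minimal sketch of reading that key from the Python bindings and checking coordinator reachability. The API version, cluster file location, timeout value, and the exact JSON field names (client.coordinators.*) are assumptions based on recent FDB releases, not something confirmed in this thread:

```python
import json
import fdb

fdb.api_version(620)   # assumption: use whatever API version your client supports
db = fdb.open()        # assumption: default cluster file location

tr = db.create_transaction()
tr.options.set_timeout(5000)  # fail fast rather than hanging if nothing is reachable

try:
    # \xff\xff/status/json is a special key served by the client
    raw = tr.get(b'\xff\xff/status/json').wait()
    status = json.loads(raw)
    coords = status.get('client', {}).get('coordinators', {})
    for c in coords.get('coordinators', []):
        print(c.get('address'), 'reachable' if c.get('reachable') else 'UNREACHABLE')
    print('quorum reachable:', coords.get('quorum_reachable'))
except fdb.FDBError as e:
    # e.g. 1031 = transaction_timed_out; failing to fetch status at all
    # is itself a useful signal that something is wrong on the client side
    print('could not read status:', e)
```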

There might be some subtle differences in error messages in the trace logs for some of these, though off the top of my head I’m not sure. It’s also a bit difficult to inspect the trace logs from within the client, and if that’s where you want the info then this isn’t a great answer.

I know fdbcli can tell you if you are trying to talk to processes that you are incompatible with, and there is some desire to surface a similar message on client processes too (and/or put it in the status message for unreachable coordinators). I’m not sure what the current and planned state of understanding TLS failures on the client is, but @alexmiller might be able to answer that. To the extent that we can distinguish the other coordinator connection failures, it seems like it would be useful to surface that information the same way we end up doing it for compatibility problems.

Thanks @ajbeamon. In our use case, we are trying to diagnose past issues for which we don’t have the results of status json. We will look into adding some extra tracing to the client, to help identify when a TCP connection can’t be established at all.

This is a good reason to flatten \xff\xff/status/json. Getting the full json has many drawbacks:

  1. It is slow.
  2. It adds load to the coordinators - so if all clients start calling it, that could kill a coordinator (or you would see false negatives on the clients).
  3. It might fail even if the coordinators are reachable. That means that it isn’t a good tool to automatically detect what kind of problem the client is facing.

Instead, we should probably allow clients to do some failure analysis by explicitly checking only the parts of status they need.
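
Until something like that exists, about the best a client can do is wrap the full status read in a tight timeout and classify the outcome itself. A rough sketch of what that might look like (the helper name, timeout, and classification buckets are made up for illustration, and it still has all the drawbacks listed above):

```python
import json
import fdb

fdb.api_version(620)


def classify_cluster_state(db, timeout_ms=2000):
    """Hypothetical helper: return a coarse bucket describing what the client can see."""
    tr = db.create_transaction()
    tr.options.set_timeout(timeout_ms)
    try:
        status = json.loads(tr.get(b'\xff\xff/status/json').wait())
    except fdb.FDBError:
        # Couldn't even fetch status within the timeout.
        return 'status_unavailable'

    client = status.get('client', {})
    if not client.get('coordinators', {}).get('quorum_reachable', False):
        return 'coordinators_unreachable'
    if not client.get('database_status', {}).get('available', False):
        return 'cluster_unavailable'
    return 'ok'


if __name__ == '__main__':
    print(classify_cluster_state(fdb.open()))
```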

I think you’d already be seeing errors that look like:

<Event Severity="20" Time="1587414990.756122" Type="N2_ConnectError" ID="___" SuppressedEventCount="3" ErrorCode="111" Message="Connection refused" Machine="___" LogGroup="___" Roles="CD,SS" />

They’re spammy, though, and they don’t give the PeerAddr itself; you’d need to look for the corresponding PeerDestroy lines. But from a log analysis perspective, the information should already be there.
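
If you’re doing this analysis offline, a small script can tally those events across trace files. A sketch (the log directory, the one-Event-per-line format, and the PeerAddr attribute on PeerDestroy are assumptions based on the sample line above and typical trace files):

```python
import glob
import xml.etree.ElementTree as ET
from collections import Counter

connect_errors = Counter()   # (ErrorCode, Message) -> count
destroyed_peers = Counter()  # PeerAddr -> count

for path in glob.glob('/var/log/foundationdb/trace.*.xml'):
    with open(path, errors='replace') as f:
        for line in f:
            line = line.strip()
            if not line.startswith('<Event'):
                continue
            try:
                ev = ET.fromstring(line)
            except ET.ParseError:
                continue
            event_type = ev.get('Type', '')
            if event_type == 'N2_ConnectError':
                connect_errors[(ev.get('ErrorCode'), ev.get('Message'))] += 1
            elif event_type == 'PeerDestroy':
                destroyed_peers[ev.get('PeerAddr')] += 1

print('connect errors:', connect_errors.most_common())
print('peers destroyed:', destroyed_peers.most_common(10))
```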

For TLS, you’d also need to look out for FDBLibTLSPolicyFailure or TLSPolicyFailure events on the client or server, which would indicate TLS handshake rejections. EOF errors during connection attempts also tend to indicate TLS problems, but aren’t as definitive.
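
A similarly rough filter for the TLS case, just string-matching the event types mentioned above against the trace files (the paths are again an assumption):

```python
import glob

# Event types mentioned above that indicate TLS handshake rejections
TLS_MARKERS = ('Type="FDBLibTLSPolicyFailure"', 'Type="TLSPolicyFailure"')

hits = 0
for path in glob.glob('/var/log/foundationdb/trace.*.xml'):
    with open(path, errors='replace') as f:
        for line in f:
            if any(marker in line for marker in TLS_MARKERS):
                hits += 1
                print(path, line.strip()[:200])
print('TLS policy failure events found:', hits)
```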
