TLS on OCP timing issues?

Hi,
I am running TLS on OCP. One strange thing we are seeing on this particular cluster is that connecting to it with fdbcli is slow. The cluster has 3 coordinators.
When I first run fdbcli, not all coordinators are reported as reachable.
If I stay in fdbcli and run status again about a minute later, more coordinators are reachable, and eventually all 3 are. If I keep running status after that, it stays that way. But if I wait about 1-2 minutes with no activity and then run status again, it reports some coordinators as unreachable again.
My question is: is this CPU related? Do I need to give each pod more CPU resources?
Is there a timing control for this? It seems that on this particular cluster the TLS verification is taking longer than expected.

I would try provisioning the cluster with more CPU resources. The TLS handshakes introduce some additional overhead, and depending on the current configuration that might lead to throttling (if you have monitoring such as Prometheus in place, you should be able to observe the throttling).
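If no Prometheus is at hand, a rough alternative is to read the CPU cgroup counters from inside the pod. The sketch below assumes a `cpu.stat` file that contains `nr_periods` and `nr_throttled` lines; the function name is made up for illustration, and the location of the file differs between cgroup versions and container runtimes.

```python
def throttled_ratio(cpu_stat_text):
    """Return the fraction of CFS periods in which the cgroup was throttled.

    cpu_stat_text is the content of a cgroup cpu.stat file, one
    "<name> <integer>" pair per line.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Keep only well-formed "<name> <integer>" lines.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU quota configured, or no data yet
    return stats.get("nr_throttled", 0) / periods
```

To use it, pass in the text of the pod's `cpu.stat` file (typically somewhere under `/sys/fs/cgroup`). A ratio that keeps growing while fdbcli is slow would point towards CPU limits rather than TLS itself.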

Hi Johannes,
We moved the cluster to nodes with more CPUs and are still having the issue. We took a look at the trace log and this keeps coming up. Do you know what it means?

<Event Severity="20" Time="1642625895.677015" DateTime="2022-01-19T20:58:15Z" Type="N2_AcceptHandshakeError" ID="0000000000000000" SuppressedEventCount="0" ErrorCode="1" Message="stream truncated" Machine="52.118.78.49:4500" LogGroup="fdb-tls" Roles="CD,DD,RK,SS" />

So, what does “stream truncated” mean in the above?

Are there any other TLS related issues (normally those events have the type “*TLS*”)? From the boost docs, the error indicates the following (spoiler: I’m not a C++ expert):

Set to indicate what error occurred, if any. Specifically, StreamTruncated will be set if the peer has closed the connection but did not properly shut down the SSL connection.

So I guess something in your setup is not correct? Could you check for additional events of the TLS or N2 type that might indicate what’s wrong? You should also ensure that your OCP setup has the correct MTUs on the different layers (e.g. host, container network). You might also want to use Wireshark to check the packets sent to see if some of them are corrupted (or the MTU doesn’t match).
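A quick way to get an overview of those events is to count them per type. This is only a sketch: it assumes XML trace files with one `<Event ... />` element per line, as in the snippet above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_network_events(lines):
    """Count trace events whose Type starts with "N2_" or contains "TLS".

    Returns a Counter keyed by (Type, Message); Message is "" when the
    event has no such attribute.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue  # skip the XML header and other non-event lines
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip truncated or partially written lines
        event_type = event.get("Type", "")
        if event_type.startswith("N2_") or "TLS" in event_type:
            counts[(event_type, event.get("Message", ""))] += 1
    return counts
```

Calling it with an open trace file (`count_network_events(open(path))`) and printing `counts.most_common()` shows which N2 or TLS events dominate.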

I see a lot of TLSIncomingConnectionThrottlingWarning events:
Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="1" PeerIP="10.240.128.7" Machine="52.118.78.49:4500" LogGroup="fdb-tls" Roles="CD,DD,RK,SS" />

I think that was implemented for cases where misconfigured TLS clients get rejected and then try to reconnect. It was meant to lessen the effect of a small number of misconfigured TLS clients making FDB clusters unavailable by forcing them to do TLS verification over and over again.

I’d suggest looking for TLSPolicyFailure messages, or checking status json cluster.processes.*.network.tls_policy_failure, to see if there are any TLS cert verification issues going on.
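For the status json route, something like the following could pull that field out for every process. It is a sketch under assumptions: the function name is made up, and the exact key name may differ between FDB versions, so it matches any key under `network` that starts with `tls_policy_failure`.

```python
def tls_policy_failures(status):
    """Collect TLS policy failure entries per process from parsed status json.

    status is the dict obtained by parsing the output of "status json".
    Returns {process address (or process id): value of the matching key}.
    """
    results = {}
    processes = status.get("cluster", {}).get("processes", {})
    for process_id, process in processes.items():
        network = process.get("network", {})
        for key, value in network.items():
            # Assumption: the key is named tls_policy_failure or similar.
            if key.startswith("tls_policy_failure"):
                results[process.get("address", process_id)] = value
    return results
```

The input can be produced by saving the output of `status json` from fdbcli to a file and loading it with `json.load`. Non-zero values for a process would suggest that some peers are failing certificate verification.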

Ok, I will take a look. We do get connected, but it is on and off, so I think TLS is kind of working. Also, running fdbcli with --debug_tls yields no errors in the TLS part.

After I increased the CPU and memory for each pod from the defaults of 250m and 128Mi to 600m and 526Mi, the connection is more stable, but there are still N2_ReadError events due to stream truncation every second. That impacts performance: I can see the hit in fdbcli when issuing the status command, where I have to wait a few seconds before it comes back, but the status is still Healthy and all coordinators are reachable.
As for additional errors, I don’t see any more TLS related errors, but I do see this one:

<Event Severity="20" Time="1644335443.560660" DateTime="2022-02-08T15:50:43Z" Type="N2_ReadError" ID="791d51045cd45366" SuppressedEventCount="4" ErrorCode="1" Message="stream truncated" Machine="52.117.13.21:4500" LogGroup="sample-cluster" Roles="CD,RV,SS" />
<Event Severity="10" Time="1644335443.560660" DateTime="2022-02-08T15:50:43Z" Type="ConnectionClosed" ID="791d51045cd45366" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="4" PeerAddr="150.240.65.205:4500:tls" Machine="52.117.13.21:4500" LogGroup="sample-cluster" Roles="CD,RV,SS" />

Any help would be appreciated.