TLS on ocp timing issues?

tangerine · January 14, 2022, 9:28pm

Hi,
I am running TLS on OCP. One strange thing we are seeing on this particular cluster is that connections made with fdbcli are slow. The cluster has 3 coordinators.
When I first run fdbcli, not all coordinators are reported as reachable.
If I stay in fdbcli and run status again about a minute later, more coordinators are reachable, and eventually all 3 are. If I keep running status after that, they stay reachable. But if I wait 1-2 minutes with no activity and then run status again, some coordinators are reported as unreachable again.
My question is: is this CPU related? Do I need to give each pod more CPU resources?
Is there a timing control for this? It seems that on this particular cluster the TLS verification is taking longer than expected.

johscheuer · January 17, 2022, 6:05am

I would try provisioning the cluster with more CPU resources. The TLS handshakes introduce some additional overhead and, depending on the current configuration, that might lead to throttling (if you have some monitoring like Prometheus in place, you should be able to observe the throttling).
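If no monitoring is in place, the throttling can also be estimated from inside the container. The following is a minimal sketch, assuming the pod exposes the kernel's cgroup `cpu.stat` file (typically `/sys/fs/cgroup/cpu.stat` on cgroup v2; the path differs on cgroup v1) with the `nr_periods` and `nr_throttled` counters. The function names are hypothetical and not part of FoundationDB or the operator.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name integer" lines; ignore anything else.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CPU enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No CPU limit configured, or no periods elapsed yet.
        return 0.0
    return stats.get("nr_throttled", 0) / periods


def read_throttled_fraction(path="/sys/fs/cgroup/cpu.stat"):
    """Convenience wrapper; the default path is an assumption (cgroup v2)."""
    with open(path) as f:
        return throttled_fraction(parse_cpu_stat(f.read()))
```

A fraction that stays well above zero while fdbcli is connecting would suggest the CPU limit is being hit. With Prometheus and cAdvisor, the same ratio is usually derived from `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`.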

tangerine · January 19, 2022, 9:02pm

Hi Johannes,
We moved the cluster to nodes with more CPUs and are still having the issue. We took a look at the trace log and this keeps coming up. Do you know what it means?

<Event Severity="20" Time="1642625895.677015" DateTime="2022-01-19T20:58:15Z" Type="N2_AcceptHandshakeError" ID="0000000000000000" SuppressedEventCount="0" ErrorCode="1" Message="stream truncated" Machine="52.118.78.49:4500" LogGroup="fdb-tls" Roles="CD,DD,RK,SS" />

So, what does “stream truncated” mean in the above?

johscheuer · January 20, 2022, 7:54am

Are there any other TLS-related issues (normally those events have the type “*TLS*”)? From the Boost docs, the error indicates the following (spoiler: I’m not a C++ expert):

Set to indicate what error occurred, if any. Specifically, StreamTruncated will be set if the peer has closed the connection but did not properly shut down the SSL connection.

So I guess something in your setup is not correct? Could you check for additional events of the TLS or N2 type that might indicate what’s wrong? You should also ensure that your OCP setup has the correct MTUs on the different layers (e.g. host, container network). You might also want to use Wireshark to check the packets sent to see if some of them are corrupted (or the MTU doesn’t match).

tangerine · January 20, 2022, 2:58pm

I see a lot of TLSIncomingConnectionThrottlingWarning events:
Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="1" PeerIP="10.240.128.7" Machine="52.118.78.49:4500" LogGroup="fdb-tls" Roles="CD,DD,RK,SS" />

alexmiller · January 21, 2022, 3:41am

I think that throttling was implemented for the case of misconfigured TLS clients, which get rejected and then try to reconnect. It is meant to lessen the effect of a small number of misconfigured TLS clients making FDB clusters unavailable by forcing them to do TLS verification over and over again.

I’d suggest looking for TLSPolicyFailure messages, or checking status json cluster.processes.*.network.tls_policy_failure to see if there are any TLS cert verification issues going on.
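The status json check can be scripted. Below is a minimal sketch, assuming the JSON document has already been captured (for example with `fdbcli --exec "status json"` plus the cluster's TLS options) and loaded with `json.loads`. The helper name is hypothetical, and because the exact key name may vary between versions, it matches any key under `network` that starts with `tls_policy_failure` rather than one fixed name.

```python
def tls_policy_failures(status):
    """Return {process_id: value} for every process whose "network" section
    contains a key starting with "tls_policy_failure".

    `status` is the parsed status json document (a dict).
    """
    found = {}
    processes = status.get("cluster", {}).get("processes", {})
    for process_id, process in processes.items():
        network = process.get("network", {})
        for key, value in network.items():
            # Prefix match is an assumption to tolerate naming differences.
            if key.startswith("tls_policy_failure"):
                found[process_id] = value
    return found
```

An empty result means no process reported such a field. Non-zero values would point at certificate verification failures rather than CPU pressure.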

tangerine · January 21, 2022, 2:57pm

OK, I will take a look. We do get connected, but it is on and off, so I think TLS is kind of working. Also, running fdbcli with --debug_tls yields no errors in the TLS part.

tangerine · February 8, 2022, 3:56pm

After I increased the CPU and memory for each pod from the defaults of 250m and 128Mi to 600m and 526Mi, the connection is more stable, but there are still N2_ReadError events due to stream truncation every second. That impacts performance: I can see the hit in fdbcli when issuing the status command, where I have to wait a few seconds before it comes back, but the status is still Healthy and all coordinators are reachable.
As for additional errors, I don’t see any more TLS-related errors, but I do see this one:

<Event Severity="20" Time="1644335443.560660" DateTime="2022-02-08T15:50:43Z" Type="N2_ReadError" ID="791d51045cd45366" SuppressedEventCount="4" ErrorCode="1" Message="stream truncated" Machine="52.117.13.21:4500" LogGroup="sample-cluster" Roles="CD,RV,SS" />
<Event Severity="10" Time="1644335443.560660" DateTime="2022-02-08T15:50:43Z" Type="ConnectionClosed" ID="791d51045cd45366" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="4" PeerAddr="150.240.65.205:4500:tls" Machine="52.117.13.21:4500" LogGroup="sample-cluster" Roles="CD,RV,SS" />

Any help would be appreciated.
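To quantify how often these events occur and which messages accompany them, the trace log can be tallied. This is a minimal sketch, assuming an XML trace file with one self-contained `<Event ... />` element per line, as in the excerpts above. The function name and the default list of event types are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_events(lines, types=("N2_ReadError", "N2_AcceptHandshakeError",
                               "ConnectionClosed")):
    """Count logged trace events per (Type, Message or Error) pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue  # skip the XML header, <Trace> wrapper, blank lines
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip truncated or partially written lines
        event_type = event.get("Type")
        if event_type in types:
            detail = event.get("Message") or event.get("Error")
            counts[(event_type, detail)] += 1
    return counts
```

It can be fed an open file object, e.g. `count_events(open(path))`. Only logged lines are counted here; each event's SuppressedEventCount attribute is ignored, so the real rate may be higher.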
