Hello!
We’ve been using FoundationDB for a while and have been running TLS-enabled clusters successfully for about a year. This week we started facing TLS issues after switching to new credentials (our older credentials expired). Details of the cluster configuration are below, but tl;dr: the problem appears when we specify process roles and enable TLS for our cluster. We are running a double-replication, ssd-mode cluster on 7 boxes (RHEL 7) using FoundationDB 6.2.19. The exact problem, in two scenarios:
Scenario 1: TLS on, we specify process roles
- fdbcli status toggles between (a) showing all coordinators as unreachable and (b) showing Healthy
- Transactions time out: java.util.concurrent.CompletionException: com.apple.foundationdb.FDBException: Operation aborted because the transaction timed out
Scenario 2: TLS on, we don’t specify “any” roles
- fdbcli status toggles between (a) showing all coordinators as unreachable and (b) showing Healthy. We also get “WARNING: A single process is both a transaction log and a storage server. For best performance use dedicated disks for the transaction logs by setting process classes.”
- Transactions don’t time out (at least for the first 2-3 hours).
Additionally, the same cluster runs smoothly with TLS disabled. We have verified the TLS certificates on a different cluster (with a different configuration) and see no issues there. The process role assignment is as follows:
- hostname: A
  processes: 2
  coordinator: true
  1 log process, 1 stateless
- hostname: B
  processes: 2
  coordinator: true
  1 storage, 1 stateless
- hostname: C
  processes: 2
  coordinator: true
  1 storage, 1 stateless
- hostname: D
  processes: 3
  coordinator: true
  2 storage, 1 stateless
- hostname: E
  processes: 2
  coordinator: true
  1 log process, 1 stateless
- hostname: F
  processes: 3
  coordinator: true
  2 storage, 1 stateless
- hostname: G
  processes: 3
  coordinator: true
  2 storage, 1 stateless
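To make the totals easy to check, here is a minimal sketch that tallies the layout above. The `layout` dictionary only restates the list, and `tally` is an illustrative helper name, not anything FoundationDB provides.

```python
from collections import Counter

# Layout restated from the list above: host -> roles of its processes.
# (Illustrative data structure; not a FoundationDB config format.)
layout = {
    "A": ["log", "stateless"],
    "B": ["storage", "stateless"],
    "C": ["storage", "stateless"],
    "D": ["storage", "storage", "stateless"],
    "E": ["log", "stateless"],
    "F": ["storage", "storage", "stateless"],
    "G": ["storage", "storage", "stateless"],
}


def tally(layout):
    """Count how many processes of each role exist across all hosts."""
    totals = Counter()
    for roles in layout.values():
        totals.update(roles)
    return totals


totals = tally(layout)
print(dict(totals), "total processes:", sum(totals.values()))
```

That works out to 2 log, 8 storage, and 7 stateless processes, 17 in total, with all 7 hosts acting as coordinators.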
Some trace logs are as below:
Event Severity="10" Time="" Type="TLSPolicyFailure" ID="0000000000000000" SuppressedEventCount="0" Reason="preverification failed" VerifyError="self signed certificate" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="0" PeerIP="" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="TLSPolicyFailure" ID="0000000000000000" SuppressedEventCount="0" Reason="preverification failed" VerifyError="self signed certificate" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="20" Time="" Type="N2_AcceptHandshakeError" ID="0000000000000000" SuppressedEventCount="0" ErrorCode="337100934" Message="certificate verify failed" WhichMeans="error:1417C086:SSL routines:tls_process_client_certificate:certificate verify failed" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="IncomingConnectionError" ID="1ec863c4606aedde" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="1" FromAddress="" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="0" PeerIP="" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="20" Time="" Type="SlowSSLoopx100" ID="d82525c1ce123bfb" Elapsed="0.102445" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="IncomingConnectionError" ID="044168438e2b724f" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" FromAddress="" Machine="" LogGroup="default" Roles="CD,SS"
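In case it helps anyone triage similar logs, here is a rough sketch that counts trace events by type and by the TLS error detail. It assumes events appear one per line as key="value" pairs, as pasted above. The names `parse_event` and `summarize` are made up for this post, and the sample lines are shortened copies of the events above.

```python
import re
from collections import Counter

# Matches key="value" pairs. Curly quotes are tolerated too, because forum
# software sometimes substitutes them for straight quotes.
ATTR = re.compile(r'(\w+)\s*=\s*["“”]([^"“”]*)["“”]')


def parse_event(line):
    """Return the attributes of one trace event line as a dict."""
    return dict(ATTR.findall(line))


def summarize(lines):
    """Count events by (Type, VerifyError-or-Message)."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if "Type" in event:
            detail = event.get("VerifyError") or event.get("Message") or ""
            counts[(event["Type"], detail)] += 1
    return counts


# Shortened copies of the events pasted above (most attributes omitted).
sample = [
    'Event Severity="10" Type="TLSPolicyFailure" Reason="preverification failed" VerifyError="self signed certificate" Roles="CD,SS"',
    'Event Severity="10" Type="TLSPolicyFailure" Reason="preverification failed" VerifyError="self signed certificate" Roles="CD,SS"',
    'Event Severity="20" Type="N2_AcceptHandshakeError" ErrorCode="337100934" Message="certificate verify failed" Roles="CD,SS"',
]

counts = summarize(sample)
for (event_type, detail), n in counts.most_common():
    print(n, event_type, detail)
```

Run over a full trace file, a tally like this should show whether the failures are dominated by one verification error (here "self signed certificate") or spread across several causes.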
We suspect these timeouts have something to do with our configuration, but would appreciate any insight into why we’re seeing these errors. Thanks!