Specifying process roles in a TLS-enabled cluster

Hello!

We’ve been using FoundationDB for a while and have run TLS-enabled clusters successfully for about a year, but this week we started hitting TLS issues after switching to new credentials (our older ones expired). I’ll give the details of the cluster configuration below, but tl;dr: the problem appears when we specify process roles and enable TLS for the cluster. We are running a double-replication, ssd-mode cluster on 7 boxes (RHEL 7) with FoundationDB 6.2.19. The exact problem, in two scenarios:

Scenario 1: TLS on, we specify process roles

  1. fdbcli status toggles between (a) showing all coordinators as unreachable and (b) showing Healthy
  2. Transactions time out: java.util.concurrent.CompletionException: com.apple.foundationdb.FDBException: Operation aborted because the transaction timed out

Scenario 2: TLS on, we don’t specify “any” roles

  1. fdbcli status toggles between (a) showing all coordinators as unreachable and (b) showing Healthy. We also get “WARNING: A single process is both a transaction log and a storage server. For best performance use dedicated disks for the transaction logs by setting process classes.”
  2. Transactions don’t time out (at least for the first 2-3 hours).

Additionally, the same cluster runs smoothly with TLS disabled. We have verified the TLS certificates on a different cluster (with a different configuration) and see no issues there. The process role assignment is as follows:

  • hostname: A
    processes: 2
    coordinator: true
    1 log process, 1 stateless
  • hostname: B
    processes: 2
    coordinator: true
    1 storage, 1 stateless
  • hostname: C
    processes: 2
    coordinator: true
    1 storage, 1 stateless
  • hostname: D
    processes: 3
    coordinator: true
    2 storage, 1 stateless
  • hostname: E
    processes: 2
    coordinator: true
    1 log process, 1 stateless
  • hostname: F
    processes: 3
    coordinator: true
    2 storage, 1 stateless
  • hostname: G
    processes: 3
    coordinator: true
    2 storage, 1 stateless
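
As a quick sanity check on the layout above, the per-class totals can be tallied with a small script. This is only a sketch: the `LAYOUT` dictionary and the `tally_classes` helper are names made up for illustration, and the class lists are transcribed by hand from the list above.

```python
from collections import Counter

# Per-host process classes, transcribed from the role assignment above.
# The dictionary name and shape are illustrative, not an FDB format.
LAYOUT = {
    "A": ["log", "stateless"],
    "B": ["storage", "stateless"],
    "C": ["storage", "stateless"],
    "D": ["storage", "storage", "stateless"],
    "E": ["log", "stateless"],
    "F": ["storage", "storage", "stateless"],
    "G": ["storage", "storage", "stateless"],
}


def tally_classes(layout):
    """Count how many processes of each class the layout defines."""
    return Counter(cls for classes in layout.values() for cls in classes)


totals = tally_classes(LAYOUT)
print(dict(totals), "total:", sum(totals.values()))
```

For this layout the tally is 2 log, 8 storage, and 7 stateless processes, 17 in total, which matches the per-host process counts listed above.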

Some trace logs are as below:

Event Severity="10" Time="" Type="TLSPolicyFailure" ID="0000000000000000" SuppressedEventCount="0" Reason="preverification failed" VerifyError="self signed certificate" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="0" PeerIP="" Machine="" LogGroup="default" Roles="CD,SS"

Event Severity="10" Time="" Type="TLSPolicyFailure" ID="0000000000000000" SuppressedEventCount="0" Reason="preverification failed" VerifyError="self signed certificate" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="20" Time="" Type="N2_AcceptHandshakeError" ID="0000000000000000" SuppressedEventCount="0" ErrorCode="337100934" Message="certificate verify failed" WhichMeans="error:1417C086:SSL routines:tls_process_client_certificate:certificate verify failed" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="IncomingConnectionError" ID="1ec863c4606aedde" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="1" FromAddress="" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="TLSIncomingConnectionThrottlingWarning" ID="0000000000000000" SuppressedEventCount="0" PeerIP="" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="20" Time="" Type="SlowSSLoopx100" ID="d82525c1ce123bfb" Elapsed="0.102445" Machine="" LogGroup="default" Roles="CD,SS"
Event Severity="10" Time="" Type="IncomingConnectionError" ID="044168438e2b724f" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" FromAddress="" Machine="" LogGroup="default" Roles="CD,SS"
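
To see how these events are distributed across a whole trace file rather than a short excerpt, something like the following could be used. It is a rough sketch, assuming each event sits on one line with attributes written as Name="value" in plain ASCII double quotes, as in raw XML trace files. The helper names `parse_event` and `summarize` are made up for illustration, and the sample lines are trimmed copies of the events above.

```python
import re
from collections import Counter

# Matches Name="value" attribute pairs with plain ASCII double quotes.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_event(line):
    """Return the attributes of one trace event line as a dict."""
    return dict(ATTR_RE.findall(line))


def summarize(lines):
    """Count events by Type, and TLS policy failures by VerifyError."""
    by_type = Counter()
    verify_errors = Counter()
    for line in lines:
        event = parse_event(line)
        if "Type" not in event:
            continue  # not an event line
        by_type[event["Type"]] += 1
        if event["Type"] == "TLSPolicyFailure":
            verify_errors[event.get("VerifyError", "unknown")] += 1
    return by_type, verify_errors


# Trimmed sample events based on the excerpt above.
sample_lines = [
    '<Event Severity="10" Type="TLSPolicyFailure" Reason="preverification failed" VerifyError="self signed certificate" Roles="CD,SS" />',
    '<Event Severity="10" Type="TLSIncomingConnectionThrottlingWarning" PeerIP="" Roles="CD,SS" />',
    '<Event Severity="10" Type="TLSPolicyFailure" Reason="preverification failed" VerifyError="self signed certificate" Roles="CD,SS" />',
    '<Event Severity="20" Type="N2_AcceptHandshakeError" Message="certificate verify failed" Roles="CD,SS" />',
]

by_type, verify_errors = summarize(sample_lines)
for event_type, count in by_type.most_common():
    print(event_type, count)
print(dict(verify_errors))
```

Running it per host would show whether the TLSPolicyFailure events are concentrated on particular machines or spread evenly across the cluster.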

We suspect these timeouts have something to do with our configuration, but would appreciate any insight into why we’re seeing these errors. Thanks!

Since you’re getting errors about self signed certificates, it’s likely that the root certificate in your certificate chain doesn’t match what’s in the CA cert.

I think 6.2 has a (maybe working?) fdbcli --debug-tls --tls_certificate_file=... --tls_key_file=... --tls_ca_file=... that will print out the certificate data for you.

I have no explanation for how/why process classes would affect this, unless you have an inconsistent TLS configuration across processes, and adding process restrictions removes some of the (misconfigured?) processes that have a different CA cert loaded or accept invalid certificates (Check.Valid=0?). If you have a large number of clients, some of which have the new certificates and some of which don’t, then this may just be a case of rejected clients continuously trying to reconnect to the cluster and effectively DoS’ing the FDB processes. Certificate verification is computationally expensive…
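
One way to check for that kind of inconsistency is to fingerprint the CA file from every host and see whether they all agree. The following is a minimal sketch, assuming the CA files are PEM bundles whose contents have already been collected into strings, one per host. The function names and the fake certificate bodies in the example are made up for illustration, and the script only compares bytes; it does not validate any chain.

```python
import base64
import hashlib
import re

PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def cert_fingerprints(pem_text):
    """Return the SHA-256 fingerprint of each certificate in a PEM bundle, in file order."""
    fingerprints = []
    for body in PEM_CERT_RE.findall(pem_text):
        der = base64.b64decode("".join(body.split()))
        fingerprints.append(hashlib.sha256(der).hexdigest())
    return fingerprints


def group_hosts_by_ca(ca_by_host):
    """Group hosts by the fingerprints found in their CA file.

    More than one group means the hosts do not all trust the same CA certificates.
    """
    groups = {}
    for host, pem_text in ca_by_host.items():
        groups.setdefault(tuple(cert_fingerprints(pem_text)), []).append(host)
    return groups


def fake_pem(data):
    """Build a PEM-shaped block around arbitrary bytes (illustration only, not a real certificate)."""
    encoded = base64.b64encode(data).decode("ascii")
    return "-----BEGIN CERTIFICATE-----\n" + encoded + "\n-----END CERTIFICATE-----\n"


# Hypothetical input: hosts A and B carry the new root, host C still has the old one.
example = {
    "A": fake_pem(b"new-root"),
    "B": fake_pem(b"new-root"),
    "C": fake_pem(b"old-root"),
}
for fingerprints, hosts in group_hosts_by_ca(example).items():
    print(hosts, [fp[:16] for fp in fingerprints])
```

The same fingerprints can be compared against the last certificate in each host's certificate chain file, if the chain includes its root.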

Status json has a .cluster.processes.<hash>.network.tls_policy_failures, which gives you the number of connections rejected due to failing certificate verification per second. I’d recommend adding something to your monitoring and alerting to graph the sum of that across all processes to make it easy to identify when you have misconfigured TLS clients that could cause problems for your cluster.
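
A rough sketch of that sum, assuming the status json output has already been saved or captured as text (for example from fdbcli --exec "status json"). The `total_tls_policy_failures` helper is a made-up name, the sample document is hypothetical and heavily trimmed, and the assumption that the counter is reported as an object with an `hz` field should be checked against the output of your own version.

```python
import json


def total_tls_policy_failures(status):
    """Sum the per-process TLS policy failure rate across a parsed status json document."""
    total = 0.0
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        value = proc.get("network", {}).get("tls_policy_failures", 0.0)
        # Assumption: the counter is reported as {"hz": <rate>}; accept a bare number too.
        if isinstance(value, dict):
            value = value.get("hz", 0.0)
        total += float(value)
    return total


# Hypothetical, heavily trimmed status document for illustration only.
sample_status = json.loads("""
{
  "cluster": {
    "processes": {
      "hash1": {"network": {"tls_policy_failures": {"hz": 1.5}}},
      "hash2": {"network": {"tls_policy_failures": {"hz": 0.5}}},
      "hash3": {}
    }
  }
}
""")

print(total_tls_policy_failures(sample_status))
```

Emitting this number to whatever metrics system is already in place gives a single series to graph and alert on.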

Thank you for the prompt response, @alexmiller. We have a limited number of clients, far fewer than the usual load we have in our other clusters.

We are running this cluster on smaller machines (than our usual ones) which is why what you said about “Certificate verification is computationally expensive” catches my eye.

Nevertheless, will verify the root certificate and post here. Thanks again!