Set up fdb to use TLS failure with N2_ConnectError

Hello experts:
I am Setting Up FoundationDB to use TLS following Transport Layer Security — FoundationDB 6.3
foundationdb.conf


## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file

[fdbmonitor]
restart_delay = 20

[general]
cluster_file=C:\ProgramData\foundationdb\fdb.cluster

## Default parameters for individual fdbserver processes
[fdbserver]
public_address = auto:$ID
listen_address = public
parentpid = $PID
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id = 
# datacenter_id = 
# class = 
# memory = 8GiB
# storage_memory = 1GiB
# cache_memory = 2GiB
# metrics_cluster =
# metrics_prefix =
command=C:\Program Files\foundationdb\bin\fdbserver.exe
datadir=C:\ProgramData\foundationdb\data\$ID
logdir=C:\ProgramData\foundationdb\logs

tls_certificate_file=C:\mywork\foundationdb\tls\fdb.pem
tls_key_file=C:\mywork\foundationdb\tls\private.key
tls_verify_peers=Check.Valid=0
tls_ca_file=C:\mywork\foundationdb\tls\cert.crt
## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4600]

## Default parameters for individual backup_agent processes
[backup-agent]
command=C:\Program Files\foundationdb\bin\backup_agent.exe 

logdir=C:\ProgramData\foundationdb\logs

## An individual backup-agent process
[backup-agent.1]

fdb.cluster
tlsdb:tlsdb@9.30.211.106:4600:tls

I got error:
Kindly advise how to trouble shooting? thank you.

<?xml version="1.0"?>

<Trace>

<Event Severity="10" Time="1637310613.301615" Type="Net2Starting" ID="0000000000000000" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301935" Type="ProgramStart" ID="0000000000000000" SourceVersion="Current version id not currently supported within Windows." Version="6.2.22" PackageName="6.2" ActualTime="1637310613" CommandLine="C:\Program Files\foundationdb\bin\backup_agent.exe --cluster_file=C:\ProgramData\foundationdb\fdb.cluster --logdir=C:\ProgramData\foundationdb\logs" MemoryLimit="8589934592" TrackLatestType="Original" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301672" Type="ClientStart" ID="0000000000000000" SourceVersion="Current version id not currently supported within Windows." Version="6.2.22" PackageName="6.2" ClusterFile="C:\ProgramData\foundationdb\fdb.cluster" ConnectionString="tlsdb:tlsdb@9.30.211.106:4600:tls" ActualTime="1637310613" ApiVersion="-1" ImageOffset="0000000000000000" Machine="9.30.211.106:8748" LogGroup="default" TrackLatestType="Original" />

<Event Severity="10" Time="1637310613.301672" Type="SetupQuery" ID="0000000000000000" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301672" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301672" Type="CodeCoverage" ID="0000000000000000" File="c:\dev\cygwin\home\fdb\jenkins\node\workspace\fdbbuildjobs\fdbbuildwindows\work\_build_\src\fdbclient\readyourwrites.actor.cpp" Line="1210" Condition="true" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301672" Type="Net2Running" ID="0000000000000000" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310613.301672" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1637310614.037694" SystemTime="1637310614.040071" OffsetFromSystemTime="-0.002378" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310614.037822" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310614.038925" Type="TaskBucketMetrics" ID="e836c617a848bc7d" Elapsed="0" DispatchSlotChecksStarted="0 0 0" DispatchErrors="0 0 0" DispatchDoTasks="0 0 0" DispatchEmptyTasks="0 0 0" DispatchSlotChecksComplete="0 0 0" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310614.537789" Type="SomewhatSlowRunLoopTop" ID="0000000000000000" Elapsed="0.249753" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310615.539787" Type="SomewhatSlowRunLoopTop" ID="0000000000000000" Elapsed="0.249978" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310615.789794" Type="SomewhatSlowRunLoopTop" ID="0000000000000000" Elapsed="0.249885" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310616.037829" Type="ConnectionTimedOut" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310616.037829" Type="NotifyAddressFailed" ID="0000000000000000" SuppressedEventCount="0" Address="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310616.037829" Type="ConnectionClosed" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="20" Time="1637310616.037829" Type="N2_ConnectError" ID="0a091d0298673b0e" SuppressedEventCount="0" ErrorCode="995" Message="The I/O operation has been aborted because of either a thread exit or an application request" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310616.039718" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="9.30.211.106:4600:tls" Machine="9.30.211.106:8748" LogGroup="default" />

<Event Severity="10" Time="1637310617.039746" Type="AlreadyDisconnected" ID="0000000000000000" Addr="9.30.211.106:4600:tls" Tok="ffffffffffffffff" Machine="9.30.211.106:8748" LogGroup="default" />

What do the logs look like on the server side around when the client was connecting?

The client side of this reads like the server rejected the client for some reason, and there’s likely a TLS error or connection error in the server logs that would say why.

Thank you @alexmiller for looking into it. The client and server are on the same machine. So the log I got is from C:\ProgramData\foundationdb\logs, so this is server log right?
do we have like debug parameter to be enabled to get more info?
I couldn’t find other log.
From the current setup, is everything so far all good?

I set it up on linux, and got this error:

<Event Severity="40" ErrorKind="Unset" Time="1637509624.972362" DateTime="2021-11-21T15:47:04Z" Type="FDBLibTLSVerifyPeersParseError" ID="0000000000000000" Config="&quot;Check.Valid=0&quot;" ThreadID="7172399252734082668" Backtrace="addr2line -e backup_agent.debug -p -C -f -i 0xbaeabc 0xbafae0 0xbafe71 0x4bd376 0xba453e 0xb53838 0x906f0f 0x4e1727 0x4c868e 0x7fcd8bc8d555" Machine="9.21.110.43:12821" LogGroup="default" />
<Event Severity="10" Time="1637509624.972362" DateTime="2021-11-21T15:47:04Z" Type="Net2TLSInitError" ID="0000000000000000" Error="tls_error" ErrorDescription="TLS error" ErrorCode="2107" ThreadID="7172399252734082668" Machine="9.21.110.43:12821" LogGroup="default" />

check foundationdb.conf

tls_certificate_file = /etc/foundationdb/fdb.pem
tls_ca_file = /etc/foundationdb/cert.crt
tls_key_file = /etc/foundationdb/private.key
tls_verify_peers = Check.Valid=0

## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4600:tls]
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent --tls_certificate_file=/etc/foundationdb/fdb.pem --tls_ca_file=/etc/foundationdb/cert.crt --tls_key_file=/etc/foundationdb/private.key --tls_verify_peers="Check.Valid=0"
logdir = /var/log/foundationdb

what might be wrong with the file? thank you.

also

 service foundationdb status -l
Redirecting to /bin/systemctl status  -l foundationdb.service
● foundationdb.service - FoundationDB Key-Value Store
   Loaded: loaded (/usr/lib/systemd/system/foundationdb.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2021-11-21 08:03:22 PST; 2min 8s ago
  Process: 14216 ExecStart=/usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize (code=exited, status=0/SUCCESS)
 Main PID: 14217 (fdbmonitor)
    Tasks: 8
   CGroup: /system.slice/foundationdb.service
           ├─14217 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize
           ├─14218 /usr/lib/foundationdb/backup_agent/backup_agent --tls_certificate_file /etc/foundationdb/fdb.pem --tls_ca_file /etc/foundationdb/cert.crt --tls_key_file /etc/foundationdb/private.key --tls_verify_peers "Check.Valid=0" --cluster_file /etc/foundationdb/fdb.cluster --logdir /var/log/foundationdb
           └─14417 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize

Nov 21 08:03:22 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14219 exited 1, restarting in 0 seconds
Nov 21 08:03:22 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14220) for fdbserver.4600
Nov 21 08:03:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:03:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14220 exited 1, restarting in 60 seconds
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14226) for fdbserver.4600
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14226 exited 1, restarting in 65 seconds
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14312) for fdbserver.4600
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14312 exited 1, restarting in 60 seconds

I removed quot for backup_agent command, then got below error:


<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="1637548044.056996" DateTime="2021-11-22T02:27:24Z" Type="Net2Starting" ID="0000000000000000" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057734" DateTime="2021-11-22T02:27:24Z" Type="ProgramStart" ID="0000000000000000" SourceVersion="903f8dfe10139eda94eea31343d83c0385e032e0" Version="6.3.22" PackageName="6.3" ActualTime="1637548044" CommandLine="/usr/lib/foundationdb/backup_agent/backup_agent --tls_certificate_file=/etc/foundationdb/fdb.pem --tls_ca_file=/etc/foundationdb/cert.crt --tls_key_file=/etc/foundationdb/private.key --tls_verify_peers=Check.Valid=1 --cluster_file=/etc/foundationdb/fdb.cluster --logdir=/var/log/foundationdb" MemoryLimit="8589934592" ThreadID="7481557729313328746" TrackLatestType="Original" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="ClientStart" ID="0000000000000000" SourceVersion="903f8dfe10139eda94eea31343d83c0385e032e0" Version="6.3.22" PackageName="6.3" ClusterFile="/etc/foundationdb/fdb.cluster" ConnectionString="tlsdb:tlsdb@9.21.110.43:4600:tls" ActualTime="1637548044" ApiVersion="-1" ImageOffset="(nil)" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" TrackLatestType="Original" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="MachineLoadDetail" ID="0000000000000000" User="3621496" Nice="94294" System="4059454" Idle="2501549489" IOWait="194462" IRQ="0" SoftIRQ="118166" Steal="109750" Guest="0" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="/etc/foundationdb/cert.crt" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/private.key" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="GetMagazineSample" ID="0000000000000000" Size="1024" Backtrace="addr2line -e backup_agent.debug -p -C -f -i 0xbaeabc 0xb37be8 0xb37e59 0x84547f 0x845aff 0x845fe5 0x9067c4 0x4e1727 0x4c868e 0x7f5b50927555" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="9.21.110.43:4600:tls" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="CodeCoverage" ID="0000000000000000" File="/root/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbclient/ReadYourWrites.actor.cpp" Line="1357" Condition="true" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="StartingRunLoopProfilingThread" ID="0000000000000000" Interval="0.125" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="Net2Running" ID="0000000000000000" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.057169" DateTime="2021-11-22T02:27:24Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1637548044.063385" SystemTime="1637548044.063388" OffsetFromSystemTime="-0.000003" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.063433" DateTime="2021-11-22T02:27:24Z" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="9.21.110.43:4600:tls" PeerReferences="2" FailureStatus="OK" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.063433" DateTime="2021-11-22T02:27:24Z" Type="TaskBucketMetrics" ID="88c7c58fedd3f118" Elapsed="0" DispatchSlotChecksStarted="0 -1 0" DispatchErrors="0 -1 0" DispatchDoTasks="0 -1 0" DispatchEmptyTasks="0 -1 0" DispatchSlotChecksComplete="0 -1 0" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="20" Time="1637548044.063433" DateTime="2021-11-22T02:27:24Z" Type="N2_ConnectError" ID="da4ce54e46480a51" SuppressedEventCount="0" ErrorCode="111" Message="Connection refused" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.063433" DateTime="2021-11-22T02:27:24Z" Type="ConnectionTimedOut" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="9.21.110.43:4600:tls" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />
<Event Severity="10" Time="1637548044.063433" DateTime="2021-11-22T02:27:24Z" Type="ConnectionClosed" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="9.21.110.43:4600:tls" ThreadID="7481557729313328746" Machine="9.21.110.43:590" LogGroup="default" />

I managed to fix some error. the current error is:
<Event Severity="20" Time="1637574961.284324" DateTime="2021-11-22T09:56:01Z" Type="N2_AcceptHandshakeError" ID="0000000000000000" ErrorCode="336151576" ErrorMsg="tlsv1 alert unknown ca" BackgroundThread="1" ThreadID="16224195623213777159" Machine="9.21.110.43:4500" LogGroup="default" Roles="CC,CD,DD,MP,MS,RK,RV,SS,TL" />

What does this indicate? Thank you.

I hadn’t noticed the C:\ before. I have a vague memory that TLS isn’t supported on Windows. That might have changed, or my memory might be wrong, but I’m nearly certain it isn’t tested on windows. I wouldn’t be surprised if there’s some windows-specific bugs.

This indeed appears that however you were passing tls_verify_peers to the process, an extra layer of double quotes ended up in the argument string. It looks like you managed to work through that though.

This is a standard OpenSSL error message that you can google elsewhere as well. It looks like the server doesn’t trust the root CA of the certificate chain that the client presented.

This appears to be that either your fdb.pem doesn’t contain the full certification chain, up to and including the cert.crt certificate, or that your cert.crt certificate is a different certificate than the root one in fdb.pem.

I once tried to implement a debugging tool for this, so you can try running
fdbcli --tls_certificate_file=/etc/foundationdb/fdb.pem --tls_ca_file=/etc/foundationdb/cert.crt --tls_key_file=/etc/foundationdb/private.key --debug-tls and see if the output helps at all.

Otherwise, you can google for the normal openssl ways to inspect certificates and try to figure out why OpenSSL doesn’t trust the CA.

1 Like

Thank you @alexmiller for all the informational analysis. I managed it to work.
Here is the foundationdb.conf for others as a reference in case same issue:

[fdbserver]
tls_certificate_file = /etc/foundationdb/fdb.pem
tls_key_file = /etc/foundationdb/private.key
tls_verify_peers = Check.Valid=0

nothing changed under [backup_agent]
use below to restart fdb server instead of using service
fdbserver -C /etc/foundationdb/fdb.cluster -p 9.21.110.43:4500:tls --tls_certificate_file /etc/foundationdb/fdb.pem --tls_key_file /etc/foundationdb/private.key --tls_verify_peers "Check.Valid=0"