What do the logs look like on the server side around when the client was connecting?
The client side of this reads like the server rejected the client for some reason, and there’s likely a TLS error or connection error in the server logs that would say why.
Thank you @alexmiller for looking into it. The client and server are on the same machine. So the log I got is from C:\ProgramData\foundationdb\logs, so this is server log right?
do we have like debug parameter to be enabled to get more info?
I couldn’t find other log.
From the current setup, is everything so far all good?
service foundationdb status -l
Redirecting to /bin/systemctl status -l foundationdb.service
● foundationdb.service - FoundationDB Key-Value Store
Loaded: loaded (/usr/lib/systemd/system/foundationdb.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2021-11-21 08:03:22 PST; 2min 8s ago
Process: 14216 ExecStart=/usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize (code=exited, status=0/SUCCESS)
Main PID: 14217 (fdbmonitor)
Tasks: 8
CGroup: /system.slice/foundationdb.service
├─14217 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize
├─14218 /usr/lib/foundationdb/backup_agent/backup_agent --tls_certificate_file /etc/foundationdb/fdb.pem --tls_ca_file /etc/foundationdb/cert.crt --tls_key_file /etc/foundationdb/private.key --tls_verify_peers "Check.Valid=0" --cluster_file /etc/foundationdb/fdb.cluster --logdir /var/log/foundationdb
└─14417 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize
Nov 21 08:03:22 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14219 exited 1, restarting in 0 seconds
Nov 21 08:03:22 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14220) for fdbserver.4600
Nov 21 08:03:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:03:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14220 exited 1, restarting in 60 seconds
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14226) for fdbserver.4600
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:04:23 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14226 exited 1, restarting in 65 seconds
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Launching /usr/sbin/fdbserver (14312) for fdbserver.4600
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": ERROR: TLS state of public address auto:4600 does not match in coordinator list.
Nov 21 08:05:28 CDCDevLinuxFyre1.fyre.ibm.com fdbmonitor[14217]: LogGroup="default" Process="fdbserver.4600": Process 14312 exited 1, restarting in 60 seconds
I managed to fix some error. the current error is: <Event Severity="20" Time="1637574961.284324" DateTime="2021-11-22T09:56:01Z" Type="N2_AcceptHandshakeError" ID="0000000000000000" ErrorCode="336151576" ErrorMsg="tlsv1 alert unknown ca" BackgroundThread="1" ThreadID="16224195623213777159" Machine="9.21.110.43:4500" LogGroup="default" Roles="CC,CD,DD,MP,MS,RK,RV,SS,TL" />
I hadn’t noticed the C:\ before. I have a vague memory that TLS isn’t supported on Windows. That might have changed, or my memory might be wrong, but I’m nearly certain it isn’t tested on windows. I wouldn’t be surprised if there’s some windows-specific bugs.
This indeed appears that however you were passing tls_verify_peers to the process, an extra layer of double quotes ended up in the argument string. It looks like you managed to work through that though.
This is a standard OpenSSL error message that you can google elsewhere as well. It looks like the server doesn’t trust the root CA of the certificate chain that the client presented.
This appears to be that either your fdb.pem doesn’t contain the full certification chain, up to and including the cert.crt certificate, or that your cert.crt certificate is a different certificate than the root one in fdb.pem.
I once tried to implement a debugging tool for this, so you can try running fdbcli --tls_certificate_file=/etc/foundationdb/fdb.pem --tls_ca_file=/etc/foundationdb/cert.crt --tls_key_file=/etc/foundationdb/private.key --debug-tls and see if the output helps at all.
Otherwise, you can google for the normal openssl ways to inspect certificates and try to figure out why OpenSSL doesn’t trust the CA.
Thank you @alexmiller for all the informational analysis. I managed it to work.
Here is the foundationdb.conf for others as a reference in case same issue:
nothing changed under [backup_agent]
use below to restart fdb server instead of using service fdbserver -C /etc/foundationdb/fdb.cluster -p 9.21.110.43:4500:tls --tls_certificate_file /etc/foundationdb/fdb.pem --tls_key_file /etc/foundationdb/private.key --tls_verify_peers "Check.Valid=0"