Issues with V6.2 TLS Cluster

I am able to bring up non-TLS clusters with FDB v6.2, but I have not been able to create a TLS v6.2 cluster.

Following the instructions in https://apple.github.io/foundationdb/tls.html, we tried two approaches:

  • Setting Up FoundationDB to use TLS
  • Converting an existing cluster to use TLS (since v6.1).

Neither succeeded. Here are some details.

Our env:
FDB v6.2.10
Ubuntu 18.04
Using Docker containers with Kubernetes.
Each storage K8s pod runs two fdbserver processes, on ports 4500 and 4501.

  1. “Setting Up FoundationDB to use TLS” approach

When using the “-t” flag with make_public.py (at Docker container startup), we see that the coordinator in the fdb.cluster file has the “:tls” suffix, e.g.,
4QzbRd0m:ItkkDDfp@10.104.193.12:4500:tls

However, the processes in the container look abnormal. There are 3 fdbmonitor processes but only 1 fdbserver process (there should be 2, at 4500 and 4501):

root@tls62-storage-01-76cd5f5b99-trqhh:~# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         6  0.0  0.0  12644  1052 ?        S    07:26   0:00 /usr/lib/foundationdb/fdbmonitor
root        20  0.0  0.0  21076  3016 ?        Ss   07:26   0:00 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize
foundat+    34  0.1  0.0 331732 19420 ?        Sl   07:26   0:00 /usr/sbin/fdbserver --class storage --cluster_file /var/lib/foundationdb/fdb.cluster --datacenter_id dc1 --datadir /var/lib/foundat
root        40  0.0  0.0  34448 12200 ?        Sl   07:26   0:00 fdbcli --no-status --exec configure new single memory
root        47  0.0  0.0  21076   248 ?        S    07:30   0:00 /usr/lib/foundationdb/fdbmonitor --conffile /etc/foundationdb/foundationdb.conf --lockfile /var/run/fdbmonitor.pid --daemonize

Also, there is a trace file under /var/log/foundationdb associated with 4501, but none for 4500.

root@tls62-storage-01-76cd5f5b99-trqhh:/var/log# ll foundationdb/
total 1604
-rw-r--r-- 1 foundationdb foundationdb     743 Jan 23 07:26 trace.10.104.193.12.22.1579764417.yVrDKA.0.1.xml
-rw-r--r-- 1 foundationdb foundationdb 1457778 Jan 23 07:42 trace.10.104.193.12.4501.1579764417.KoUk1T.0.1.xml

The resulting db is unavailable.

  2. “Converting an existing cluster to use TLS (since v6.1)” approach
    Removing the “-t” flag from make_public.py, we created the whole cluster without TLS first and then converted it to TLS, following the instructions in the doc.

Restarting fdbserver gave me this error:

root@tls62-storage-01-76cd5f5b99-c9w22:~# fdbserver -C /var/lib/foundationdb/fdb.cluster -p 10.104.193.12:4500 -p 10.104.193.12:4501:tls

    Error initializing networking with public address 10.104.193.12:4500 and listen address 10.104.193.12:4500 (Local address in use)

    Try `fdbserver --help' for more information.

Please help. Thank you.

Leo

fdbmonitor starts child processes by forking itself. In the cases where the child process fails and requires a backoff, you’ll see these extra fdbmonitor processes before it execs the child executable again. Check the fdbmonitor logs (in syslog) for information about why the child processes are failing.
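
On Ubuntu 18.04 the fdbmonitor messages typically land in /var/log/syslog; assuming syslog is running in your containers, a filter like this should surface them:

grep fdbmonitor /var/log/syslog | tail -n 50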

Is there any other process already using the 4500 port? Or did this happen to eventually clear itself up?

Hi AJ,

I checked port 4500. I believe it’s not in use.

root@tls62wt222-storage-01-7fdf658588-m8w5l:~# netstat -peanut
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode       PID/Program name
tcp        0      0 10.104.193.72:4501      0.0.0.0:*               LISTEN      105        2489893358  -               
root@tls62wt222-storage-01-7fdf658588-m8w5l:~# 

Note: You may notice the IP above is different from my previous post. That’s because I rebuilt the Docker image with the netstat utility in order to see the ports in use, and created a new fdb cluster (tls62wt222).

From the fdb trace files, we suspect the following 2 issues:

  • The “4500:tls” process could not be started.
  • The “make_public.py -t” command marked 4500 as “:tls” but did not mark 4501 as TLS.

I’d like to upload the trace files for your reference, but it seems only image files can be uploaded to the forum.

Thank you.

Leo

The documentation on enabling TLS is pretty short. The examples seem to assume a single-node cluster at 127.0.0.1.

I have some questions on how to apply the procedures to a multi-node cluster, regarding the “Enabling TLS in a new cluster” section:

  • Do I run “make_public.py -t” only on the first node or on every node?
  • When a node has multiple processes, how can I enable TLS on all of them? From my previous message, it seems the server restart only affected the processes listed in fdb.cluster as coordinators.

Thanks.
Leo

@Apple Do you have a more detailed doc on TLS enabling?

@Community Has anyone enabled TLS on a multi-node v6.2 cluster successfully? I’d appreciate you sharing your experience.

Leo

make_public.py is intended to be run on a single-node cluster as you are preparing to build it up (see Building a Cluster — FoundationDB 7.1). A roughly similar procedure (though I’m not sure whether make_public can help) is described in the pre-6.1 steps:

https://apple.github.io/foundationdb/tls.html?highlight=make_public#converting-an-existing-cluster-to-use-tls-v6-1

Processes use TLS if the coordinators in the cluster file are configured to use TLS (note that you’ll also need to configure the TLS-related parameters, such as your key files, on every process). Apparently, in mixed mode you’ll configure two addresses so that you can talk to both TLS and non-TLS coordinators, but if you are only using TLS coordinators then you should only need one address.
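
For illustration, a mixed coordinator list in fdb.cluster would look something like this (reusing the description and ID from your earlier post; the second address is made up, and the :tls suffix applies per coordinator):

4QzbRd0m:ItkkDDfp@10.104.193.12:4500:tls,10.104.193.13:4500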

The documentation is a little confusing here, and I personally haven’t done anything with converting clusters between non-TLS and TLS, so I’m not sure if there are any hidden gotchas that I’m unaware of.

I have had difficulty configuring the mixed mode. Simply switching one coordinator to TLS makes the db unavailable.

To speed things up and to be able to check syslog, I moved to testing with a single-node, multi-process cluster on a VM. (Our Docker containers do not run syslog.) I still have issues in this environment.

I started with one coordinator (“fdbserver.4501:tls”) in TLS and restarted the fdb service; the corresponding process for 4501 then would not start. The certificate-related files had been configured in foundationdb.conf. There was an error in syslog:
ERROR: TLS state of public address auto:4501 does not match in coordinator list.

The error is mentioned in a closed bug report for v6.1, so I am not sure why I still see it.

If I revert all coordinators to non-TLS, the coordinators are reachable, but the db is still unavailable.

fdb> status details 
Using cluster file `/etc/foundationdb/fdb.cluster'.

The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.
  10.169.165.52:4501  (reachable)
  10.169.165.52:4502  (reachable)
  10.169.165.52:4503  (reachable)

Unable to locate the data distributor worker.
Unable to locate the ratekeeper worker.

I had started a total of 10 fdb processes. Why are the data distributor and ratekeeper workers not available?

Versions I am using: FDB v6.2.11 on Ubuntu 18.04.

Thanks.
Leo

The current fdb relies upon the coordinators for TLS settings, instead of having an independent flag or flags for TLS. I don’t quite understand the design rationale. Can you please explain it and give more details on how the current mechanism works? Thank you.

Leo

This isn’t clear to me from the documentation, but it appears you’ll need to specify your public address in a format that is consistent with the TLS state of the process in the coordinators list. In other words, if you have a process 127.0.0.1:4500:tls in your cluster file, that process needs to have the public address 127.0.0.1:4500:tls. If you don’t include :tls in the process’s public address, you’ll get this error.

It’s possible you’ll need to update all of your processes to have a :tls public address (possibly in addition to a non-TLS one).
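
As a rough sketch, assuming the stock foundationdb.conf layout, you could give every process a TLS public address in one place; the certificate paths below are placeholders for wherever your files actually live:

[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID:tls
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
tls_certificate_file = /etc/foundationdb/cert.cert
tls_key_file = /etc/foundationdb/key-file.key
tls_ca_file = /etc/foundationdb/ca-cert.cert

[fdbserver.4500]
[fdbserver.4501]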

I’m not sure of the history of what’s been done to this database, but at this point there seems to be no data (at least with respect to the coordinators’ state). Hopefully this is a case where you never had data, in which case you can set up the database using configure new <storage engine> <replication mode>, but if you had data and lost it, we’d have to try to figure out what happened. For example, if you’ve been moving coordinators around without using fdbcli, or manually manipulating files, it’s possible the cluster is no longer aware of the data that’s there.
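
For example, to match the configuration used at container startup earlier in this thread:

fdbcli --exec 'configure new single memory'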

I restarted everything from scratch (uninstalled and reinstalled the fdb software) and took your advice to keep the coordinator specification and the process’s public address consistent. I’m now able to get a step further.

I started with one process at 4500. The db was available. Then I ran sudo /usr/lib/foundationdb/make_public.py -t. Afterwards, I needed to restart the 4500 process with a TLS public address, as follows:

sudo kill 6580    # kill the old 4500 process first

sudo -u foundationdb /usr/sbin/fdbserver -C /etc/foundationdb/fdb.cluster --datadir /var/lib/foundationdb/data/4500 --listen_address public --logdir /var/log/foundationdb --public_address auto:4500:tls --tls_ca_file /etc/foundationdb/ca-cert.cert --tls_certificate_file /etc/foundationdb/cert.cert --tls_key_file /etc/foundationdb/key-file.key &

The db was OK.

I added two more processes, 4501 and 4502, to foundationdb.conf (a sketch of the conf sections follows the status output below). The db was OK with one TLS process and two non-TLS processes:

Process performance details:
      10.169.165.52:4501     (  2% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
      10.169.165.52:4502     (  3% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
      10.169.165.52:4500:tls (  3% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
Coordination servers:
      10.169.165.52:4500:tls  (reachable)
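
For reference, under the stock foundationdb.conf layout, adding processes is just a matter of appending empty port sections that inherit the [fdbserver] defaults:

[fdbserver.4500]
[fdbserver.4501]
[fdbserver.4502]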

Then I added TLS to 4501 in fdb.cluster, killed the old 4501 process, and restarted it with TLS:

sudo -u foundationdb /usr/sbin/fdbserver -C /etc/foundationdb/fdb.cluster -d /var/lib/foundationdb/data/4501 -L /var/log/foundationdb -p auto:4501:tls --tls_ca_file /etc/foundationdb/ca-cert.cert --tls_certificate_file /etc/foundationdb/cert.cert --tls_key_file /etc/foundationdb/key-file.key &

It took a couple of minutes for the db to stabilize. The db was available, but I got this status details output, with a warning/error under 4500:

Process performance details:
  10.169.165.52:4502     (  2% cpu;  3% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
  10.169.165.52:4500:tls (  2% cpu;  3% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
    Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.
  10.169.165.52:4501:tls (  3% cpu;  3% machine; 0.000 Gbps;  0% disk IO; 0.4 GB / 2.8 GB RAM  )
Coordination servers:
  10.169.165.52:4500:tls  (reachable)
  10.169.165.52:4501:tls  (reachable)

What could have caused the warning/error “Cluster file contents do not match current cluster connection string”?

Thanks
Leo

After I enabled 4502 for TLS (done the same way as for 4501), I got this:

fdb> status
Using cluster file `/etc/foundationdb/fdb.cluster'.
Locking coordination state. Verify that a majority of coordination server
processes are active.
  10.169.165.52:4500:tls  (reachable)
  10.169.165.52:4501:tls  (reachable)
  10.169.165.52:4502:tls  (reachable)

Unable to locate the data distributor worker.
Unable to locate the ratekeeper worker.

fdb> coordinators 
Cluster description: C0klmsUk
Cluster coordinators (3): 10.169.165.52:4500:tls,10.169.165.52:4501:tls,10.169.165.52:4502:tls

The error message has a few suggestions for this – namely that the permissions of the file or its folder are insufficient or that the file may have been modified externally. Have you ruled those out? In particular, if you are manually editing your cluster file, then you’ll probably need to do so for every process in your cluster and then restart all of the processes.
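
For example, on a stock install the fdbserver processes run as the foundationdb user, which must be able to write both the cluster file and its directory; something like this is a quick sanity check (adjust paths and ownership to your layout):

ls -ld /etc/foundationdb /etc/foundationdb/fdb.cluster
sudo chown foundationdb:foundationdb /etc/foundationdb/fdb.cluster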

If you are going the route of converting your cluster to TLS while it is a single process, then my recommendation would be to add only TLS processes to it afterward. Although it should be possible to convert the cluster to mixed mode and back again, I don’t think it’s necessary to add non-TLS processes.