Nodes not joining cluster, `coordinators auto` hanging

danthegoodman · March 10, 2024, 1:07pm

I’ve created 2 Ubuntu machines on Hetzner Cloud with port 4500 open. I installed the 7.3.27-1 client and then the server package on each, and ran the make_public script on one server.

I copied that cluster file to the other server, replacing the existing cluster file, and restarted the server. I am now observing the following 2 issues:

Both servers show the following for the status command:

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - memory
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 1
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1 (less 0 excluded; 1 with errors)
  Zones                  - 1
  Machines               - 1
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 03/10/24 13:28:50

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 0 MB
  Disk space used        - 105 MB

Operating space:
  Storage server         - 1.0 GB free on most full server
  Log server             - 144.1 GB free on most full server

Workload:
  Read rate              - 36 Hz
  Write rate             - 1 Hz
  Transactions started   - 16 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 03/10/24 13:28:49

I would expect the number of machines to be 2. Both servers appear healthy, and service foundationdb status shows no issues either:

Mar 10 13:02:04 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="fdbmonitor": Loading configuration /etc/foundationdb/foundationdb.conf
Mar 10 13:02:04 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="fdbmonitor": Starting backup_agent.1
Mar 10 13:02:04 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="fdbmonitor": Starting fdbserver.4500
Mar 10 13:02:04 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="fdbserver.4500": Launching /usr/sbin/fdbserver (1220) for fdbserver.4500
Mar 10 13:02:04 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/backup_agent/backup_agent (1>
Mar 10 13:02:05 ubuntu-16gb-hil-1 fdbmonitor[1218]: LogGroup="default" Process="fdbserver.4500": FDBD joined cluster.

(this is from the non-coordinator server)

I see the same behavior when adding a third server: it does not join the cluster, and the status output is unchanged. I can confirm that curl -vvv telnet://<coord_ip>:4500 connects, so the coordinator is definitely listening.
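For anyone repeating the reachability check from a script, here is a minimal sketch of the same test as the curl command above. The function name and the timeout default are my own choices, and 4500 is just the port used in this thread. A successful connect only shows that something is listening on that port, not that the process has joined the cluster.

```python
import socket


def can_connect(host: str, port: int = 4500, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling can_connect("<coord_ip>") from each non-coordinator machine should give the same answer as the curl check.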

The coordinators auto command hangs:

fdb> coordinators auto

WARNING: Long delay (Ctrl-C to interrupt)

The database is available.

It just sits there forever.

danthegoodman · March 10, 2024, 1:12pm

Also, following Building a Cluster — FoundationDB 7.1, I get this unexpected error:

fdb> configure single ssd
ERROR: Storage engine type cannot be changed because storage_migration_type=disabled.
Type `configure perpetual_storage_wiggle=1 storage_migration_type=gradual' to enable gradual migration with the perpetual wiggle, or `configure storage_migration_type=aggressive' for aggressive migration.

It seems the guide needs to be updated. It also doesn’t mention installing the server after the client, and the version linked for downloads is 6.x: Downloads — FoundationDB 7.1

I’ve also tried this on DigitalOcean, with the same results.

danthegoodman · March 10, 2024, 2:18pm

I noticed the error on the second node and found this event:
<Event Severity="40" ErrorKind="Unset" Time="1710079989.064443" DateTime="2024-03-10T14:13:09Z" Type="ZombieProcess" ID="0000000000000000" Error="invalid_cluster_id" ErrorDescription="Attempted to join cluster with a different cluster ID" ErrorCode="1217" ThreadID="1366905691540591893" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x545d6fd 0x545d9c3 0x5457bc4 0x355fc91 0x3561f17 0x1b7d039 0x1b7d039 0x225be87 0x225c0c7 0x1b7d039 0x354211c 0x351afb9 0x53e3998 0x32202f5 0x7f951658ed90" Machine="5.161.227.51:4500" LogGroup="default" />

I’ve verified the cluster file is the same on all machines, so I’m not sure where else it is getting this cluster ID from.
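A minimal sketch for comparing cluster files programmatically, assuming the usual description:ID@ip:port,ip:port connection string with optional # comment lines. The class and function names are hypothetical. Note that the ID in the file is not necessarily the cluster ID the error refers to (that is my assumption), so matching files rule out only one possible cause.

```python
from typing import NamedTuple, Tuple


class ClusterFile(NamedTuple):
    description: str
    cluster_id: str
    coordinators: Tuple[str, ...]


def parse_cluster_file(text: str) -> ClusterFile:
    """Parse the contents of an fdb.cluster file into comparable fields."""
    # Drop blank lines and comment lines, leaving the connection string.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if len(lines) != 1:
        raise ValueError("expected exactly one connection string")
    prefix, at, addrs = lines[0].partition("@")
    description, colon, cluster_id = prefix.partition(":")
    if not at or not colon or not addrs:
        raise ValueError("expected description:ID@host:port[,host:port...]")
    # Sort so that coordinator order does not affect equality.
    coords = tuple(sorted(a.strip() for a in addrs.split(",") if a.strip()))
    return ClusterFile(description, cluster_id, coords)
```

Two machines agree when parse_cluster_file returns equal values for the contents of /etc/foundationdb/fdb.cluster on each.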

danthegoodman · March 10, 2024, 2:21pm

Ugh, to solve it I had to run service foundationdb stop, rm -rf /var/lib/foundationdb/data/, and service foundationdb start.

The Building a Cluster guide should either mention that, or say that the cluster file must be copied BEFORE starting fdb on the new server.
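Those three steps could be scripted roughly as below. This is a sketch only: the function names are hypothetical, the data path is the one from this thread, and it defaults to a dry run because the middle step deletes everything the node has stored. It is only reasonable on a fresh node that holds no data you care about.

```python
import subprocess

# Data directory as used in this thread; adjust if your install differs.
DATA_DIR = "/var/lib/foundationdb/data/"


def reset_commands(data_dir: str = DATA_DIR) -> list:
    """Return the stop / wipe / start commands in the order they must run."""
    return [
        ["service", "foundationdb", "stop"],
        ["rm", "-rf", data_dir],  # destructive: removes all local FDB data
        ["service", "foundationdb", "start"],
    ]


def run_reset(data_dir: str = DATA_DIR, dry_run: bool = True) -> list:
    """Print the commands (dry run) or execute them, stopping on failure."""
    cmds = reset_commands(data_dir)
    for cmd in cmds:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

Run it with dry_run=True first and check the path before passing dry_run=False as root.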

coordinators auto is still hanging. It works if I specify the coordinators manually. Maybe it’s hanging because it can’t reach a desired count? If so, it should probably time out and log that.
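Since the manual form works, one way to script it without risking an indefinite hang is to pass the command to fdbcli with --exec and enforce a timeout from the caller. A sketch, with hypothetical function names and a 30-second timeout that is an arbitrary choice of mine:

```python
import subprocess


def coordinators_command(addresses: list) -> list:
    """Build the fdbcli argv that sets an explicit coordinator list."""
    if not addresses:
        raise ValueError("at least one coordinator address is required")
    return ["fdbcli", "--exec", "coordinators " + " ".join(addresses)]


def set_coordinators(addresses: list, timeout_s: float = 30.0) -> bool:
    """Run the command; return False on a non-zero exit or a timeout."""
    try:
        result = subprocess.run(
            coordinators_command(addresses),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # kills fdbcli if it sits there too long
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

For example, set_coordinators(["<ip1>:4500", "<ip2>:4500", "<ip3>:4500"]) with the real addresses filled in.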
