The most likely explanation is that the result is temporary, and if you do a status again the process will show up.
Processes do not register themselves with the cluster until after they have recovered their files off disk, so immediately after rebooting a cluster status may not have information from all processes.
As a side note, there is almost no reason to run an even number of coordinators. With two coordinators, if either of them fails your cluster will be down, which is strictly worse than having just one coordinator.
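The reason is that a cluster stays available only while a strict majority of its coordinators is reachable. A quick sketch (plain Python, not FoundationDB code) makes the arithmetic concrete:

```python
# A strict majority of n coordinators must be reachable, so the cluster
# tolerates n - (n // 2 + 1) coordinator failures.
def coordinator_fault_tolerance(n: int) -> int:
    majority = n // 2 + 1
    return n - majority

for n in range(1, 6):
    print(f"{n} coordinator(s): tolerates {coordinator_fault_tolerance(n)} failure(s)")
```

Two coordinators tolerate zero failures, exactly like one, but now there are two processes whose failure takes the cluster down.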
My guess is that you have a different fdbserver process running on that port. Maybe a default install of the server is still running. You may find more clues if you grep the trace files for Severity 40 trace messages.
Specifically, the thing to check for is that the fdbserver process running on port 4500 is using the correct cluster file. If it’s using a different cluster file, then it’s actually part of a different database.
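If grepping by hand is awkward, a small filter over the trace lines works too. This is a minimal sketch: it assumes the usual shape of FDB XML trace events (one `<Event .../>` per line with a `Severity` attribute), and the sample events below are illustrative, not real log output:

```python
import re

def severe_events(lines, threshold=40):
    """Yield trace-log lines whose Severity attribute is >= threshold."""
    pat = re.compile(r'Severity="(\d+)"')
    for line in lines:
        m = pat.search(line)
        if m and int(m.group(1)) >= threshold:
            yield line

# Illustrative lines in the shape of FDB XML trace events
# (the Type values here are made up for the example):
sample = [
    '<Event Severity="10" Time="1.0" Type="Role" Machine="127.0.0.1:4500"/>',
    '<Event Severity="40" Time="2.0" Type="BindFailed" Machine="127.0.0.1:4500"/>',
]
for line in severe_events(sample):
    print(line)
```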
I am similarly seeing processes missing that are configured in the foundationdb.conf file, even after waiting. Here are a status details output and a conf file that show this behavior:
fdb> status details
Using cluster file `/usr/local/etc/foundationdb/fdb.cluster'.
Redundancy mode - single
Storage engine - ssd-2
Coordinators - 1
Desired Logs - 1
FoundationDB processes - 2
Machines - 2
Memory availability - 4.4 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 05/07/18 10:40:45
Replication health - Healthy (Rebalancing)
Moving data - 0.002 GB
Sum of key-value sizes - 78 MB
Disk space used - 173 MB
Storage server - 80.0 GB free on most full server
Log server - 80.0 GB free on most full server
Read rate - 21 Hz
Write rate - 3 Hz
Transactions started - 6 Hz
Transactions committed - 1 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
127.0.0.1:4689 ( 8% cpu; 31% machine; 0.000 Gbps; 0% disk IO; 6.3 GB / 6.5 GB RAM )
127.0.0.1:4691 ( 12% cpu; 34% machine; 0.000 Gbps; 0% disk IO; 4.2 GB / 4.4 GB RAM )
Client time: 05/07/18 10:40:45
There are currently no servers excluded from the database.
To learn how to exclude a server, type `help exclude'.
$ cat /usr/local/etc/foundationdb/foundationdb.conf
## Configuration file for FoundationDB server processes
## Full documentation is available at
[general]
restart_delay = 60
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
# initial_restart_delay = 0
# restart_backoff = 60
# restart_delay_reset_interval = 60
cluster_file = /usr/local/etc/foundationdb/fdb.cluster
# kill_on_configuration_change = true

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/local/libexec/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /usr/local/foundationdb/data/$ID
logdir = /usr/local/foundationdb/logs
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id =
# datacenter_id =
# class =
# storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =
memory = 3GiB

## An individual fdbserver process with id 4689
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4689]
[fdbserver.4690]
[fdbserver.4691]

[backup_agent]
command = /usr/local/foundationdb/backup_agent/backup_agent
logdir = /usr/local/foundationdb/logs
There is nothing listening on 4690 and no log files show up.
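One quick way to confirm which configured ports actually have a listener is a plain TCP connect check (a sketch; the port list is taken from the conf above and the host assumes processes bound to localhost):

```python
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (4689, 4690, 4691):
    print(port, is_listening("127.0.0.1", port))
```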
In your case, it seems like there isn’t a process running at all. If you don’t see any trace logs for the process, then I would check the fdbmonitor logs (stored in syslog) for any details about why that process may not be starting.
Does killing the process and letting fdbmonitor restart it change anything?
There are a couple places we could look for more clues. One would be the full status json document (obtained by running status json in fdbcli), although if the process isn’t present in the cluster this may not yield much more that’s useful. The other place is in the trace logs, in particular for the process that seems to be running but isn’t present (the filename of the trace file should include the port number). You’ve determined that there aren’t any errors, but it may be worth looking through it for any other indications that something isn’t going as expected. If you wanted to post either of those here, I could take a look as well.
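For example, to see which expected processes the cluster does not report, you can diff the addresses in the json document against the ones you configured. This is a sketch assuming the `cluster.processes` layout of `status json` (processes keyed by id, each with an `address` field); the sample document below is abbreviated and illustrative:

```python
import json

def reported_addresses(status_doc):
    """Collect the network address of every process present in a
    `status json` document (cluster.processes, keyed by process id)."""
    procs = status_doc.get("cluster", {}).get("processes", {})
    return {p["address"] for p in procs.values()}

# Abbreviated, illustrative status document:
status = json.loads('''
{"cluster": {"processes": {
    "abc123": {"address": "172.17.0.2:4501"},
    "def456": {"address": "172.17.0.3:4501"}
}}}
''')

expected = {"172.17.0.2:4500", "172.17.0.2:4501",
            "172.17.0.3:4500", "172.17.0.3:4501"}
missing = expected - reported_addresses(status)
print(sorted(missing))  # processes the cluster does not report
```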
Interestingly enough, the first process of the first machine (172.17.0.2:4500) disappears when the second machine joins the cluster. Could it have been promoted to another role?
I tried setting the second process of the second machine as the coordinator, and this time the first process of the second machine does not show up. It has logs and they look normal. The missing process also does not appear in the status json (search for 172.17.0.3:4500). Funnily enough, the coordinator process (172.17.0.2:4501) shows up in the process performance details this time.
Here are the logs and status
All processes have the DeviceNotFound event. From looking at the code, I assume that’s because of a missing Linux capability (on the container) and that it should not be related to the issue I’m currently experiencing. Oh, a maybe useful piece of information: I’m using the downloadable deb package. I can try from master tomorrow and see if the behavior still arises.
If you have any more ideas based on this, please let me know!
I don’t know of any reason why the data files would prevent you from joining a different cluster, unless for some reason they couldn’t be recovered (a process won’t join the cluster until it’s recovered its data files). It should be noted that a process with data files for one cluster will delete those files when it joins a different cluster, though.
It may be worth raising an issue in GitHub to try to reproduce and investigate what’s going on here.
Are you by chance using the exact same installed data files for both of your machines? I was able to reproduce this behavior by duplicating a process’s processId file to other processes in a cluster. If your machines are coming from some image with the same processId file created by the installer for the process at port 4500, then the cluster controller will essentially shut out all but one of those processes from the cluster. However, the other processes will keep running, so they can fulfill their role as coordinator if necessary.
If you are running into the situation I described, then I think deleting the processId file should be sufficient to solve the problem. When your process starts up, it will create a new one to replace it. However, it’s also fine to delete all of the data directory in your image if that makes sense for your use-case.
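If you want to check whether your machines were cloned with the same identity, you can compare the processId files across data directories. A sketch, assuming one data directory per process with a `processId` file inside (the demo directories and id contents below are made up):

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def find_duplicate_process_ids(datadirs):
    """Group data directories by the contents of their processId file;
    any group with more than one directory indicates cloned processes."""
    by_id = defaultdict(list)
    for d in map(Path, datadirs):
        pid_file = d / "processId"
        if pid_file.exists():
            by_id[pid_file.read_bytes()].append(str(d))
    return {k: v for k, v in by_id.items() if len(v) > 1}

# Demo with throwaway directories standing in for the real datadirs
# (e.g. /usr/local/foundationdb/data/$ID); ids here are fabricated:
root = Path(tempfile.mkdtemp())
for port, pid in [("4500", b"aaaa"), ("4501", b"bbbb"), ("4502", b"aaaa")]:
    (root / port).mkdir()
    (root / port / "processId").write_bytes(pid)

dups = find_duplicate_process_ids(root.iterdir())
print(dups)  # directories 4500 and 4502 share an id
```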
By the way, FoundationDB 3.x was able to form a cluster using nodes from the same image. I used to have scripts (packer+terraform) that did exactly that. Something has probably changed between 3.x and 5.x that altered this behaviour.
That’s right. In 4.0, a feature was added to support setting machine classes through fdbcli. It was desired that this machine class follow the process around even if its data files were moved, so we renamed the existing empty .fdb-lock file and put the process ID in it.