I have two machines set up with 4 processes each. Why does fdbcli --exec status details only report 7 processes?
-or-
Why does the process 172.17.0.2:4500 not show up?
The cluster was configured using fdbcli --exec configure new double ssd
This is just a test configuration to play around with. I know it is sub-optimal.
The most likely explanation is that the result is temporary, and if you do a status again the process will show up.
Processes do not register themselves with the cluster until after they have recovered their files off disk, so immediately after rebooting a cluster status may not have information from all processes.
As a side note, there is almost no reason to run an even number of coordinators. With two coordinators, if either of them fails your cluster will be down, which is strictly worse than having just one coordinator.
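If you do want to change the coordinator set, the coordinators command in fdbcli handles it. A minimal sketch (the address below is illustrative):

# Pick one reachable process as the sole coordinator:
fdbcli --exec 'coordinators 172.17.0.2:4500'

# Or let FoundationDB choose a suitable coordinator set automatically:
fdbcli --exec 'coordinators auto'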
My guess is that you have a different fdbserver process running on that port. Maybe a default install of the server is still running. You may find more clues if you grep the trace files for Severity 40 trace messages.
Specifically, the thing to check for is that the fdbserver process running on port 4500 is using the correct cluster file. If it’s using a different cluster file, then it’s actually part of a different database.
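A quick sketch of both checks (paths assume the deb package defaults, so adjust them for your install):

# Look for Severity 40 (error) events in the trace logs:
grep -l 'Severity="40"' /var/log/foundationdb/trace.*.xml

# Check which cluster file the fdbserver listening on 4500 was started with,
# e.g. by inspecting its command line and comparing against your cluster file:
ps -ef | grep fdbserver | grep 4500
cat /etc/foundationdb/fdb.cluster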
I am similarly seeing processes missing that are configured in the foundationdb.conf file, even after waiting. Here is a status details and a conf file that show this behavior:
fdb> status details

Using cluster file `/usr/local/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1
  Desired Logs           - 1

Cluster:
  FoundationDB processes - 2
  Machines               - 2
  Memory availability    - 4.4 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 05/07/18 10:40:45

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.002 GB
  Sum of key-value sizes - 78 MB
  Disk space used        - 173 MB

Operating space:
  Storage server         - 80.0 GB free on most full server
  Log server             - 80.0 GB free on most full server

Workload:
  Read rate              - 21 Hz
  Write rate             - 3 Hz
  Transactions started   - 6 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  127.0.0.1:4689         (  8% cpu; 31% machine; 0.000 Gbps;  0% disk IO; 6.3 GB / 6.5 GB RAM  )
  127.0.0.1:4691         ( 12% cpu; 34% machine; 0.000 Gbps;  0% disk IO; 4.2 GB / 4.4 GB RAM  )

Coordination servers:
  127.0.0.1:4689  (reachable)

Client time: 05/07/18 10:40:45
fdb> exclude
There are currently no servers excluded from the database.
To learn how to exclude a server, type `help exclude'.
fdb>
foundationdb.conf:
$ cat /usr/local/etc/foundationdb/foundationdb.conf
## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://www.foundationdb.org/documentation/configuration.html#foundationdb-conf
[general]
restart_delay = 60
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
# initial_restart_delay = 0
# restart_backoff = 60
# restart_delay_reset_interval = 60
cluster_file = /usr/local/etc/foundationdb/fdb.cluster
# kill_on_configuration_change = true
## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/local/libexec/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /usr/local/foundationdb/data/$ID
logdir = /usr/local/foundationdb/logs
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id =
# datacenter_id =
# class =
# storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =
memory = 3GiB
## An individual fdbserver process with id 4689
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4689]
[fdbserver.4690]
[fdbserver.4691]
[backup_agent]
command = /usr/local/foundationdb/backup_agent/backup_agent
logdir = /usr/local/foundationdb/logs
[backup_agent.1]
There is nothing listening on port 4690, and no log files show up for that process.
In your case, it seems like there isn’t a process running at all. If you don’t see any trace logs for the process, then I would check the fdbmonitor logs (stored in syslog) for any details about why that process may not be starting.
Does killing the process and letting fdbmonitor restart it change anything?
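Something along these lines is what I'd try (a sketch that assumes a Linux host where fdbmonitor writes to /var/log/syslog and that the missing process is the one on port 4690; adjust for your setup):

# fdbmonitor logs to syslog, so look there for messages about the child process:
grep fdbmonitor /var/log/syslog | tail -n 50

# Kill the suspect fdbserver; fdbmonitor should respawn it:
pkill -f 'fdbserver.*4690'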
There are a couple places we could look for more clues. One would be the full status json document (obtained by running status json in fdbcli), although if the process isn’t present in the cluster this may not yield much more that’s useful. The other place is in the trace logs, in particular for the process that seems to be running but isn’t present (the filename of the trace file should include the port number). You’ve determined that there aren’t any errors, but it may be worth looking through it for any other indications that something isn’t going as expected. If you wanted to post either of those here, I could take a look as well.
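A sketch of how to collect both, using the logdir from the conf above and the port of the process you're investigating (adjust both as needed):

# Full machine-readable status:
fdbcli --exec 'status json' > status.json

# Trace logs for the suspect process; the port number is part of the file name:
ls /usr/local/foundationdb/logs/ | grep 4500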
@ajbeamon - The process is present; it shows up as a coordinator. Unlike the other coordinator, though, it does not seem to contribute to the FoundationDB processes count.
I tried killing the targeted fdbserver process; it came back up but still does not show up under Process performance details.
Interestingly enough, the first process of the first machine (172.17.0.2:4500) disappears when the second machine joins the cluster. Could it have been promoted to some other role?
I tried setting the second process of the second machine as the coordinator, and this time the first process of the second machine does not show up. It has logs and they look normal. The missing process also does not show up in the status json (search for 172.17.0.3:4500). Funnily enough, the coordinator process (172.17.0.2:4501) does show up in the process performance details this time.
Here are the logs and status
All processes have the DeviceNotFound event. From looking at the code, I assume that's because of a missing Linux capability on the container, and it should not be related to the issue I'm currently experiencing. Oh, a possibly useful piece of information: I'm using the downloadable deb package. I can try a build from master tomorrow and see if the behavior still arises.
If you have any more ideas based on this, please let me know!
Okay, installing the Debian package runs some kind of initialization which creates a data directory that appears to be bound to a particular cluster ID.
I was providing a cluster file with a different cluster ID, and one of my workers had the same worker ID (4500) as the install-time-created one.
Fix: trash the data directory created when installing the package. (I could probably have used a different worker ID instead.)
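In case it helps anyone else, the fix looked roughly like this (a sketch assuming the deb package's default data directory; verify the path on your system before deleting anything):

# Stop the processes managed by fdbmonitor:
sudo service foundationdb stop
# Remove the data directory created at install time for process 4500:
sudo rm -rf /var/lib/foundationdb/data/4500
# Start again; the process recreates its data directory on startup:
sudo service foundationdb start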
Should fdbserver (and maybe fdbmonitor) have stopped (or crashed) under such circumstances to let me know that this misconfiguration was happening? Did I miss something?
Should installing the package start an fdbserver in order to initialize a database, or should the user be tasked with running fdbcli --exec configure new ...?
I don’t know of any reason why the data files would prevent you from joining a different cluster, unless for some reason they couldn’t be recovered (a process won’t join the cluster until it’s recovered its data files). It should be noted that a process with data files for one cluster will delete those files when it joins a different cluster, though.
It may be worth raising an issue in GitHub to try to reproduce and investigate what’s going on here.
Are you by chance using the exact same installed data files for both of your machines? I was able to reproduce this behavior by duplicating a process’s processId file to other processes in a cluster. If your machines are coming from some image with the same processId file created by the installer for the process at port 4500, then the cluster controller will essentially shut out all but one of those processes from the cluster. However, the other processes will keep running, so they can fulfill their role as coordinator if necessary.
If you are running into the situation I described, then I think deleting the processId file should be sufficient to solve the problem. When your process starts up, it will create a new one to replace it. However, it’s also fine to delete all of the data directory in your image if that makes sense for your use-case.
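Concretely, that would look something like this (a sketch; the path assumes the deb package's default data directory, and the exact file layout may vary by version):

# Stop the cluster processes first:
sudo service foundationdb stop
# Remove only the processId file; a fresh one is generated on the next start:
sudo rm /var/lib/foundationdb/data/4500/processId
sudo service foundationdb start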
“Should installing the package start an fdbserver in order to initialize a database, or should the user be tasked with running fdbcli --exec configure new …”
IMHO it would be nice to let the user choose during installation whether a default server should be configured (maybe switched off via an environment variable).
That would make things easier in container environments (based on my current experience with Kubernetes).
By the way, FoundationDB 3.x was able to form a cluster using nodes from the same image. I used to have scripts (packer+terraform) that did exactly that. Something has probably changed between 3.x and 5.x that altered this behaviour.
That’s right. In 4.0, a feature was added to support setting machine classes through fdbcli. It was desired that this machine class follow the process around even if its data files were moved, so we renamed the existing empty .fdb-lock file and put the process ID in it.
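For context, that's the feature exposed through fdbcli's setclass command; a quick sketch (the address and class are illustrative):

# Assign a machine class to a process; the setting follows its processId around:
fdbcli --exec 'setclass 172.17.0.2:4500 storage'

# With no arguments, setclass lists the classes of all processes:
fdbcli --exec 'setclass'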
The original data folder was created upon package installation, at container build time. So all containers had the same data folder for process *:4500.
That’s a clever reason! Thank you for the explanation; it makes sense.
Should a process that is being “muted” like this show up as such in the status details?
Thank you very much and congrats on the cool work!