How are 'contributing_workers' computed?

Hello FDB friends,

I’d like to understand better what’s going on here: https://gist.github.com/marctrem/0287203daf2dc59099282be210c7c14c

I have two machines set up with 4 processes each. Why does fdbcli --exec status details only report 7 processes?
-or-
Why does the process 172.17.0.2:4500 not show up?

The cluster was configured using fdbcli --exec configure new double ssd

This is just a test configuration to play around. I know it is sub-optimal.

Thank you very much,
Marc

The most likely explanation is that the result is temporary, and if you do a status again the process will show up.

Processes do not register themselves with the cluster until after they have recovered their files off disk, so immediately after rebooting a cluster status may not have information from all processes.

As a side note, there is almost no reason to run an even number of coordinators. With two coordinators, if either of them fails your cluster will be down, which is strictly worse than having just one coordinator.
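
For example, to move to a single coordinator, a minimal sketch (the address is illustrative, and this assumes fdbcli can reach the cluster through the default cluster file):

    # Set a single, specific coordinator
    fdbcli --exec "coordinators 172.17.0.2:4500"

    # Or let FoundationDB pick a suitable set of coordinators automatically
    fdbcli --exec "coordinators auto"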

Hi @Evan,

Thank you for replying!

It’s been an hour and it still does not show up! Also, this is a freshly instantiated cluster. (It has not been through any kind of failure.)

That’s part of what I meant by saying my configuration was sub-optimal :stuck_out_tongue:

Do you have any other guesses as to why this might happen?

Thank you very much,
Marc

My guess is that you have a different fdbserver process running on that port. Maybe a default install of the server is still running. You may find more clues if you grep the trace files for Severity 40 trace messages.

Specifically, the thing to check for is that the fdbserver process running on port 4500 is using the correct cluster file. If it’s using a different cluster file, then it’s actually part of a different database.
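
A quick sketch for checking both things from a shell (the log directory and trace-file naming are assumptions based on a typical Linux install):

    # See which cluster file (if any) each fdbserver was started with
    ps -ww -o pid,args -C fdbserver

    # Look for Severity 40 events in the trace logs
    grep -l 'Severity="40"' /var/log/foundationdb/trace.*.xml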

I am similarly seeing processes missing that are configured in the foundationdb.conf file, even after waiting. Here are a status details output and a conf file that show this behavior:

    fdb> status details

Using cluster file `/usr/local/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1
  Desired Logs           - 1

Cluster:
  FoundationDB processes - 2
  Machines               - 2
  Memory availability    - 4.4 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 05/07/18 10:40:45

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.002 GB
  Sum of key-value sizes - 78 MB
  Disk space used        - 173 MB

Operating space:
  Storage server         - 80.0 GB free on most full server
  Log server             - 80.0 GB free on most full server

Workload:
  Read rate              - 21 Hz
  Write rate             - 3 Hz
  Transactions started   - 6 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  127.0.0.1:4689         (  8% cpu; 31% machine; 0.000 Gbps;  0% disk IO; 6.3 GB / 6.5 GB RAM  )
  127.0.0.1:4691         ( 12% cpu; 34% machine; 0.000 Gbps;  0% disk IO; 4.2 GB / 4.4 GB RAM  )

Coordination servers:
  127.0.0.1:4689  (reachable)

Client time: 05/07/18 10:40:45

fdb> exclude
There are currently no servers excluded from the database.
To learn how to exclude a server, type `help exclude'.
fdb> 

foundationdb.conf:

$ cat /usr/local/etc/foundationdb/foundationdb.conf
## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://www.foundationdb.org/documentation/configuration.html#foundationdb-conf

[general]
restart_delay = 60
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
# initial_restart_delay = 0
# restart_backoff = 60
# restart_delay_reset_interval = 60
cluster_file = /usr/local/etc/foundationdb/fdb.cluster
# kill_on_configuration_change = true

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/local/libexec/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /usr/local/foundationdb/data/$ID
logdir = /usr/local/foundationdb/logs
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id = 
# datacenter_id = 
# class = 
# storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =
memory = 3GiB

## An individual fdbserver process with id 4689
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4689]
[fdbserver.4690]
[fdbserver.4691]

[backup_agent]
command = /usr/local/foundationdb/backup_agent/backup_agent
logdir = /usr/local/foundationdb/logs

[backup_agent.1]

There is nothing listening on 4690 and no log files show up.

In your case, it seems like there isn’t a process running at all. If you don’t see any trace logs for the process, then I would check the fdbmonitor logs (stored in syslog) for any details about why that process may not be starting.
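
A rough sketch of where to look (the log location, service layout, and port are assumptions; adjust to your system):

    # fdbmonitor logs to syslog by default
    grep fdbmonitor /var/log/syslog

    # Confirm whether anything is actually listening on the missing port
    ss -tlnp | grep 4690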

@Evan @ajbeamon -

I only have one cluster configuration file on the system and here is the process tree: https://gist.github.com/marctrem/5bd9a5363638101ddafd1a1829c7da0a

There is only one fdbmonitor instance with fdbserver children. The configuration is the one mentioned in the first message.

EDIT 1
Also, grep -Rnw 'Severity="40"' /var/log/foundationdb does not yield anything.

I am still puzzled.

Thank you very much,
Marc

Does killing the process and letting fdbmonitor restart it change anything?

There are a couple of places we could look for more clues. One is the full status json document (obtained by running status json in fdbcli), although if the process isn’t present in the cluster, this may not yield much more useful information. The other is the trace logs, in particular for the process that seems to be running but isn’t present (the filename of the trace file should include the port number). You’ve determined that there aren’t any errors, but it may be worth looking through the logs for any other indication that something isn’t going as expected. If you wanted to post either of those here, I could take a look as well.
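
For instance, a quick sketch for inspecting the status document (assumes jq is installed):

    fdbcli --exec 'status json' > status.json

    # List the network address of every process the cluster reports;
    # the .cluster.processes map is keyed by an internal process ID
    jq -r '.cluster.processes[].address' status.json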

In my case, killing fdbmonitor and letting it restart picked up the missing process and started it.

@ajbeamon - The process is present; it shows up as a coordinator. Unlike the other coordinator, though, it does not seem to contribute towards the FoundationDB processes count.

I tried killing the targeted fdbserver process; it came back up but still does not show up under process performance details.

@Evan @ajbeamon -

Interestingly enough, the first process of the first machine (172.17.0.2:4500) disappears when the second machine joins the cluster. Could it have been promoted to another role somehow?

I tried setting the second process of the second machine as the coordinator, and this time the first process of the second machine does not show up. It has logs, and they look normal. The missing process also does not appear in the status json (search for 172.17.0.3:4500). Funny enough, the coordinator process (172.17.0.2:4501) does show up in process performance details this time.

Here are the logs and status


All processes have the DeviceNotFound event. From looking at the code, I assume that’s because of a missing Linux capability (on the container) and that it should not be related to the issue I’m currently experiencing. One possibly useful piece of information: I’m using the downloadable deb package. I can try building from master tomorrow and see if the behavior still arises.

If you have any more ideas based on this, please let me know!

Thanks,
Marc

EDIT 1:
Same result with master (6.0.0-0INTERNAL)

Solved:

Okay, installing the Debian package seems to run some kind of initialization which creates a data directory that appears to be bound to a particular cluster ID.

I was providing a cluster file with a different cluster ID, and one of my workers had the same worker ID (4500) as the install-time-created one.

Fix: trash the data directory created when installing the package. (I could probably have used a different worker ID instead.)
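
For anyone hitting the same thing, the workaround looked roughly like this (the datadir path assumes the deb package default of /var/lib/foundationdb/data/$ID; adjust to your configuration):

    # Stop the install-time server, remove its data directory, then restart
    sudo service foundationdb stop
    sudo rm -r /var/lib/foundationdb/data/4500
    sudo service foundationdb start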

  • Should fdbserver (and maybe fdbmonitor) have stopped (or crashed) under such circumstances to let me know this misconfiguration was happening? Did I miss something?
  • Should installing the package start an fdbserver in order to initialize a data directory, or should the user be tasked with running fdbcli --exec configure new ...?

Thank you very much for your valuable time!

Regards,
Marc

I don’t know of any reason why the data files would prevent you from joining a different cluster, unless for some reason they couldn’t be recovered (a process won’t join the cluster until it has recovered its data files). It should be noted, though, that a process with data files for one cluster will delete those files when it joins a different cluster.

It may be worth raising an issue on GitHub to try to reproduce and investigate what’s going on here.

Are you by chance using the exact same installed data files for both of your machines? I was able to reproduce this behavior by duplicating a process’s processId file to other processes in a cluster. If your machines are coming from some image with the same processId file created by the installer for the process at port 4500, then the cluster controller will essentially shut out all but one of those processes from the cluster. However, the other processes will keep running, so they can fulfill their role as coordinator if necessary.

If you are running into the situation I described, then I think deleting the processId file should be sufficient to solve the problem. When your process starts up, it will create a new one to replace it. However, it’s also fine to delete the entire data directory in your image if that makes sense for your use case.
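
A minimal sketch of that fix, again assuming the deb-package default data directory layout:

    # Remove only the duplicated processId file; fdbserver writes a fresh,
    # unique one the next time it starts
    sudo rm /var/lib/foundationdb/data/4500/processId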

I agree with the view of @marctrem:

“Should installing the package start an fdbserver in order to initialize a data directory, or should the user be tasked with running fdbcli --exec configure new …”

IMHO it would be nice to let the user choose during installation whether a default server should be configured (maybe switched off via an environment variable).

That would make things easier for container environments (based on my current experience with Kubernetes).

Just my 2 cents

Christoph

By the way, FoundationDB 3.x was able to form a cluster using nodes built from the same image. I used to have scripts (packer + terraform) that did exactly that. Something has probably changed between 3.x and 5.x that altered this behaviour.

That’s right. In 4.0, a feature was added to support setting machine classes through fdbcli. We wanted this machine class to follow the process around even if its data files were moved, so we renamed the existing empty .fdb-lock file and put the process ID in it.

@ajbeamon - Indeed!

The original data folder was created upon package installation, at container build time, so all containers had the same data folder for the process at *:4500.

That’s a clever reason! Thank you for the explanation; it makes sense.
Should a process that is “muted” in this way be shown as such in status details?

Thank you very much and congrats on the cool work!

Regards,
Marc

@ajbeamon - thank you for the explanation!