How are 'contributing_workers' computed?

marctrem · May 7, 2018, 12:53am

Hello FDB friends,

I’d like to understand better what’s going on here: https://gist.github.com/marctrem/0287203daf2dc59099282be210c7c14c

I have two machines set up with 4 processes each. Why does fdbcli --exec status details only report 7 processes?
-or-
Why does the process 172.17.0.2:4500 not show up?

The cluster was configured using fdbcli --exec configure new double ssd

This is just a test configuration to play around. I know it is sub-optimal.

Thank you very much,
Marc

Evan · May 7, 2018, 3:06am

The most likely explanation is that the result is temporary, and if you do a status again the process will show up.

Processes do not register themselves with the cluster until after they have recovered their files off disk, so immediately after rebooting a cluster status may not have information from all processes.

As a side note, there is almost no reason to run an even number of coordinators. By having two coordinators if either of them fail your cluster will be down, which is strictly worse than just having one coordinator.

marctrem · May 7, 2018, 3:12am

Hi @Evan,

Thank you for replying!

It’s been an hour and it still does not show up! Also, this one is a freshly instantiated cluster. (Has not been through any kind of failure)

That’s part of what I meant by saying my configuration was sub-optimal

Would you have any other guess as for why this might happen?

Thank you very much,
Marc

Evan · May 7, 2018, 4:43am

My guess is that you have a different fdbsever process running on that port. Maybe a default install of the server is still running. You may find more clues if you grep the trace files for Severity 40 trace messages.

ajbeamon · May 7, 2018, 3:49pm

Specifically, the thing to check for is that the fdbserver process running on port 4500 is using the correct cluster file. If it’s using a different cluster file, then it’s actually part of a different database.

spullara · May 7, 2018, 5:45pm

I am similarly seeing processes missing that are configured in the foundationdb.conf file, even after waiting. Here is a status details and a conf file that show this behavior:

    fdb> status details

Using cluster file `/usr/local/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1
  Desired Logs           - 1

Cluster:
  FoundationDB processes - 2
  Machines               - 2
  Memory availability    - 4.4 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 05/07/18 10:40:45

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.002 GB
  Sum of key-value sizes - 78 MB
  Disk space used        - 173 MB

Operating space:
  Storage server         - 80.0 GB free on most full server
  Log server             - 80.0 GB free on most full server

Workload:
  Read rate              - 21 Hz
  Write rate             - 3 Hz
  Transactions started   - 6 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  127.0.0.1:4689         (  8% cpu; 31% machine; 0.000 Gbps;  0% disk IO; 6.3 GB / 6.5 GB RAM  )
  127.0.0.1:4691         ( 12% cpu; 34% machine; 0.000 Gbps;  0% disk IO; 4.2 GB / 4.4 GB RAM  )

Coordination servers:
  127.0.0.1:4689  (reachable)

Client time: 05/07/18 10:40:45

fdb> exclude
There are currently no servers excluded from the database.
To learn how to exclude a server, type `help exclude'.
fdb>

foundationdb.conf:

$ cat /usr/local/etc/foundationdb/foundationdb.conf
## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://www.foundationdb.org/documentation/configuration.html#foundationdb-conf

[general]
restart_delay = 60
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
# initial_restart_delay = 0
# restart_backoff = 60
# restart_delay_reset_interval = 60
cluster_file = /usr/local/etc/foundationdb/fdb.cluster
# kill_on_configuration_change = true

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/local/libexec/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /usr/local/foundationdb/data/$ID
logdir = /usr/local/foundationdb/logs
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id = 
# datacenter_id = 
# class = 
# storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =
memory = 3GiB

## An individual fdbserver process with id 4689
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4689]
[fdbserver.4690]
[fdbserver.4691]

[backup_agent]
command = /usr/local/foundationdb/backup_agent/backup_agent
logdir = /usr/local/foundationdb/logs

[backup_agent.1]

There is nothing listening on 4690 and no log files show up.

ajbeamon · May 7, 2018, 6:02pm

In your case, it seems like there isn’t a process running at all. If you don’t see any trace logs for the process, then I would check the fdbmonitor logs (stored in syslog) for any details about why that process may not be starting.

marctrem · May 7, 2018, 6:04pm

@Evan @ajbeamon -

I only have one cluster configuration file on the system and here is the process tree: https://gist.github.com/marctrem/5bd9a5363638101ddafd1a1829c7da0a

There is only one instance of fdbmonitor having fdbserver children. The configuration is the one mentioned in the first message.

EDIT 1
Also, grep -Rnw 'Severity="40"' /var/log/foundationdb does not yield anything.

I am still puzzled.

Thank you very much,
Marc

ajbeamon · May 7, 2018, 6:47pm

Does killing the process and letting fdbmonitor restart it change anything?

There are a couple places we could look for more clues. One would be the full status json document (obtained by running status json in fdbcli), although if the process isn’t present in the cluster this may not yield much more that’s useful. The other place is in the trace logs, in particular for the process that seems to be running but isn’t present (the filename of the trace file should include the port number). You’ve determined that there aren’t any errors, but it may be worth looking through it for any other indications that something isn’t going as expected. If you wanted to post either of those here, I could take a look as well.

spullara · May 7, 2018, 7:37pm

In my case, killing fdbmonitor and it restarting picked up the process and started it.

marctrem · May 7, 2018, 9:45pm

@ajbeamon - The process is present; It shows up as a coordinator. Unlike the other coordinator, this one does not seem to contribute towards the FoundationDB processes count.

I tried killing the targeted fdbserver process, it came back up but still does not show up under process performance details

gist.github.com

https://gist.github.com/marctrem/0287203daf2dc59099282be210c7c14c#file-status-details-L47

fdb.cluster

mycluster:0pamNV2u@172.17.0.2:4500,172.17.0.3:4500

foundationdb.conf

## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://www.foundationdb.org/documentation/configuration.html#foundationdb-conf

[fdbmonitor]
user = foundationdb
group = foundationdb

This file has been truncated. show original

status details

docker exec -ti f422ea2d99ca fdbcli --exec 'status details'   
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 2

Cluster:
  FoundationDB processes - 7

This file has been truncated. show original

marctrem · May 8, 2018, 6:33am

@Evan @ajbeamon -

Interestingly enough, the first process of the first machine (172.17.0.2:4500) disappears when the second machine joins the cluster. Could it get somewhat promoted in another role?

I tried setting the second process of the second machine as the coordinator and this time, the first process of the second machine does not show up. It has logs and they look normal. The missing process also does not show up in the status json. (search for 172.17.0.3:4500) Funny enough, the coordinator process (172.17.0.2:4501) shows up in process performance this time.

Here are the logs and status

gist.github.com

https://gist.github.com/marctrem/cad6c4c0cf29d1468308cc5409d308ad

fdbcli status

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1

Cluster:

This file has been truncated. show original

status.json

{
  "client": {
    "cluster_file": {
      "path": "/etc/foundationdb/fdb.cluster",
      "up_to_date": true
    },
    "coordinators": {
      "coordinators": [
        {
          "address": "172.17.0.3:4501",

This file has been truncated. show original

trace.172.017.000.003.4500.1525729.xml

cat /var/log/foundationdb/trace.172.017.000.003.4500.1525729
.1525729837.Bh9rVj.1.xml  -n 20 /var/log/foundationdb/trace.172.017.000.003.4500 
<Event Severity="10" Time="1525730337.912112" Type="MachineMetrics" Machine="172.17.0.3:4500" ID="0000000000000000" Elapsed="5.00008" MbpsSent="0.325187" MbpsReceived="0.44504" OutSegs="4464" RetransSegs="0" CPUSeconds="0.25851" TotalMemory="16754233344" CommittedMemory="789528576" AvailableMemory="15964704768" ZoneID="ba5a49ef4a9c47ca9828c62d8ed6a599" MachineID="ba5a49ef4a9c47ca9828c62d8ed6a599" logGroup="default" TrackLatestType="Original"/>
<Event Severity="30" Time="1525730342.912197" Type="DeviceNotFound" Machine="172.17.0.3:4500" ID="0000000000000000" Directory="/var/lib/foundationdb/data/4500" logGroup="default"/>
<Event Severity="10" Time="1525730342.912197" Type="MachineLoadDetail" Machine="172.17.0.3:4500" ID="0000000000000000" User="1147775" Nice="14147" System="1397379" Idle="157745474" IOWait="808977" IRQ="170101" SoftIRQ="436455" Steal="0" Guest="0" logGroup="default"/>
<Event Severity="10" Time="1525730342.912197" Type="ProcessMetrics" Machine="172.17.0.3:4500" ID="0000000000000000" Elapsed="5.00009" CPUSeconds="0.025055" MainThreadCPUSeconds="0.024186" UptimeSeconds="505.087" Memory="327540736" ResidentMemory="18329600" MbpsSent="0.00639189" MbpsReceived="0.00543191" DiskTotalBytes="244529823744" DiskFreeBytes="165779386368" DiskQueueDepth="0" DiskIdleSeconds="5.00009" DiskReads="0" DiskWrites="0" DiskReadsCount="0" DiskWritesCount="0" DiskWriteSectors="0" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" ZoneID="ba5a49ef4a9c47ca9828c62d8ed6a599" MachineID="ba5a49ef4a9c47ca9828c62d8ed6a599" AIO_SubmitCount="0" AIO_CollectCount="0" AIO_SubmitLagMS="0" AIO_DiskStallMS="0" N2_CantSleep="0" N2_WontSleep="0" N2_Yields="1" N2_YieldCalls="0" N2_YieldCallsTrue="0" N2_SlowTaskSignals="0" N2_YieldBigStack="0" N2_RunLoopIterations="374" N2_TimersExecuted="230" N2_TasksExecuted="375" N2_ASIOEventsProcessed="453" N2_ReadCalls="80" N2_WriteCalls="80" N2_ReadProbes="80" N2_WriteProbes="0" N2_PacketsRead="80" N2_PacketsGenerated="80" N2_WouldBlock="0" CurrentConnections="3" ConnectionsEstablished="0" ConnectionsClosed="0" ConnectionErrors="0" N2_SlowTask4M="1" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1525730342.912197" Type="MachineMetrics" Machine="172.17.0.3:4500" ID="0000000000000000" Elapsed="5.00009" MbpsSent="0.324918" MbpsReceived="0.446426" OutSegs="4403" RetransSegs="0" CPUSeconds="0.259745" TotalMemory="16754233344" CommittedMemory="789774336" AvailableMemory="15964459008" ZoneID="ba5a49ef4a9c47ca9828c62d8ed6a599" MachineID="ba5a49ef4a9c47ca9828c62d8ed6a599" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1525730344.365825" Type="SomewhatSlowRunLoopTop" Machine="172.17.0.3:4500" ID="0000000000000000" Elapsed="0.0265214" logGroup="default"/>
<Event Severity="30" Time="1525730347.912281" Type="DeviceNotFound" Machine="172.17.0.3:4500" ID="0000000000000000" Directory="/var/lib/foundationdb/data/4500" logGroup="default"/>
<Event Severity="10" Time="1525730347.912281" Type="MachineLoadDetail" Machine="172.17.0.3:4500" ID="0000000000000000" User="1147827" Nice="14147" System="1397457" Idle="157748264" IOWait="809025" IRQ="170110" SoftIRQ="436480" Steal="0" Guest="0" logGroup="default"/>

This file has been truncated. show original

All processes have the DeviceNotFound event. From looking at the code, I assume that’s because of a missing Linux capability (on the container) and that it should not be related to the issue I’m currently experiencing. Oh, a maybe useful piece of information: I’m using the downloadable deb package. I can try from master tomorrow and see if the behavior still arises.

If you have any more ideas based on this, please let me know!

Thanks,
Marc

EDIT 1:
Same result with master (6.0.0-0INTERNAL)

marctrem · May 8, 2018, 7:11am

Solved:

Okay, installing the Debian package seems to run some kind of initialization which creates a data directory that seems to be bound to a particular cluster id.

I was providing a cluster file with a different cluster ID and one of my worker had the same worker id (4500) as the install-time-created one.

Fix: trash the data directory created when installing the package. (Could probably have used a different worker id)

Should fdbserver (and maybe fdbmonitor) have stopped (or crashed) under such circumstances to let me know that this wrongdoing was happening? Did I miss something?
Should installing the package start an fdbserver in order to initialize a data or should the user be tasked with running fdbcli --exec configure new ...?

Thank you very much for your valuable time!

Regards,
Marc

ajbeamon · May 8, 2018, 8:20pm

I don’t know of any reason why the data files would prevent you from joining a different cluster, unless for some reason they couldn’t be recovered (a process won’t join the cluster until it’s recovered its data files). It should be noted that a process with data files for one cluster will delete those files when it joins a different cluster, though.

It may be worth raising an issue in GitHub to try to reproduce and investigate what’s going on here.

ajbeamon · May 11, 2018, 5:27pm

Are you by chance using the exact same installed data files for both of your machines? I was able to reproduce this behavior by duplicating a process’s processId file to other processes in a cluster. If your machines are coming from some image with the same processId file created by the installer for the process at port 4500, then the cluster controller will essentially shut out all but one of those processes from the cluster. However, the other processes will keep running, so they can fulfill their role as coordinator if necessary.

If you are running into the situation I described, then I think deleting the processId file should be sufficient to solve the problem. When your process starts up, it will create a new one to replace it. However, it’s also fine to delete all of the data directory in your image if that makes sense for your use-case.

Chr1st0ph · May 11, 2018, 7:18pm

I agree with the view of @marctrem :

“Should installing the package start an fdbserver in order to initialize a data or should the user be tasked with running fdbcli --exec configure new …”

Imho it would be nice to let the user choose during installation whether a default server should be configured ( maybe switched off via an environment variable ).

That would make things more easy for container environments ( according to my current experience with Kubernetes ).

Just my 2 cents

Christoph

abdullin · May 11, 2018, 8:01pm

By the way, FoundationDB 3.x was able to form a cluster using nodes from the same image. I used to have scripts (packer+terraform) that did exactly that. Something has probably changed between 3.x and 5.x that altered this behaviour.

ajbeamon · May 11, 2018, 8:22pm

That’s right. In 4.0, a feature was added to support setting machine classes through fdbcli. It was desired that this machine class follow the process around even if its data files were moved, so we renamed the existing empty .fdb-lock file and put the process ID in it.

marctrem · May 12, 2018, 5:02am

@ajbeamon - Indeed!

The original data folder was created upon package installation, at container build time. So all containers had the same data folder for process *:4500.

That’s a clever reason! Thank you for the explanation, it is sensible.
Should the process being “muted” appear as so in the status details?

Thank you very much and congrats on the cool work!

Regards,
Marc

abdullin · May 12, 2018, 7:19am

@ajbeamon - thank you for the explanation!

Topic		Replies	Views
Only seven processes are online? Using FoundationDB	4	581	August 1, 2019
'Locking coordination' state after process removal Using FoundationDB	7	2089	July 11, 2019
30 server cluster just died Using FoundationDB	7	725	June 6, 2021
Coordinator-only process Using FoundationDB	2	643	October 20, 2018
Not seeing preferred number of Tx processes Running FoundationDB	4	470	August 28, 2020

How are 'contributing_workers' computed?

Related topics