Fdbmonitor sends HUP to parent process [bug]

jehiah · June 25, 2018, 3:52pm

When fdbmonitor handles signals, it seems to send a HUP to its parent process. This is unexpected (to me) and a problem for the infrastructure I use to manage long-running processes.

P.S. It’s a little unclear where best to file bug reports, so just let me know if this is better filed on GitHub.

Steps to reproduce:

1. Run fdbmonitor from a bash script that logs HUP signals.
2. Send a TERM signal to fdbmonitor.
3. Expect no signal to reach the parent process, but see a HUP logged.

$ uname -a
Linux hostname 4.15.3-1.el7.elrepo.x86_64 #1 SMP Mon Feb 12 06:46:25 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core)

catch_signals.sh

#!/bin/bash
echo "$0 PID $$"
cleanup_exit() {
   echo "exiting"
}
cleanup_hup() {
   echo "got HUP"
}
trap cleanup_exit EXIT
trap cleanup_hup HUP
echo "running $*"
# Quote "$@" so arguments containing spaces are passed through intact
"$@"
echo "got exit code $?"

logs

~/catch_signals.sh /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
/root/catch_signals.sh PID 23509
running /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
Time="1529940896.120814" Severity="10" LogGroup="default" Process="fdbmonitor": Started FoundationDB Process Monitor 5.1 (v5.1.5)
Time="1529940896.121028" Severity="10" LogGroup="default" Process="fdbmonitor": Watching conf file /data/fdbdata/conf/foundationdb.conf
Time="1529940896.121039" Severity="10" LogGroup="default" Process="fdbmonitor": Watching conf dir /data/fdbdata/conf (2)
Time="1529940896.121051" Severity="10" LogGroup="default" Process="fdbmonitor": Loading configuration /data/fdbdata/conf/foundationdb.conf
Time="1529940896.121435" Severity="10" LogGroup="default" Process="fdbmonitor": Starting backup_agent.1
Time="1529940896.121587" Severity="10" LogGroup="default" Process="fdbmonitor": Starting fdbserver.4700
Time="1529940896.122254" Severity="10" LogGroup="default" Process="fdbserver.4700": Launching /usr/sbin/fdbserver (23512) for fdbserver.4700
Time="1529940896.122272" Severity="10" LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/backup_agent/backup_agent (23511) for backup_agent.1
Time="1529940896.159089" Severity="10" LogGroup="default" Process="fdbserver.4700": FDBD joined cluster.
Time="1529940913.173416" Severity="20" LogGroup="default" Process="fdbmonitor": Received signal 15 (Terminated), shutting down
got HUP
got exit code 0
exiting

In a separate session, show the process tree and send a TERM to fdbmonitor:

$ pstree -ap 23509
catch_signals.s,23509 /root/catch_signals.sh /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
  └─fdbmonitor,23510 --conffile /data/fdbdata/conf/foundationdb.conf
      ├─backup_agent,23511 --cluster_file /data/fdbdata/conf/fdb.cluster --logdir /log/fdb
      │   ├─{backup_agent},23517
      │   └─{backup_agent},23518
      └─fdbserver,23512 --cluster_file /data/fdbdata/conf/fdb.cluster --datadir /data/fdb/4700 --listen_address 0.0.0.0:4700 --logdir /log/fdb --public_address ...
          ├─{fdbserver},23513
          ├─{fdbserver},23514
          ├─{fdbserver},23515
          └─{fdbserver},23516
$ kill 23510

ajbeamon · June 25, 2018, 7:18pm

I think this is happening because we send SIGHUP to every process in fdbmonitor’s process group when it dies (see https://github.com/apple/foundationdb/blob/5c9ef7763afaeb8dc467f4ae276532182a31d3c6/fdbmonitor/fdbmonitor.cpp#L1371).

The stated intent of this line is to send SIGHUP to each child process, but I don’t see any manipulations of the process group that would limit the scope of this signal to only the children. Is it the case that your signal catcher shares the same process group with fdbmonitor?
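
One way to check this is to compare process group IDs with ps. The following is a minimal sketch, assuming a ps that supports `-o pgid=` (procps on Linux does); the helper names `pgid_of` and `same_group` are made up for illustration and are not part of FoundationDB.

```shell
#!/bin/bash
# Print the process group ID (PGID) of the given PID, with padding stripped.
# Prints nothing if the PID does not exist.
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

# Succeed only if both PIDs exist and report the same PGID.
same_group() {
    local a b
    a=$(pgid_of "$1")
    b=$(pgid_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage with the PIDs from the pstree output in the report
# (23509 = catch_signals.sh, 23510 = fdbmonitor):
#   same_group 23509 23510 && echo "shared process group"
```

If the two PIDs share a group, a signal sent to the whole group would also reach the wrapper script.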

I’ve filed the following issue in GitHub to address this: https://github.com/apple/foundationdb/issues/529

jehiah · June 25, 2018, 8:04pm

Thank you.

That is my issue; all the affected processes (including parent) share the same process group ID.

ajbeamon · June 25, 2018, 8:34pm

It looks like we are creating a process group for fdbmonitor, but only if using the --daemonize argument. Presumably you aren’t using it? I’ll update the issue to note this as well, and we’ll try to get it fixed.

jehiah · June 25, 2018, 8:43pm

Correct. I am not using --daemonize. Knowing this is a process group issue was very helpful, as I’m able to work around it by setting a process group explicitly using pgrphack.
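
For setups without pgrphack, a similar effect can probably be had with `setsid` from util-linux, which runs a command in a new session and therefore in a new process group. This is a sketch of that alternative, not the pgrphack setup described above; `launch_in_new_group` and `LAUNCHED_PID` are hypothetical names used only here.

```shell
#!/bin/bash
# Start a command in a new session (and so a new process group), so that a
# signal the command sends to its own process group does not reach this script.
# LAUNCHED_PID is a hypothetical variable name used only in this sketch.
launch_in_new_group() {
    setsid "$@" &
    LAUNCHED_PID=$!
}

# Usage with the command from the report (not run here):
#   launch_in_new_group /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
#   wait "$LAUNCHED_PID"
#   echo "got exit code $?"
```

This relies on the script being non-interactive, so the backgrounded process is not already a process group leader. If it were, `setsid` would fork and `$!` would no longer be the command's PID.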

SteavedHams · November 12, 2018, 9:21pm

@jehiah

This has been fixed in https://github.com/apple/foundationdb/pull/826
