FoundationDB

Fdbmonitor sends HUP to parent process [bug]


(Jehiah) #1

When fdbmonitor handles signals, it seems to send a HUP to its parent process. This is unexpected (to me) and a problem for the infrastructure I use to manage long-running processes.

P.S. It’s a little unclear where best to tag bug reports, so just let me know if this is better filed on GitHub.


Steps to reproduce:

  • run fdbmonitor in a bash script that logs HUP signals
  • send a TERM signal to fdbmonitor
  • expect no signal to the parent process, but see a HUP logged
$ uname -a
Linux hostname 4.15.3-1.el7.elrepo.x86_64 #1 SMP Mon Feb 12 06:46:25 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core) 

catch_signals.sh

#!/bin/bash
echo "$0 PID $$"
cleanup_exit() {
   echo "exiting"
}
cleanup_hup() {
   echo "got HUP"
}
trap cleanup_exit EXIT
trap cleanup_hup HUP
echo "running $*"
# quote "$@" so arguments containing spaces are passed through intact
"$@"
echo "got exit code $?"

logs

~/catch_signals.sh /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
/root/catch_signals.sh PID 23509
running /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
Time="1529940896.120814" Severity="10" LogGroup="default" Process="fdbmonitor": Started FoundationDB Process Monitor 5.1 (v5.1.5)
Time="1529940896.121028" Severity="10" LogGroup="default" Process="fdbmonitor": Watching conf file /data/fdbdata/conf/foundationdb.conf
Time="1529940896.121039" Severity="10" LogGroup="default" Process="fdbmonitor": Watching conf dir /data/fdbdata/conf (2)
Time="1529940896.121051" Severity="10" LogGroup="default" Process="fdbmonitor": Loading configuration /data/fdbdata/conf/foundationdb.conf
Time="1529940896.121435" Severity="10" LogGroup="default" Process="fdbmonitor": Starting backup_agent.1
Time="1529940896.121587" Severity="10" LogGroup="default" Process="fdbmonitor": Starting fdbserver.4700
Time="1529940896.122254" Severity="10" LogGroup="default" Process="fdbserver.4700": Launching /usr/sbin/fdbserver (23512) for fdbserver.4700
Time="1529940896.122272" Severity="10" LogGroup="default" Process="backup_agent.1": Launching /usr/lib/foundationdb/backup_agent/backup_agent (23511) for backup_agent.1
Time="1529940896.159089" Severity="10" LogGroup="default" Process="fdbserver.4700": FDBD joined cluster.
Time="1529940913.173416" Severity="20" LogGroup="default" Process="fdbmonitor": Received signal 15 (Terminated), shutting down
got HUP
got exit code 0
exiting

In a separate session, show the process tree and send a TERM to fdbmonitor:

$ pstree -ap 23509
catch_signals.s,23509 /root/catch_signals.sh /usr/lib/foundationdb/fdbmonitor --conffile /data/fdbdata/conf/foundationdb.conf
  └─fdbmonitor,23510 --conffile /data/fdbdata/conf/foundationdb.conf
      ├─backup_agent,23511 --cluster_file /data/fdbdata/conf/fdb.cluster --logdir /log/fdb
      │   ├─{backup_agent},23517
      │   └─{backup_agent},23518
      └─fdbserver,23512 --cluster_file /data/fdbdata/conf/fdb.cluster --datadir /data/fdb/4700 --listen_address 0.0.0.0:4700 --logdir /log/fdb --public_address ...
          ├─{fdbserver},23513
          ├─{fdbserver},23514
          ├─{fdbserver},23515
          └─{fdbserver},23516
$ kill 23510

(A.J. Beamon) #2

I think this is happening because we send SIGHUP to every process in fdbmonitor’s process group when it dies (see https://github.com/apple/foundationdb/blob/5c9ef7763afaeb8dc467f4ae276532182a31d3c6/fdbmonitor/fdbmonitor.cpp#L1371).

The stated intent of this line is to send SIGHUP to each child process, but I don’t see any manipulations of the process group that would limit the scope of this signal to only the children. Is it the case that your signal catcher shares the same process group with fdbmonitor?
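One way to answer the process group question is to compare the process group IDs directly. The following is a minimal sketch, assuming a Linux host with procps `ps`; the helper names `pgid_of` and `same_pgid` are hypothetical and not part of FoundationDB.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of fdbmonitor.

# Print the process group ID of the given PID (empty if the PID does not exist).
pgid_of() {
  ps -o pgid= -p "$1" | tr -d ' '
}

# Succeed only if both PIDs exist and share one process group.
same_pgid() {
  local a b
  a=$(pgid_of "$1")
  b=$(pgid_of "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

With the PIDs from the process tree above, the check would be `same_pgid 23509 23510 && echo "shared process group"`. A signal sent to the whole group reaches every process that passes this check, including the wrapper script.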

I’ve filed the following issue in GitHub to address this: https://github.com/apple/foundationdb/issues/529


(Jehiah) #3

Thank you.

That is my issue; all the affected processes (including parent) share the same process group ID.


(A.J. Beamon) #4

It looks like we are creating a process group for fdbmonitor, but only if using the --daemonize argument. Presumably you aren’t using it? I’ll update the issue to note this as well, and we’ll try to get it fixed.


(Jehiah) #5

Correct, I am not using --daemonize. Knowing this is a process group issue was very helpful, as I’m able to work around it by setting a process group explicitly using pgrphack.
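For anyone who wants to see the effect in isolation, here is a rough sketch of the behavior and of the workaround. It does not use fdbmonitor or pgrphack: the child is a stand-in that signals its own process group with `kill -HUP 0`, and isolation is done with util-linux `setsid`, which starts a whole new session (and therefore a new process group) rather than only a new group. The function name `wrapper_saw_hup` and the `LAUNCHER` variable are made up for this example.

```shell
#!/bin/bash
# Hypothetical demo; the child below only imitates "signal my whole process group".

# Start a wrapper shell that traps HUP, run a child that sends HUP to its own
# process group, and print "yes" or "no" depending on whether the wrapper was hit.
# $1 is the launcher for the child: empty (child shares the wrapper's group)
# or "setsid" (child runs in its own session and process group).
wrapper_saw_hup() {
  # The outer setsid keeps the experiment out of the caller's process group.
  LAUNCHER="$1" setsid bash -c '
    saw=no
    trap "saw=yes" HUP
    $LAUNCHER bash -c "kill -HUP 0"
    echo "$saw"
  '
}

echo "child shares the group:  $(wrapper_saw_hup '')"
echo "child in its own group:  $(wrapper_saw_hup setsid)"
```

On a Linux box this should report `yes` for the shared case and `no` for the isolated one. The wrapper shell may also print a "Hangup" notice on stderr when the child terminates itself. Launching fdbmonitor through a tool that gives it its own process group, as pgrphack does, relies on the same separation.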


(Steve Atherton) #6

@jehiah

This has been fixed in