Processes OOM, fdbmonitor doesn't restart them

(This may or may not be related to the other posts, but it's also slightly different, so I'm starting a new post.)

Occasionally a process will OOM and appear to die. The process is no longer in the cluster, but fdbmonitor doesn't restart it, presumably because Linux still sees the process as running.

Obviously, I'd like an OOM'd process to actually die so that fdbmonitor can restart it. Short of that, we have a utility that compares active processes against what's in foundationdb.conf, but that's sort of a hack (see the sketch below).
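For what it's worth, here's a minimal sketch of that kind of watchdog. The paths, the address detection, and the matching are all assumptions; adapt to your layout:

#!/usr/bin/env bash
# Hypothetical watchdog sketch: report fdbserver ports that are configured
# in foundationdb.conf but missing from the cluster's 'status details'.
CONF=/etc/foundationdb/foundationdb.conf   # assumed conf path
CLUSTER=/etc/foundationdb/fdb.cluster      # assumed cluster file
ADDR=$(hostname -I | awk '{print $1}')     # assumed: first address is the public one

# Ports configured on this host, from the [fdbserver.PORT] section headers.
configured=$(sed -n 's/^\[fdbserver\.\([0-9]\+\)\].*/\1/p' "$CONF")

# Ports this host is currently contributing to the cluster, per fdbcli.
active=$(fdbcli -C "$CLUSTER" --exec 'status details' | grep -oP "(?<=$ADDR:)[0-9]+")

for port in $configured; do
    grep -qx "$port" <<<"$active" || echo "fdbserver.$port is configured but not in the cluster"
done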

This happens with both the memory and ssd storage engines.

Thoughts?

ubuntu@symphony-2b-db3-i-0b4a1fb4377dd609f:~$ grep 4606 /var/log/syslog*
/var/log/syslog.1:Dec 21 21:15:31 localhost fdbmonitor[8697]: LogGroup="default" Process="fdbserver.4606": ERROR: Out of memory
ubuntu@symphony-2b-db3-i-0b4a1fb4377dd609f:~$ fdbcli --exec 'status details' -C /etc/foundationdb/memory.cluster | grep 10.24.5.75
  10.24.5.75:4600        ( 27% cpu; 17% machine; 0.352 Gbps;  2% disk IO;20.5 GB / 15.0 GB RAM  )
  10.24.5.75:4601        ( 46% cpu; 17% machine; 0.352 Gbps;  4% disk IO;12.7 GB / 15.0 GB RAM  )
  10.24.5.75:4602        ( 54% cpu; 17% machine; 0.352 Gbps;  9% disk IO;13.1 GB / 15.0 GB RAM  )
  10.24.5.75:4603        ( 25% cpu; 17% machine; 0.352 Gbps;  1% disk IO;20.2 GB / 15.0 GB RAM  )
  10.24.5.75:4604        ( 22% cpu; 17% machine; 0.352 Gbps;  2% disk IO;20.8 GB / 15.0 GB RAM  )
  10.24.5.75:4605        ( 22% cpu; 17% machine; 0.352 Gbps;  4% disk IO;15.1 GB / 15.0 GB RAM  )
  10.24.5.75:4607        ( 24% cpu; 17% machine; 0.352 Gbps;  2% disk IO;20.1 GB / 15.0 GB RAM  )
  10.24.5.75:4650        (  0% cpu; 17% machine; 0.352 Gbps;  2% disk IO; 0.3 GB / 15.0 GB RAM  )
  10.24.5.75:4651        ( 28% cpu; 17% machine; 0.352 Gbps;  4% disk IO; 1.0 GB / 15.0 GB RAM  )
  10.24.5.75:4652        ( 27% cpu; 17% machine; 0.352 Gbps;  9% disk IO; 0.8 GB / 15.0 GB RAM  )
  10.24.5.75:4653        ( 24% cpu; 17% machine; 0.352 Gbps;  2% disk IO; 0.7 GB / 15.0 GB RAM  )
foundat+  92880   8697 42 Oct06 ?        32-21:13:00 /usr/sbin/fdbserver --cluster_file /etc/foundationdb/memory.cluster --datadir /mnt/fdb/4606 --knob_max_shard_bytes 100000000 --listen_address public --logdir /mnt/logs/fdb --memory 22GiB --public_address auto:4606 --storage_memory 14GiB

I caught one today and was able to grab some strace output. The process on port 4653 fell out of the cluster. It also stopped logging back in November (related to this). Nothing of note in syslog.
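One check that would have surfaced the dead logging sooner (assuming the default trace.*.xml naming under the configured logdir):

# GNU find: list trace logs not modified in the last hour, with timestamps.
find /mnt/logs/fdb -name 'trace.*.xml' -mmin +60 -printf '%TY-%Tm-%Td %TH:%TM %p\n'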

Process is defined as follows:

# Process port 4653
[fdbserver.4653]
command = /usr/bin/numactl -m 1 -N 1 /usr/sbin/fdbserver
cluster_file = /etc/foundationdb/memory3.cluster
class = transaction
storage_memory = 3GiB

Process exists in the process table:

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ ps -eaf | grep 4653
foundat+   2136   2054 35 Aug13 ?        46-03:18:08 /usr/sbin/fdbserver --class transaction --cluster_file /etc/foundationdb/memory3.cluster --datadir /mnt/fdb/4653 --knob_max_shard_bytes 100000000 --listen_address public --logdir /mnt/logs/fdb --public_address auto:4653 --storage_memory 3GiB

Process does not exist in the cluster:

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:/mnt/logs/fdb$ fdbcli -C /etc/foundationdb/memory3.cluster --exec 'status details' | grep 10.31.0.87
  10.31.0.87:4603        ( 33% cpu; 24% machine; 0.591 Gbps;  2% disk IO;13.2 GB / 41.0 GB RAM  )
  10.31.0.87:4607        ( 23% cpu; 24% machine; 0.591 Gbps;  2% disk IO;13.2 GB / 41.0 GB RAM  )
  10.31.0.87:4611        ( 22% cpu; 24% machine; 0.591 Gbps;  2% disk IO;13.0 GB / 41.0 GB RAM  )
  10.31.0.87:4615        ( 21% cpu; 24% machine; 0.591 Gbps;  1% disk IO;13.2 GB / 41.0 GB RAM  )
  10.31.0.87:4657        ( 28% cpu; 24% machine; 0.591 Gbps;  1% disk IO; 1.4 GB / 41.0 GB RAM  )

An strace of the seemingly dead process shows the main thread parked in a futex wait, waking only for SIGPROF:

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:/mnt/logs/fdb$ sudo strace -p 2136
strace: Process 2136 attached
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 2136 detached
 <detached ...>

Killing the process lets fdbmonitor restart it, and everything returns to normal:

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ /opt/wavefront/repo/tools/killFdbProcesses.sh 4653

Considering processes: 2136
<---------2136----------
DONE.
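(That script is internal to us; a rough generic equivalent, assuming the port appears in the fdbserver command line as shown above, is just:)

# Hypothetical one-liner: SIGTERM the fdbserver whose command line
# contains the given public_address port. The pattern match is crude.
sudo pkill -TERM -f -- '--public_address auto:4653'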

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ grep 4653 /var/log/syslog
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": Process 2136 terminated by signal 15, restarting in 0 seconds
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": Launching /usr/bin/numactl (129018) for fdbserver.4653
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": FDBD joined cluster.

ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ fdbcli -C /etc/foundationdb/memory3.cluster --exec 'status details' | grep 10.31.0.87
  10.31.0.87:4603        ( 24% cpu; 24% machine; 0.675 Gbps;  3% disk IO;13.2 GB / 35.5 GB RAM  )
  10.31.0.87:4607        ( 22% cpu; 24% machine; 0.675 Gbps;  2% disk IO;13.2 GB / 35.5 GB RAM  )
  10.31.0.87:4611        ( 22% cpu; 24% machine; 0.675 Gbps;  3% disk IO;13.0 GB / 35.5 GB RAM  )
  10.31.0.87:4615        ( 22% cpu; 24% machine; 0.675 Gbps;  2% disk IO;13.2 GB / 35.5 GB RAM  )
  10.31.0.87:4653        (  0% cpu; 24% machine; 0.675 Gbps;  3% disk IO; 0.2 GB / 35.5 GB RAM  )
  10.31.0.87:4657        ( 29% cpu; 24% machine; 0.675 Gbps;  2% disk IO; 1.4 GB / 35.5 GB RAM  )

FWIW, we also see a "sharp" increase in the process's memory footprint beforehand. In this case it was about a 2MB increase, and then the process just stops working.
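(A crude way to catch that kind of jump, e.g. from a cron entry; the log path is hypothetical:)

# Append a timestamped RSS (KB) snapshot of all fdbserver processes.
ps -C fdbserver -o pid=,rss=,args= | awk -v ts="$(date -Is)" '{print ts, $0}' >> /var/log/fdbserver-rss.log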

I think what has happened here is that your trace logging thread has failed, which can result in memory climbing as trace events are buffered up. Then, when the process runs out of memory, it attempts to shut down and blocks on a call to flush the trace file. This, of course, can never complete because the trace thread is dead.

I think there are a few things we should probably do differently here to improve the situation:

  1. Don’t block a shutdown because of the inability to flush the trace log.
  2. If we are going to allow a process to continue running after its trace thread has died, we should not run out of memory. Or, if we want the process to die in this case, we shouldn’t rely on running out of memory to do it.
  3. Provide better insight into why the trace thread has failed and more visibility into the fact that it has happened (e.g. by logging to stderr and/or reporting it through the status mechanism).

As to how to figure out what’s happening with your trace thread in this particular case, I’ll have that discussion on your other post.

I created an issue for the things I mentioned in my comment: https://github.com/apple/foundationdb/issues/2506.