Caught one today and was able to grab some strace output. Process 4653 fell out of the cluster. It also stopped logging back in November (related to this). Nothing of note in syslog.
Process is defined as follows:
# Process port 4653
[fdbserver.4653]
command = /usr/bin/numactl -m 1 -N 1 /usr/sbin/fdbserver
cluster_file = /etc/foundationdb/memory3.cluster
class = transaction
storage_memory = 3GiB
Process exists in the process table:
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ ps -eaf | grep 4653
foundat+ 2136 2054 35 Aug13 ? 46-03:18:08 /usr/sbin/fdbserver --class transaction --cluster_file /etc/foundationdb/memory3.cluster --datadir /mnt/fdb/4653 --knob_max_shard_bytes 100000000 --listen_address public --logdir /mnt/logs/fdb --public_address auto:4653 --storage_memory 3GiB
Process does not exist in the cluster (no 10.31.0.87:4653 entry in the status output):
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:/mnt/logs/fdb$ fdbcli -C /etc/foundationdb/memory3.cluster --exec 'status details' | grep 10.31.0.87
10.31.0.87:4603 ( 33% cpu; 24% machine; 0.591 Gbps; 2% disk IO;13.2 GB / 41.0 GB RAM )
10.31.0.87:4607 ( 23% cpu; 24% machine; 0.591 Gbps; 2% disk IO;13.2 GB / 41.0 GB RAM )
10.31.0.87:4611 ( 22% cpu; 24% machine; 0.591 Gbps; 2% disk IO;13.0 GB / 41.0 GB RAM )
10.31.0.87:4615 ( 21% cpu; 24% machine; 0.591 Gbps; 1% disk IO;13.2 GB / 41.0 GB RAM )
10.31.0.87:4657 ( 28% cpu; 24% machine; 0.591 Gbps; 1% disk IO; 1.4 GB / 41.0 GB RAM )
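For anyone wanting to catch this automatically, the manual check above (process alive in ps, missing from status) could be scripted roughly as follows. This is only a sketch: `missing_ports` is a hypothetical helper name, and the list of expected ports is passed in by hand rather than read from the fdbmonitor config.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FoundationDB): print every expected port
# on a host that has no "ip:port" entry in `status details` output (stdin).
# Usage: missing_ports <ip> <port>...
missing_ports() {
  local ip="$1"; shift
  local status port
  status="$(cat)"   # read the whole status output once
  for port in "$@"; do
    # Compare the first field exactly, so 4653 never matches e.g. 46530.
    if ! printf '%s\n' "$status" |
        awk -v addr="${ip}:${port}" '$1 == addr { found = 1 } END { exit !found }'; then
      echo "$port"
    fi
  done
}
```

It would be used along the lines of `fdbcli -C /etc/foundationdb/memory3.cluster --exec 'status details' | missing_ports 10.31.0.87 4603 4607 4611 4615 4653 4657`, and any port it prints is running locally (per the config) but absent from the cluster.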
strace of the seemingly dead process:
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:/mnt/logs/fdb$ sudo strace -p 2136
strace: Process 2136 attached
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=2136, si_uid=112} ---
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
futex(0x7ffde91d9188, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff^Cstrace: Process 2136 detached
<detached ...>
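To confirm that the trace contains nothing but this futex / SIGPROF / rt_sigreturn loop, a saved copy of the output can be tallied by event type. A rough sketch, assuming the trace was saved to a file; `summarize_strace` is a made-up helper name, and it only understands the two line shapes seen above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count syscalls and signals in saved strace output (stdin).
summarize_strace() {
  awk '
    # Signal delivery lines look like: --- SIGPROF {si_signo=...} ---
    /^--- SIG/ { sig[$2]++; next }
    # Syscall lines look like: name(args...) = result
    /^[a-z_0-9]+\(/ { split($0, parts, "("); sys[parts[1]]++ }
    END {
      for (s in sys) print "syscall", s, sys[s]
      for (s in sig) print "signal", s, sig[s]
    }
  ' | LC_ALL=C sort
}
```

For example, `summarize_strace < trace.txt`, where `trace.txt` is a placeholder for wherever the strace output was saved. Lines such as `strace: Process 2136 attached` are ignored.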
Killing the process returns it to normal:
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ /opt/wavefront/repo/tools/killFdbProcesses.sh 4653
Considering processes: 2136
<---------2136----------
DONE.
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ grep 4653 /var/log/syslog
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": Process 2136 terminated by signal 15, restarting in 0 seconds
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": Launching /usr/bin/numactl (129018) for fdbserver.4653
Dec 22 23:56:05 localhost fdbmonitor[2054]: LogGroup="default" Process="fdbserver.4653": FDBD joined cluster.
ubuntu@redacted-2a-db8-i-01a19620cccdd9f16:~$ fdbcli -C /etc/foundationdb/memory3.cluster --exec 'status details' | grep 10.31.0.87
10.31.0.87:4603 ( 24% cpu; 24% machine; 0.675 Gbps; 3% disk IO;13.2 GB / 35.5 GB RAM )
10.31.0.87:4607 ( 22% cpu; 24% machine; 0.675 Gbps; 2% disk IO;13.2 GB / 35.5 GB RAM )
10.31.0.87:4611 ( 22% cpu; 24% machine; 0.675 Gbps; 3% disk IO;13.0 GB / 35.5 GB RAM )
10.31.0.87:4615 ( 22% cpu; 24% machine; 0.675 Gbps; 2% disk IO;13.2 GB / 35.5 GB RAM )
10.31.0.87:4653 ( 0% cpu; 24% machine; 0.675 Gbps; 3% disk IO; 0.2 GB / 35.5 GB RAM )
10.31.0.87:4657 ( 29% cpu; 24% machine; 0.675 Gbps; 2% disk IO; 1.4 GB / 35.5 GB RAM )