In Wavefront land we use a version of wavefront-fdb-tailer to take the XML trace logs into time series data (internally it goes by the name “loghead
”).
We’ve been trying to track down a thing where processes appear to be in the cluster and appear to be doing work but the trace log is no longer updated. The timestamp is the most recently log file is hours to days in the past.
In this example, Fdb is configured to run 16 processes, from 4600-4615. Specifically,
# Process port 4613
[fdbserver.4613]
command = /usr/bin/numactl -m 1 -N 1 /usr/sbin/fdbserver
cluster_file = /etc/foundationdb/memory.cluster
memory = 22GiB
storage_memory = 14GiB
In the chart below you’ll see that 4613 is missing.
The Linux shows the process running:
ubuntu@redacted-db6-i-005f3e302c63cbd36:~$ ps -eaf | grep 4613
foundat+ 12742 12705 40 Jul13 ? 65-02:36:43 /usr/sbin/fdbserver --cluster_file /etc/foundationdb/memory.cluster --datadir /mnt/fdb/4613 --knob_max_shard_bytes 100000000 --listen_address public --logdir /mnt/logs/fdb --memory 22GiB --public_address auto:4613 --storage_memory 14GiB
ubuntu 18254 18240 0 19:44 pts/1 00:00:00 grep --color=auto 4613
strace
shows calls that appear to match other processes that are logging (it’s verbose but if it’s useful I can paste some traces).
status details
shows the process.
Last log was days ago:
ubuntu@redacted-2c-db6-i-005f3e302c63cbd36:/mnt/logs/fdb$ ls -ltra *4613*
-rw-r--r-- 1 foundationdb foundationdb 10491075 Dec 10 21:16 trace.010.011.000.068.4613.1563042066.kr4U1C.1357.xml
-rw-r--r-- 1 foundationdb foundationdb 10491349 Dec 10 23:56 trace.010.011.000.068.4613.1563042066.kr4U1C.1358.xml
-rw-r--r-- 1 foundationdb foundationdb 10493407 Dec 11 02:36 trace.010.011.000.068.4613.1563042066.kr4U1C.1359.xml
-rw-r--r-- 1 foundationdb foundationdb 10491435 Dec 11 05:16 trace.010.011.000.068.4613.1563042066.kr4U1C.1360.xml
-rw-r--r-- 1 foundationdb foundationdb 10490790 Dec 11 07:56 trace.010.011.000.068.4613.1563042066.kr4U1C.1361.xml
-rw-r--r-- 1 foundationdb foundationdb 10491141 Dec 11 10:36 trace.010.011.000.068.4613.1563042066.kr4U1C.1362.xml
-rw-r--r-- 1 foundationdb foundationdb 10490651 Dec 11 13:15 trace.010.011.000.068.4613.1563042066.kr4U1C.1363.xml
-rw-r--r-- 1 foundationdb foundationdb 10490318 Dec 11 15:54 trace.010.011.000.068.4613.1563042066.kr4U1C.1364.xml
-rw-r--r-- 1 foundationdb foundationdb 10492646 Dec 11 18:34 trace.010.011.000.068.4613.1563042066.kr4U1C.1365.xml
-rw-r--r-- 1 foundationdb foundationdb 10490647 Dec 11 21:13 trace.010.011.000.068.4613.1563042066.kr4U1C.1366.xml
-rw-r--r-- 1 foundationdb foundationdb 2746311 Dec 11 21:54 trace.010.011.000.068.4613.1563042066.kr4U1C.1367.xml
Killing the process ("kill $pid
") recovers the log file causing it to go into KVSMemRecoveryLogSnap
.
Is this a fault of Fdb? And how to troubleshoot more? If it encounters a fault in writing the trace log, does it just give up? Or is the process actually dead and not logging?