Processes stop logging but appear to be doing work?

In Wavefront land we use a version of wavefront-fdb-tailer to turn the XML trace logs into time series data (internally it goes by the name “loghead”).

We’ve been trying to track down an issue where processes appear to be in the cluster and appear to be doing work, but the trace log is no longer updated. The timestamp of the most recent log file is hours to days in the past.

In this example, FDB is configured to run 16 processes, on ports 4600-4615. Specifically:

# Process port 4613
[fdbserver.4613]
command = /usr/bin/numactl -m 1 -N 1 /usr/sbin/fdbserver
cluster_file = /etc/foundationdb/memory.cluster
memory = 22GiB
storage_memory = 14GiB

In the chart below you’ll see that 4613 is missing.

Linux shows the process running:

ubuntu@redacted-db6-i-005f3e302c63cbd36:~$ ps -eaf | grep 4613
foundat+  12742  12705 40 Jul13 ?        65-02:36:43 /usr/sbin/fdbserver --cluster_file /etc/foundationdb/memory.cluster --datadir /mnt/fdb/4613 --knob_max_shard_bytes 100000000 --listen_address public --logdir /mnt/logs/fdb --memory 22GiB --public_address auto:4613 --storage_memory 14GiB
ubuntu    18254  18240  0 19:44 pts/1    00:00:00 grep --color=auto 4613

strace shows syscalls that look much like those from other processes that are still logging (it’s verbose, but I can paste some traces if useful).
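
Something along these lines is what I mean by strace-ing it (the PID and syscall filter here are just for illustration):

sudo strace -f -tt -p 12742 -e trace=write,openat,fsync -o /tmp/fdb.4613.strace   # -f also follows the trace-writer thread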

fdbcli status details still shows the process.

Last log was days ago:

ubuntu@redacted-2c-db6-i-005f3e302c63cbd36:/mnt/logs/fdb$ ls -ltra *4613*
-rw-r--r-- 1 foundationdb foundationdb 10491075 Dec 10 21:16 trace.010.011.000.068.4613.1563042066.kr4U1C.1357.xml
-rw-r--r-- 1 foundationdb foundationdb 10491349 Dec 10 23:56 trace.010.011.000.068.4613.1563042066.kr4U1C.1358.xml
-rw-r--r-- 1 foundationdb foundationdb 10493407 Dec 11 02:36 trace.010.011.000.068.4613.1563042066.kr4U1C.1359.xml
-rw-r--r-- 1 foundationdb foundationdb 10491435 Dec 11 05:16 trace.010.011.000.068.4613.1563042066.kr4U1C.1360.xml
-rw-r--r-- 1 foundationdb foundationdb 10490790 Dec 11 07:56 trace.010.011.000.068.4613.1563042066.kr4U1C.1361.xml
-rw-r--r-- 1 foundationdb foundationdb 10491141 Dec 11 10:36 trace.010.011.000.068.4613.1563042066.kr4U1C.1362.xml
-rw-r--r-- 1 foundationdb foundationdb 10490651 Dec 11 13:15 trace.010.011.000.068.4613.1563042066.kr4U1C.1363.xml
-rw-r--r-- 1 foundationdb foundationdb 10490318 Dec 11 15:54 trace.010.011.000.068.4613.1563042066.kr4U1C.1364.xml
-rw-r--r-- 1 foundationdb foundationdb 10492646 Dec 11 18:34 trace.010.011.000.068.4613.1563042066.kr4U1C.1365.xml
-rw-r--r-- 1 foundationdb foundationdb 10490647 Dec 11 21:13 trace.010.011.000.068.4613.1563042066.kr4U1C.1366.xml
-rw-r--r-- 1 foundationdb foundationdb  2746311 Dec 11 21:54 trace.010.011.000.068.4613.1563042066.kr4U1C.1367.xml
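
To catch these without eyeballing ls output, something like this rough sketch could flag ports whose newest trace file has gone stale (the logdir, port range, and 6-hour cutoff are assumptions for illustration; the glob is also fairly loose):

for port in $(seq 4600 4615); do
  newest=$(ls -t /mnt/logs/fdb/trace.*.$port.*.xml 2>/dev/null | head -1)     # most recently modified trace file for this port
  [ -n "$newest" ] && [ -z "$(find "$newest" -mmin -360)" ] && echo "stale: port $port -> $newest"
done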

Killing the process ("kill $pid") brings logging back: the restarted process writes a new trace file showing it going through KVSMemRecoveryLogSnap.
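
Concretely, that sequence looks something like this (PID from the ps output above; the grep is just to confirm the recovery event in the new file):

kill 12742                                                   # fdbmonitor restarts the fdbserver
newest=$(ls -t /mnt/logs/fdb/trace.*.4613.*.xml | head -1)   # a fresh trace file shows up
grep -c KVSMemRecoveryLogSnap "$newest"                      # the restarted process logs its memory-storage recovery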

Is this a fault in FDB, and how can we troubleshoot further? If it hits an error writing the trace log, does it just give up on logging? Or is the process actually dead and doing nothing, not just failing to log?

Caught one that stopped more recently: 4652. Nothing in syslog. The process hasn’t written to its trace file for about 6 hours but still shows up in the cluster:

ubuntu@redacted-2a-db8-i-071d7131ab47e308e:/mnt/logs/fdb$ fdbcli -C /etc/foundationdb/memory2.cluster --exec 'status details' | grep 10.8.0.47
  10.8.0.47:4602         ( 31% cpu; 27% machine; 0.837 Gbps;  2% disk IO;13.8 GB / 34.4 GB RAM  )
  10.8.0.47:4606         ( 30% cpu; 27% machine; 0.837 Gbps;  2% disk IO;13.6 GB / 34.4 GB RAM  )
  10.8.0.47:4610         ( 43% cpu; 27% machine; 0.837 Gbps;  2% disk IO;13.8 GB / 34.4 GB RAM  )
  10.8.0.47:4614         ( 30% cpu; 27% machine; 0.837 Gbps;  2% disk IO;13.1 GB / 34.4 GB RAM  )
  10.8.0.47:4652         ( 65% cpu; 27% machine; 0.837 Gbps;  2% disk IO; 0.7 GB / 34.4 GB RAM  )
  10.8.0.47:4656         ( 51% cpu; 27% machine; 0.837 Gbps;  2% disk IO; 0.7 GB / 34.4 GB RAM  )
-rw-r--r-- 1 foundationdb foundationdb  4647005 Dec 21 14:00 trace.010.008.000.047.4652.1568237040.dYve7a.767.xml
ubuntu@redacted-2a-db8-i-071d7131ab47e308e:/mnt/logs/fdb$ vi trace.010.008.000.047.4652.1568237040.dYve7a.767.xml

ubuntu@redacted-2a-db8-i-071d7131ab47e308e:/mnt/logs/fdb$ date
Sat Dec 21 20:37:32 UTC 2019

Last log entry:

<Event Severity="10" Time="1576936850.532650" Type="TLogMetrics" Machine="10.8.0.47:4652" ID="d8c5ce3030cc1bcc" Elapsed="5.00002" bytesInput="7.39713e+06 48502.7 23643726080514" bytesDurable="6.7198e+06 460712 23643672672335" version="14822004912271" sharedBytesInput="41097776865725" sharedBytesDurable="41097723457546" kvstoreBytesUsed="122376192" kvstoreBytesFree="3220708184" kvstoreBytesAvailable="3220708184" kvstoreBytesTotal="3221225472" queueDiskBytesUsed="609316864" queueDiskBytesFree="965321584640" queueDiskBytesAvailable="965321584640" queueDiskBytesTotal="1869103546368" logGroup="default" TrackLatestType="Original"/>
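
The Time attribute is Unix epoch seconds, so it can be checked directly against the file’s mtime:

date -u -d @1576936850   # Sat Dec 21 14:00:50 UTC 2019 -- lines up with the 14:00 mtime above, about 6.5 hours before the "date" output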

How to debug?

As mentioned on your other post, I’m guessing this is a result of the trace logging thread failing in your process. Looking through the code a bit, I’m not sure that there’s much built in to help diagnose this, so I think the best options at this point would be to use a custom build that adds some additional logging and/or to look for external evidence of something going wrong that would generate errors. My best guess as to the source of the error would be that some attempt to interact with the disk has failed, so possibly you could look for something to corroborate that.
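
For example (nothing FDB-specific here, just generic checks): compare the thread count of a stuck fdbserver with a healthy one, see whether it still holds a trace file open, and scan the kernel/system logs for disk errors:

ps -o pid,nlwp,cmd -p $pid              # nlwp = thread count; compare against a healthy fdbserver on the same box
sudo ls -l /proc/$pid/fd | grep trace   # which trace.*.xml file, if any, is still open?
dmesg -T | grep -iE 'i/o error|blk_update|ext4|xfs' | tail
grep -i 'i/o error' /var/log/syslog | tail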

The underlying disk here is the root EBS volume, and nothing in the telegraf disk or diskio metrics stands out to me. Most of these machines run about 40 processes, and it’s only ever one of them that drops out like this.

Perhaps @panghy or @mengranewo could help with a custom build…