Hiya Folks,
Twice yesterday, across 3 separate sets of fdbserver
processes running the storage
role, we ran into what looks like either some pathological case in our read or write paths that caused there processes to consume a lot of memory or a potential leak. We don’t have enough data either way at this point to determine the likely cause. Oddly, we did not observe a large increase in Read or Write IOPS during this time or in a spike in requests to the servers. However, it is possible we are missing some instrumentation.
In both cases 3 fdbserver processes running the storage
roles drastically increased memory usage and either hit an OutOfMemory
condition in the Plaform/FastAlloc slab allocator code, or actually segfaulted.
The interesting part is, this condition appeared simultaneously across 3 individual fdbserver
processes which were spread across 3 AWS availability zones, so it appears to have affected an entire “team”. The storage servers run 4 fdbserver processes (one per core) and the other 3 processes on the affected servers remained operational. During the time the entire team was down, fdbcli showed a Fault Tolerance of zero, as expected. The good news is our stream processing systems that work with FoundationDB eventually hit head-of-line blocking while the team was down, so no data was lost, just delayed as we waited for the processes to recover.
When we hit the OutOfMemory
condition the fdbserver process would exit cleanly and be restarted by fdbmonitor
quickly. However, in cases when we hit a SIGSEGV
, apport on ubuntu trapped the signal and attempted to write a .crash file. This kept the hung process up for 10-20 minutes before fdbmonitor could restart.
In most cases apport gave up on writing the core dump and crash file, however I was able to capture one complete core dump. Our version of fdbserver 6.0.18 is stripped though so the backtrace isn’t very useful.
We have a few questions on best steps forward to debug.
1.) Are there versions of 6.0.18 available that are compiled with debug symbols that we can use to gdb the core file and get a better backtrace?
2.) Are there additional debug flags we should enable to try to capture more data about the cause of the incident?
3.) Any suggestions on what else to look for in the trace log files to determine the cause?
Additionally we noticed that we exhausted the ApproximateUnusedMemory16 slabs before we hit the OutOfMemory
function.
Thanks in advance for any debugging suggestions you can provide. Additional system information is provided below.
<Event Severity="10" Time="1589230953.900548" Type="MemoryMetrics" ID="0000000000000000" TotalMemory16="262144" ApproximateUnusedMemory16="0" ActiveThreads16="1" TotalMemory32="38141952" ApproximateUnusedMemory32="37748736" ActiveThreads32="1" TotalMemory64="762839040" ApproximateUnusedMemory64="1966080" ActiveThreads64="3" TotalMemory128="1171652608" ApproximateUnusedMemory128="2883584" ActiveThreads128="1" TotalMemory256="657195008" ApproximateUnusedMemory256="368705536" ActiveThreads256="1" TotalMemory512="1138360320" ApproximateUnusedMemory512="3801088" ActiveThreads512="1" TotalMemory1024="86769664" ApproximateUnusedMemory1024="86507520" ActiveThreads1024="1" TotalMemory2048="772145152" ApproximateUnusedMemory2048="771883008" ActiveThreads2048="1" TotalMemory4096="2386427904" ApproximateUnusedMemory4096="285605888" ActiveThreads4096="1" Machine="x.x.x.x:4503" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1589230957.218488" Type="OutOfMemory" ID="0000000000000000" Message="Out of memory" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1428186 0x14268f7 0x1414e01 0x1414e9c 0x1414f19 0x13f1ae0 0x13f2742 0x45b805 0x45d523 0x1427d88 0x14281b8 0x14268f7 0x1414e01 0x1414e9c 0x1414f19 0x13f1ae0 0x13f2742 0x45b805 0x45d523 0x1427d88 0x14281b8 0x14268f7 0x1414e01 0x1414e9c 0x1414f19 0x13f1ae0 0x13f2742 0x45b805 0x45d523 0x1427d88 0x14281b8 0x14268f7 0x1414e01 0x1414e9c 0x1414f19 0x13f1ae0 0x13f2742 0xa1010f 0xa2e653 0x591275 0x135ecd8 0x135f113 0x135f5d8 0x1361464 0x136051b 0x1361855 0x5..." Machine="x.x.x.x:4503" LogGroup="default" Roles="SS" />
Cluster Info
---------------------------------------
FDB Version: 6.0.18
Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Cluster:
FoundationDB processes - 462
Machines - 126
Fault Tolerance - 2 machines
OS
---------------------------------------
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial
Linux version 4.4.0-1050-aws (buildd@lgw01-amd64-017) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5) ) #59-Ubuntu SMP Tue Jan 30 19:57:10 UTC 2018
The storage nodes are AWS i3.xlarge instances
H/W path Device Class Description
----------------------------------------------------------
system HVM domU
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/401 processor Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
/0/402 processor CPU
/0/403 processor CPU
/0/404 processor CPU
/0/1000 memory 30GiB System Memory
/0/1000/0 memory 16GiB DIMM RAM
/0/1000/1 memory 14GiB DIMM RAM
/0/100 bridge 440FX - 82441FX PMC [Natoma]
/0/100/1 bridge 82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1 storage 82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3 bridge 82371AB/EB/MB PIIX4 ACPI
/0/100/2 display GD 5446
/0/100/3 ens3 network Elastic Network Adapter (ENA)
/0/100/1e storage NVMe SSD Controller
/0/100/1f generic Xen Platform Device
/1 vethee219e6 network Ethernet interface
/2 veth1900bd7 network Ethernet interface
ProblemType: Crash
Architecture: amd64
Date: Mon May 11 21:07:13 2020
DistroRelease: Ubuntu 16.04
ExecutablePath: /usr/sbin/fdbserver
ExecutableTimestamp: 1547074069
ProcCmdline: /usr/sbin/fdbserver --knob_cleaning_interval 0.1 --class storage --cluster_file /etc/foundationdb/fdb.cluster --datadir /data/4503 --listen_address public --locality_zoneid us-west-2c --logdir /var/log/foundationdb --public_address auto:4503
ProcCwd: /
ProcEnviron:
LANG=en_US.UTF-8
PATH=(custom, no user)
ProcMaps:
00400000-01c72000 r-xp 00000000 ca:01 53625 /usr/sbin/fdbserver
01e71000-0202a000 rw-p 01871000 ca:01 53625 /usr/sbin/fdbserver
0202a000-02048000 rw-p 00000000 00:00 0
0315b000-759d4000 rw-p 00000000 00:00 0 [heap]
7fd56fcc7000-7fd645d88000 rw-p 00000000 00:00 0
7fd645d89000-7fd66cea9000 rw-p 00000000 00:00 0
7fd66ceaa000-7fd68780a000 rw-p 00000000 00:00 0
7fd68780b000-7fd6a678b000 rw-p 00000000 00:00 0
7fd6a678c000-7fd6bb4ac000 rw-p 00000000 00:00 0
7fd6bb4ad000-7fd6c582d000 rw-p 00000000 00:00 0
7fd6c582e000-7fd6c7cce000 rw-p 00000000 00:00 0
7fd6c7ce5000-7fd6d5225000 rw-p 00000000 00:00 0
7fd6d5226000-7fd6db9c6000 rw-p 00000000 00:00 0
7fd6db9c7000-7fd6dcce7000 rw-p 00000000 00:00 0
7fd6dcd01000-7fd6e6941000 rw-p 00000000 00:00 0
7fd6e6942000-7fd6e7102000 rw-p 00000000 00:00 0
7fd6e7103000-7fd6e7503000 rw-p 00000000 00:00 0
7fd6e7520000-7fd6ec000000 rw-p 00000000 00:00 0
7fd6ec000000-7fd6ec021000 rw-p 00000000 00:00 0
7fd6ec021000-7fd6f0000000 ---p 00000000 00:00 0
7fd6f000a000-7fd6f300a000 rw-p 00000000 00:00 0
7fd6f301b000-7fd6f30bb000 rw-p 00000000 00:00 0
7fd6f30cc000-7fd6f312c000 rw-p 00000000 00:00 0
7fd6f313d000-7fd6f31bd000 rw-p 00000000 00:00 0
7fd6f31cf000-7fd6f32cf000 rw-p 00000000 00:00 0
7fd6f32e0000-7fd6f4000000 rw-p 00000000 00:00 0
7fd6f4000000-7fd6f4029000 rw-p 00000000 00:00 0
7fd6f4029000-7fd6f8000000 ---p 00000000 00:00 0
7fd6f801d000-7fd6f8e3d000 rw-p 00000000 00:00 0
7fd6f8e4e000-7fd6f9984000 rw-p 00000000 00:00 0
7fd6f9984000-7fd6f9985000 ---p 00000000 00:00 0
7fd6f9985000-7fd6fa185000 rw-p 00000000 00:00 0
7fd6fa185000-7fd6fa19b000 r-xp 00000000 ca:01 1965 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd6fa19b000-7fd6fa39a000 ---p 00016000 ca:01 1965 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd6fa39a000-7fd6fa39b000 rw-p 00015000 ca:01 1965 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fd6fa39b000-7fd6fa39c000 ---p 00000000 00:00 0
7fd6fa39c000-7fd6fab9c000 rw-p 00000000 00:00 0
7fd6fab9c000-7fd6fad5c000 r-xp 00000000 ca:01 21455 /lib/x86_64-linux-gnu/libc-2.23.so
7fd6fad5c000-7fd6faf5c000 ---p 001c0000 ca:01 21455 /lib/x86_64-linux-gnu/libc-2.23.so
7fd6faf5c000-7fd6faf60000 r--p 001c0000 ca:01 21455 /lib/x86_64-linux-gnu/libc-2.23.so
7fd6faf60000-7fd6faf62000 rw-p 001c4000 ca:01 21455 /lib/x86_64-linux-gnu/libc-2.23.so
7fd6faf62000-7fd6faf66000 rw-p 00000000 00:00 0
7fd6faf66000-7fd6fb06e000 r-xp 00000000 ca:01 21458 /lib/x86_64-linux-gnu/libm-2.23.so
7fd6fb06e000-7fd6fb26d000 ---p 00108000 ca:01 21458 /lib/x86_64-linux-gnu/libm-2.23.so
7fd6fb26d000-7fd6fb26e000 r--p 00107000 ca:01 21458 /lib/x86_64-linux-gnu/libm-2.23.so
7fd6fb26e000-7fd6fb26f000 rw-p 00108000 ca:01 21458 /lib/x86_64-linux-gnu/libm-2.23.so
7fd6fb26f000-7fd6fb276000 r-xp 00000000 ca:01 21438 /lib/x86_64-linux-gnu/librt-2.23.so
7fd6fb276000-7fd6fb475000 ---p 00007000 ca:01 21438 /lib/x86_64-linux-gnu/librt-2.23.so
7fd6fb475000-7fd6fb476000 r--p 00006000 ca:01 21438 /lib/x86_64-linux-gnu/librt-2.23.so
7fd6fb476000-7fd6fb477000 rw-p 00007000 ca:01 21438 /lib/x86_64-linux-gnu/librt-2.23.so
7fd6fb477000-7fd6fb48f000 r-xp 00000000 ca:01 21442 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd6fb48f000-7fd6fb68e000 ---p 00018000 ca:01 21442 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd6fb68e000-7fd6fb68f000 r--p 00017000 ca:01 21442 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd6fb68f000-7fd6fb690000 rw-p 00018000 ca:01 21442 /lib/x86_64-linux-gnu/libpthread-2.23.so
7fd6fb690000-7fd6fb694000 rw-p 00000000 00:00 0
7fd6fb694000-7fd6fb697000 r-xp 00000000 ca:01 21440 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd6fb697000-7fd6fb896000 ---p 00003000 ca:01 21440 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd6fb896000-7fd6fb897000 r--p 00002000 ca:01 21440 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd6fb897000-7fd6fb898000 rw-p 00003000 ca:01 21440 /lib/x86_64-linux-gnu/libdl-2.23.so
7fd6fb898000-7fd6fb8be000 r-xp 00000000 ca:01 21441 /lib/x86_64-linux-gnu/ld-2.23.so
7fd6fb8c4000-7fd6fb946000 rw-p 00000000 00:00 0
7fd6fb946000-7fd6fb947000 ---p 00000000 00:00 0
7fd6fb947000-7fd6fbab5000 rw-p 00000000 00:00 0
7fd6fbab9000-7fd6fbabb000 rw-s 00000000 00:0d 336401235 /[aio] (deleted)
7fd6fbabb000-7fd6fbabd000 rw-p 00000000 00:00 0
7fd6fbabd000-7fd6fbabe000 r--p 00025000 ca:01 21441 /lib/x86_64-linux-gnu/ld-2.23.so
7fd6fbabe000-7fd6fbabf000 rw-p 00026000 ca:01 21441 /lib/x86_64-linux-gnu/ld-2.23.so
7fd6fbabf000-7fd6fbac0000 rw-p 00000000 00:00 0
7fffe59bb000-7fffe59eb000 rw-p 00000000 00:00 0 [stack]
7fffe59f8000-7fffe59fb000 r--p 00000000 00:00 0 [vvar]
7fffe59fb000-7fffe59fd000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
ProcStatus:
Name: fdbserver
State: S (sleeping)
Tgid: 22423
Ngid: 0
Pid: 22423
PPid: 14599
TracerPid: 0
Uid: 113 113 113 113
Gid: 117 117 117 117
FDSize: 512
Groups:
NStgid: 22423
NSpid: 22423
NSpgid: 14599
NSsid: 14599
VmPeak: 8388608 kB
VmSize: 8388608 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 8198196 kB
VmRSS: 8198196 kB
VmData: 8346052 kB
VmStk: 192 kB
VmExe: 25032 kB
VmLib: 3224 kB
VmPTE: 16132 kB
VmPMD: 40 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
Threads: 4
SigQ: 0/122586
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 00000001840008c8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
Seccomp: 0
Cpus_allowed: 7fff
Cpus_allowed_list: 0-14
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 2895
nonvoluntary_ctxt_switches: 13836
Signal: 11
Uname: Linux 4.4.0-1050-aws x86_64
UserGroups: