I have a small test cluster with three machines. Under a single-threaded range-scan workload, every fdbserver process in the cluster crashes with an out-of-memory error, restarts, and crashes again, forever. Has anyone else seen something similar?
Details: Each of my machines runs 17 fdbserver processes. The cluster is in double redundancy mode with the ssd storage engine, and each process is given 16 GiB of memory. The database contains 25 GiB logical / 72 GiB physical of key-value data. Keys are uniformly distributed over the 128-bit hex-formatted numbers, such as 012341463b7c48dc6d2c84f6e25d1e01, and values are 1000-byte strings of the ASCII letter ‘x’.
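For reference, data with this shape can be generated with something like the following sketch (this is not my actual loader, and the key count is just roughly what 25 GiB works out to at ~1 KB per key-value pair):
# sketch loader: 32-hex-character keys from /dev/urandom, 1000-byte values of 'x'
# ~26 million pairs gives on the order of 25 GiB of logical key-value data
value=$(printf 'x%.0s' {1..1000})
for i in $(seq 1 26000000); do
  printf 'set %s %s\n' "$(hexdump -n 16 -e '16/1 "%02x"' /dev/urandom)" "$value"
done | fdbcli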
The read-only workload is a sequence of 10000-row range scans with random starting keys, generated with this shell fragment:
paste -d " " <(yes getrange) <(hexdump -e '3/1 "%02x" " ~ 10000" "\n"' /dev/urandom) | fdbcli
This generates an infinite stream of commands such as:
getrange 4c5d4b ~ 10000
getrange e15b7d ~ 10000
getrange c17229 ~ 10000
getrange 002e54 ~ 10000
[etc]
The fdbserver processes crash and restart in a loop: each starts with a normal amount of memory, grows slowly toward 16 GiB, and then kills itself. For example, here are the first 8 processes in my cluster at the moment, according to status details; each has more or less memory depending on how long it has been up.
Process performance details:
10.58.227.29:5000 ( 2% cpu; 5% machine; 0.038 Gbps; 11% disk IO; 2.3 GB / 21.8 GB RAM )
10.58.227.29:5001 ( 3% cpu; 5% machine; 0.038 Gbps; 12% disk IO;13.9 GB / 21.8 GB RAM )
10.58.227.29:5002 ( 3% cpu; 5% machine; 0.038 Gbps; 12% disk IO;11.9 GB / 21.8 GB RAM )
10.58.227.29:5003 ( 2% cpu; 5% machine; 0.038 Gbps; 12% disk IO; 6.7 GB / 21.8 GB RAM )
10.58.227.29:5004 ( 2% cpu; 5% machine; 0.038 Gbps; 13% disk IO;15.4 GB / 21.8 GB RAM )
10.58.227.29:5005 ( 2% cpu; 5% machine; 0.038 Gbps; 11% disk IO; 2.4 GB / 21.8 GB RAM )
10.58.227.29:5006 ( 2% cpu; 5% machine; 0.038 Gbps; 12% disk IO; 3.8 GB / 21.8 GB RAM )
10.58.227.29:5007 ( 2% cpu; 5% machine; 0.038 Gbps; 13% disk IO;11.1 GB / 21.8 GB RAM )
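To watch this growth over time rather than as a snapshot, something like the following can poll the cluster's own view of per-process memory (a sketch; it assumes jq is installed and the usual status json layout, i.e. .cluster.processes[].memory.used_bytes):
while true; do
  # print each process's address and its memory use as reported by the cluster
  fdbcli --exec 'status json' \
    | jq -r '.cluster.processes[] | "\(.address) \(.memory.used_bytes)"'
  sleep 10
done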
fdbmonitor logs these forever:
Time="1542044692.128336" Severity="40" LogGroup="default" Process="fdbserver.5016": ERROR: Out of memory
Time="1542044692.348744" Severity="40" LogGroup="default" Process="fdbserver.5016": Process 17614 exited 20, restarting in 0 seconds
Time="1542044692.349131" Severity="10" LogGroup="default" Process="fdbserver.5016": Launching /home/jbaker/fdbserver/fdbserver (46165) for fdbserver.5016
Time="1542044692.393349" Severity="10" LogGroup="default" Process="fdbserver.5016": FDBD joined cluster.
Time="1542044692.825305" Severity="40" LogGroup="default" Process="fdbserver.5010": ERROR: Out of memory
Time="1542044693.004250" Severity="40" LogGroup="default" Process="fdbserver.5010": Process 17635 exited 20, restarting in 0 seconds
Time="1542044693.004585" Severity="10" LogGroup="default" Process="fdbserver.5010": Launching /home/jbaker/fdbserver/fdbserver (46169) for fdbserver.5010
Time="1542044693.114201" Severity="10" LogGroup="default" Process="fdbserver.5010": FDBD joined cluster.
Time="1542044695.363137" Severity="40" LogGroup="default" Process="fdbserver.5015": ERROR: Out of memory
Time="1542044695.552435" Severity="40" LogGroup="default" Process="fdbserver.5015": Process 22673 exited 20, restarting in 0 seconds
Time="1542044695.552915" Severity="10" LogGroup="default" Process="fdbserver.5015": Launching /home/jbaker/fdbserver/fdbserver (46186) for fdbserver.5015
Time="1542044695.720293" Severity="10" LogGroup="default" Process="fdbserver.5015": FDBD joined cluster.
Just before they die, the fdbserver processes log MemoryMetrics trace events like the one below, which leads me to believe that the allocator's own accounting doesn't reflect the memory actually being consumed:
<Event Severity="10" Time="1542040214.826810" Type="MemoryMetrics" ID="0000000000000000"
AllocatedMemory16="131072" ApproximateUnusedMemory16="0" AllocatedMemory32="1179648"
ApproximateUnusedMemory32="786432" AllocatedMemory64="17432576"
ApproximateUnusedMemory64="1441792" AllocatedMemory128="66846720"
ApproximateUnusedMemory128="0" AllocatedMemory256="1441792"
ApproximateUnusedMemory256="1179648" AllocatedMemory512="393216"
ApproximateUnusedMemory512="131072" AllocatedMemory1024="131072"
ApproximateUnusedMemory1024="0" AllocatedMemory2048="4194304"
ApproximateUnusedMemory2048="3932160" AllocatedMemory4096="962330624"
ApproximateUnusedMemory4096="2883584" Machine="10.58.75.25:5016" LogGroup="default" Roles="SS"
/>
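Summing the AllocatedMemory* fields of that event gives only about 1 GiB, nowhere near the 16 GiB at which the process dies. A quick way to check against the trace logs (assuming the default trace file names, with one event per line, run from the fdbserver log directory):
# sum the AllocatedMemory* counters from the most recent MemoryMetrics event
grep MemoryMetrics trace.*.xml | tail -1 \
  | grep -o 'AllocatedMemory[0-9]*="[0-9]*"' \
  | cut -d'"' -f2 \
  | awk '{ sum += $1 } END { printf "%.2f GiB tracked by the allocator\n", sum / (1024*1024*1024) }'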
The process has relatively little RSS but quite a large virtual size:
$ grep -E 'Rss|Vm' status
VmPeak: 13888952 kB
VmSize: 13888952 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 72996 kB
VmRSS: 72996 kB
RssAnon: 53412 kB
RssFile: 19584 kB
RssShmem: 0 kB
VmData: 13714524 kB
VmStk: 132 kB
VmExe: 23144 kB
VmLib: 4712 kB
VmPTE: 240 kB
VmPMD: 64 kB
VmSwap: 0 kB
And smaps confirms that most of this space is not resident; it's in private anonymous hugetlb mappings. Here is one representative 2 MiB mapping:
Size: 2048 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 2048 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
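Summing over every mapping confirms it isn't just this one: most of the virtual size shows up as Private_Hugetlb, with almost nothing resident (35546 is the pid of this fdbserver process, the same one used in the limits and strace output below):
# total hugetlb vs. resident memory across all of the process's mappings
awk '/^Private_Hugetlb:/ { huge += $2 }
     /^Rss:/             { rss  += $2 }
     END { printf "Private_Hugetlb: %d kB  Rss: %d kB\n", huge, rss }' /proc/35546/smaps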
It looks like the process has limited itself (or been limited by fdbmonitor) to a virtual address space equal to the memory I've configured for it (16 GiB):
$ grep address /proc/35546/limits
Max address space 17179869184 unlimited bytes
I can also see, via strace, that the fdbserver process makes a lot of these 128 KiB MAP_HUGETLB mmaps:
$ strace -e mmap -p 35546
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01e0000000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfe00000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfc00000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfa00000
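The returned addresses step down by 0x200000 (2 MiB) on every call, so address space is being consumed in 2 MiB chunks even though only 128 KiB is requested each time. Some quick arithmetic against the 16 GiB limit (note that the ~1 GiB of memory actually requested is roughly in line with what the MemoryMetrics event shows the allocator tracking):
# back-of-the-envelope: how many 2 MiB hugetlb mappings fit under the 16 GiB limit,
# and how much memory would actually have been requested at 128 KiB per mapping
echo "mappings to exhaust the limit: $(( 16 * 1024**3 / (2 * 1024**2) ))"     # 8192
echo "address space consumed:        $(( 8192 * 2 )) MiB"                     # 16384 MiB
echo "memory actually requested:     $(( 8192 * 128 / 1024 )) MiB"            # 1024 MiB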
So it appears to leak these mappings without ever paging them in, and because it explicitly requests huge pages, each 128 KiB map costs a full 2 MiB of virtual address space. How can I debug this further, or fix it?