Continuous out-of-memory crashes in small cluster with modest read-only workload

I have a small test cluster with three machines. Under a single-threaded range scan workload every fdbserver process in the cluster crashes with OutOfMemory in a loop forever. Has anyone else experienced a similar phenomenon?

Details: Each of my machines runs 17 fdbserver processes. The cluster is in double redundancy mode with the ssd storage engine, and each server process is given 16GiB of memory. The database contains 25GiB logical / 72GiB physical bytes of keys uniformly distributed over 128-bit numbers formatted as 32-character hex strings, such as 012341463b7c48dc6d2c84f6e25d1e01; values are 1000-byte strings of the ASCII letter ‘x’.

The read-only workload is a sequence of 10000-row range scans with random starting keys, generated with this shell fragment:

paste -d " " <(yes getrange) <(hexdump -e '3/1 "%02x" " ~ 10000" "\n"' /dev/urandom) | fdbcli

This generates an infinite stream of commands such as:

getrange 4c5d4b ~ 10000
getrange e15b7d ~ 10000
getrange c17229 ~ 10000
getrange 002e54 ~ 10000
[etc]

The fdbserver processes crash, restart, start out with a normal amount of memory, and slowly climb to 16GiB, whereupon they kill themselves. For example, here are the first 8 processes in my cluster at the moment, according to status details; each has more or less memory depending on its uptime.

Process performance details:
  10.58.227.29:5000      (  2% cpu;  5% machine; 0.038 Gbps; 11% disk IO; 2.3 GB / 21.8 GB RAM  )
  10.58.227.29:5001      (  3% cpu;  5% machine; 0.038 Gbps; 12% disk IO;13.9 GB / 21.8 GB RAM  )
  10.58.227.29:5002      (  3% cpu;  5% machine; 0.038 Gbps; 12% disk IO;11.9 GB / 21.8 GB RAM  )
  10.58.227.29:5003      (  2% cpu;  5% machine; 0.038 Gbps; 12% disk IO; 6.7 GB / 21.8 GB RAM  )
  10.58.227.29:5004      (  2% cpu;  5% machine; 0.038 Gbps; 13% disk IO;15.4 GB / 21.8 GB RAM  )
  10.58.227.29:5005      (  2% cpu;  5% machine; 0.038 Gbps; 11% disk IO; 2.4 GB / 21.8 GB RAM  )
  10.58.227.29:5006      (  2% cpu;  5% machine; 0.038 Gbps; 12% disk IO; 3.8 GB / 21.8 GB RAM  )
  10.58.227.29:5007      (  2% cpu;  5% machine; 0.038 Gbps; 13% disk IO;11.1 GB / 21.8 GB RAM  )

fdbmonitor logs these forever:

Time="1542044692.128336" Severity="40" LogGroup="default" Process="fdbserver.5016": ERROR: Out of memory
Time="1542044692.348744" Severity="40" LogGroup="default" Process="fdbserver.5016": Process 17614 exited 20, restarting in 0 seconds
Time="1542044692.349131" Severity="10" LogGroup="default" Process="fdbserver.5016": Launching /home/jbaker/fdbserver/fdbserver (46165) for fdbserver.5016
Time="1542044692.393349" Severity="10" LogGroup="default" Process="fdbserver.5016": FDBD joined cluster.
Time="1542044692.825305" Severity="40" LogGroup="default" Process="fdbserver.5010": ERROR: Out of memory
Time="1542044693.004250" Severity="40" LogGroup="default" Process="fdbserver.5010": Process 17635 exited 20, restarting in 0 seconds
Time="1542044693.004585" Severity="10" LogGroup="default" Process="fdbserver.5010": Launching /home/jbaker/fdbserver/fdbserver (46169) for fdbserver.5010
Time="1542044693.114201" Severity="10" LogGroup="default" Process="fdbserver.5010": FDBD joined cluster.
Time="1542044695.363137" Severity="40" LogGroup="default" Process="fdbserver.5015": ERROR: Out of memory
Time="1542044695.552435" Severity="40" LogGroup="default" Process="fdbserver.5015": Process 22673 exited 20, restarting in 0 seconds
Time="1542044695.552915" Severity="10" LogGroup="default" Process="fdbserver.5015": Launching /home/jbaker/fdbserver/fdbserver (46186) for fdbserver.5015
Time="1542044695.720293" Severity="10" LogGroup="default" Process="fdbserver.5015": FDBD joined cluster.

Just before they die, the fdbserver processes log traces like these, which lead me to believe that the allocator's own accounting is unaware of the real memory situation (the totals here add up to only about 1GiB):

<Event Severity="10" Time="1542040214.826810" Type="MemoryMetrics" ID="0000000000000000" 
AllocatedMemory16="131072" ApproximateUnusedMemory16="0" AllocatedMemory32="1179648" 
ApproximateUnusedMemory32="786432" AllocatedMemory64="17432576" 
ApproximateUnusedMemory64="1441792" AllocatedMemory128="66846720" 
ApproximateUnusedMemory128="0" AllocatedMemory256="1441792" 
ApproximateUnusedMemory256="1179648" AllocatedMemory512="393216" 
ApproximateUnusedMemory512="131072" AllocatedMemory1024="131072" 
ApproximateUnusedMemory1024="0" AllocatedMemory2048="4194304" 
ApproximateUnusedMemory2048="3932160" AllocatedMemory4096="962330624" 
ApproximateUnusedMemory4096="2883584" Machine="10.58.75.25:5016" LogGroup="default" Roles="SS" 
/>

The process seems to have relatively little RSS and quite a large VM:

$ grep -E 'Rss|Vm' /proc/35546/status
VmPeak: 13888952 kB
VmSize: 13888952 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     72996 kB
VmRSS:     72996 kB
RssAnon:           53412 kB
RssFile:           19584 kB
RssShmem:              0 kB
VmData: 13714524 kB
VmStk:       132 kB
VmExe:     23144 kB
VmLib:      4712 kB
VmPTE:       240 kB
VmPMD:        64 kB
VmSwap:        0 kB

And smaps confirms that most of this space is not in Rss; it's in private anonymous hugetlb mappings, each entry looking like this:

Size:               2048 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:    2048 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB

It looks like the process has limited itself, or been limited by fdbmonitor, to a virtual address space equal to the memory I’ve configured for it (16GiB):

$ grep address /proc/35546/limits
Max address space         17179869184          unlimited            bytes     
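
Presumably something along these lines sets that limit, whether it's fdbserver itself or fdbmonitor doing it (just my guess at the mechanism, not actual FDB code; error handling omitted):

#include <sys/resource.h>

// Set the soft RLIMIT_AS to the configured memory size (e.g. 16ull << 30),
// leaving the hard limit alone, which matches the limits output above.
void limitAddressSpace(rlim_t bytes) {
    rlimit rl;
    getrlimit(RLIMIT_AS, &rl);   // keep the existing hard limit (unlimited)
    rl.rlim_cur = bytes;         // soft limit = configured memory
    setrlimit(RLIMIT_AS, &rl);
}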

I can also see that the fdbserver process does a lot of these 128KiB mmaps:

$ strace -e mmap -p 35546
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01e0000000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfe00000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfc00000
mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f01dfa00000

So it appears to leak these maps without ever paging them in. Each 128KiB request costs a full 2MiB of address space because it explicitly asks for a huge page; the returned addresses above are exactly 2MiB apart. How can I debug this further, or fix it?

You basically debugged 99% of this already. At this point, you can just do what I did and search the codebase for MAP_HUGETLB. If you wanted to debug this as a black box instead, you would need a symbolicated fdbserver binary, plus a tiny shim library that wraps mmap, checks whether MAP_HUGETLB is set in flags, and if so calls some function that crashes or that you have a gdb breakpoint set on.
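
Something like this, built as a shared object and injected with LD_PRELOAD, is the kind of shim I mean (a rough sketch I have not run against a real fdbserver; the file name and message are made up):

// hugetlb_trap.cpp -- hypothetical interposer; build with
//   g++ -std=c++17 -shared -fPIC -o hugetlb_trap.so hugetlb_trap.cpp -ldl
#include <sys/mman.h>
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

extern "C" void* mmap(void* addr, size_t length, int prot, int flags,
                      int fd, off_t offset) noexcept {
    // Look up the real mmap once.
    using mmap_fn = void* (*)(void*, size_t, int, int, int, off_t);
    static mmap_fn real_mmap = reinterpret_cast<mmap_fn>(dlsym(RTLD_NEXT, "mmap"));

    if (flags & MAP_HUGETLB) {
        // Caught a huge-page request: die here so the backtrace shows the
        // caller (or put a gdb breakpoint on this line instead).
        std::fprintf(stderr, "mmap of %zu bytes with MAP_HUGETLB\n", length);
        std::abort();
    }
    return real_mmap(addr, length, prot, flags, fd, offset);
}

Run fdbserver with LD_PRELOAD pointing at that shared object and the core dump or breakpoint will point straight at the allocation site.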

But you’ve laid out enough detail that I believe I can fill in the rest for you anyway. Here’s the background on what’s happening:

FoundationDB is written in a style where arena allocation is frequently used to speed up request handling. The arenas are built from 128KiB chunks pulled from a freelist; if the freelist is empty, we allocate another chunk, and when an arena is destroyed, its 128KiB chunks go back onto the freelist.

You are correct that there is code in FDB to explicitly request huge pages, and they are indeed requested, if supported, when allocating a 128KiB chunk. Looking at this now, I’m not really clear why. I’m going to guess that FDB hasn’t been tested on systems with huge pages available before, because it’s not clear to me why this wouldn’t always quickly OOM. We always request huge pages at first, but if the system ever fails to hand one over, we never ask for them again.
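
Putting those two paragraphs together, the chunk management looks roughly like this (a simplified sketch with made-up names, not the actual FastAlloc code):

#include <sys/mman.h>
#include <cstddef>
#include <new>

static constexpr std::size_t MAGAZINE_SIZE = 128 * 1024;
static void* freelist = nullptr;       // singly linked list threaded through free chunks
static bool hugePagesFailed = false;   // set once a MAP_HUGETLB mmap fails

void* allocateMagazine() {
    if (freelist) {                    // reuse a chunk returned earlier
        void* p = freelist;
        freelist = *static_cast<void**>(p);
        return p;
    }
    void* p = MAP_FAILED;
    if (!hugePagesFailed) {
        // Always try a huge page first. The kernel rounds a hugetlb mapping
        // up to the 2MiB page size, so each 128KiB request quietly consumes
        // 2MiB of address space.
        p = mmap(nullptr, MAGAZINE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) hugePagesFailed = true;   // never ask again
    }
    if (p == MAP_FAILED)
        p = mmap(nullptr, MAGAZINE_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) throw std::bad_alloc();
    return p;
}

void releaseMagazine(void* p) {         // chunks come back when an arena dies;
    *static_cast<void**>(p) = freelist; // the mapping itself is never unmapped
    freelist = p;
}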

There is unfortunately no knob you can easily tweak to get out of this situation. Disabling hugepages on your systems is probably the only sensible way forward. A variant of the shim above that makes mmap return MAP_FAILED whenever MAP_HUGETLB is set would also work, as an obscenely hacky LD_PRELOAD workaround.

This really reads like what we should be doing is viewing magazine_size as 128KiB if huge pages aren’t supported, and 2MiB if they are supported.

You’re welcome to file an issue on GitHub about inappropriately requesting hugepages, or I’ll get around to filing one sometime later today. This also means I need to go figure out how to set up a VM with hugepage support… hrm.

We’re going to be cutting a hopefully-publishable 6.0 release real soon now, so I filed #909 and will try to quickly submit a PR that disables our requesting of huge pages, as it appears to be obviously broken.

Thanks. Is that why I’m the first person to hit this? It’s new in v6?

Not many people run their systems with huge pages enabled. The implication here is that you’re using servers that someone, likely other than you, configured to run a bunch of services, and who also went through the effort of turning on hugepage support. That’s also why disabling huge pages on your system will make FDB stop doing the wrong thing.

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

That shows them as disabled by default on my Ubuntu 16.04 VM, which is what I use for most of my development.

You might be able to just do sudo sysctl -w vm.nr_hugepages=0 to flip them off, if there’s nothing else running on your machines that’ll be impacted.

It’ll be fixed in 6.0, as #910 was merged in time to make the cut.

I already have vm.nr_hugepages=0. I don’t believe that effectively disables hugepages; it only means no huge pages are reserved persistently. I have nr_overcommit_hugepages=200000, intentionally, so huge pages can still be allocated on demand as “surplus” pages. The output you show does not necessarily indicate that your system has huge pages disabled, although the complete absence of them is strong evidence. Mine looks like this; you can see that my hugepages are all “surplus”, that is, over and above nr_hugepages.

$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:   88563
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:    88563
Hugepagesize:       2048 kB

But anyway, have you considered simply getting your 128KiB blocks of memory from operator new[] instead of the FoundationDB-defined ::allocate? That would allow users to build with their favorite allocator, and some of those allocators are quite battle-tested; some are even hugepages-aware.

Ah, you’re more knowledgeable than me on the subject then. If you have a link to suggest on the current state of Hugepage support and configuration for it, I’d be interested.

I mean, it’s a surprise to me that we’re directly calling mmap, and that we’re requesting huge pages. I think @Dave would be the only person that might know why this code exists. It kind of looks like it might have been a performance experiment that wasn’t finished, as the test for it is commented out?

I’d expect that replacing mmap with malloc would have no effect, as default glibc malloc will just call mmap anyway for allocations this large. But it’s the sort of thing you’d have to spend some moderate time performance testing, to make sure nothing is subtly different in a way that affects performance.

I see that tcmalloc has an open issue for hugepage support, but I don’t see any mention of it in tcmalloc itself. Could you point me towards a description of which allocators do more intelligent things on hugepage systems, and what those intelligent things are?

IIRC we long ago enabled large page support because it reduces TLB pressure, and thus improves performance a bit everywhere. I don’t have access to deep source control history, but I’m guessing the bug under discussion was introduced by reducing the size of allocations done by fast allocator, which was probably 2MiB+ originally.

Fast allocator never frees these blocks, so a fix that grabs memory in 2MiB chunks and hands it to fast allocator in smaller pieces wouldn’t be rocket science. Getting the memory from a malloc that supports large pages would probably also be fine, though it could be bad if it interleaves bookkeeping information (unnecessary here, since we never free) with otherwise perfectly aligned chunks.
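
Something along those lines, carving each 2MiB huge page into sixteen 128KiB magazines for the freelist, might look like this (a sketch with made-up names, not an actual patch):

#include <sys/mman.h>
#include <cstddef>
#include <new>

static constexpr std::size_t HUGE_PAGE = 2 * 1024 * 1024;
static constexpr std::size_t MAGAZINE  = 128 * 1024;
static void* freelist = nullptr;

static void refillFromHugePage() {
    void* block = mmap(nullptr, HUGE_PAGE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (block == MAP_FAILED) {
        // Fall back to ordinary pages; a real patch would remember the
        // failure and stop asking for huge pages, as today.
        block = mmap(nullptr, HUGE_PAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (block == MAP_FAILED) throw std::bad_alloc();
    }
    // Push all sixteen 128KiB slices onto the freelist. Since fast allocator
    // never unmaps anything, splitting one mapping this way is safe.
    char* base = static_cast<char*>(block);
    for (std::size_t off = 0; off < HUGE_PAGE; off += MAGAZINE) {
        void* slice = base + off;
        *static_cast<void**>(slice) = freelist;
        freelist = slice;
    }
}

void* allocateMagazine() {
    if (!freelist) refillFromHugePage();
    void* p = freelist;
    freelist = *static_cast<void**>(p);
    return p;
}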

TLB pressure is very important, so I definitely agree with the idea of using huge pages where applicable. jemalloc 5 and later are hugepages-friendly. tcmalloc may someday be (I have seen versions of tcmalloc that are).

As far as alignment goes, are there particular requirements? operator new[] returns a pointer aligned as requested. The metadata and disadvantageous alignment you are thinking of come from writing foo = new T[5] rather than from calling operator new[] itself; the compiler adds that metadata so it can later do an unsized delete, as in delete[] foo. If you intend to call a sized delete, or to never delete at all, then you can call operator new[] directly.
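
To illustrate the difference (a standalone example, not FDB code):

#include <cstddef>
#include <new>

struct T {
    ~T() {}   // user-provided destructor, so new T[n] needs an element count
};

void example() {
    // new T[5]: the compiler typically over-allocates and stores the count
    // (the "cookie") in front of the array so that delete[] p can run the
    // destructors. That cookie is the metadata and offset in question.
    T* p = new T[5];
    delete[] p;

    // Calling the allocation function directly returns raw storage with no
    // cookie; pair it with a direct call to the sized deallocation function,
    // or simply never free it.
    void* raw = operator new[](128 * 1024);
    operator delete[](raw, std::size_t(128 * 1024));   // sized delete, C++14
}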

I’m going to play around with this and see what happens if I replace FDB’s ::allocate with something else.

Stumbling over the answer to my own question: the result of ::allocate must be aligned to 4096 bytes, or the process will kill itself:

ASSERT( int64_t(io->buf) % 4096 == 0 && io->offset % 4096 == 0 && io->nbytes % 4096 == 0 );

This means the operator new[](std::size_t, std::align_val_t, const std::nothrow_t&) overload can be used, but that requires C++17.
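
For the record, a minimal sketch of what that looks like (illustrative only; I haven't wired this into FDB yet):

#include <cstddef>
#include <new>

// 128KiB chunk aligned to 4096 bytes via C++17's aligned, nothrow operator new[].
void* allocate128k() {
    void* p = operator new[](128 * 1024, std::align_val_t(4096), std::nothrow);
    // p is either nullptr or 4096-byte aligned, which satisfies the ASSERT
    // in the direct-I/O path quoted above.
    return p;
}

void free128k(void* p) {
    // The aligned form has to be matched by the corresponding aligned delete.
    operator delete[](p, std::align_val_t(4096), std::nothrow);
}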