Transparent Huge Pages Performance Impact

I’d just like to share some results from our in-house experiments with Transparent Huge Pages (THP).

FoundationDB calls malloc() extensively, and on Linux those allocations are backed by 4KB pages by default.
Because the in-memory structures / caches are fairly large, often on the order of hundreds of MB,
accessing random 4KB pages causes quite a few TLB misses.
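
As an illustration (this is not the benchmark described below), here’s a minimal C sketch of that access pattern: fault in a few hundred MB of anonymous memory and read random 4KB pages. With the THP policy set to “always”, the same binary may end up backed by 2MB pages without any code change.

/* Minimal sketch (not the benchmark below): touch random 4KB pages across
 * a few hundred MB of memory, roughly the access pattern that stresses the
 * TLB when the region is backed by 4KB pages. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION_SIZE (512UL * 1024 * 1024)  /* 512 MB working set */
#define PAGE_SIZE   4096UL
#define TOUCHES     (64UL * 1024 * 1024)   /* number of random page touches */

int main(void) {
    /* With THP "always", the kernel may transparently back this region with
     * 2MB pages; with 4KB pages, each touch risks a TLB miss. */
    unsigned char *buf = malloc(REGION_SIZE);
    if (!buf) return 1;
    memset(buf, 1, REGION_SIZE);           /* fault everything in up front */

    srand(42);
    unsigned long sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < TOUCHES; i++) {
        size_t page = (size_t)rand() % (REGION_SIZE / PAGE_SIZE);
        sum += buf[page * PAGE_SIZE];      /* one read per random page */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("touched %lu pages in %.2fs (sum=%lu)\n", TOUCHES, secs, sum);
    free(buf);
    return 0;
}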

According to our in-house benchmark, THP reduces TLB misses and improves throughput noticeably.

GET: 274016 reads/sec => 311280 reads/sec (+13.6%)
GET RANGE (10-key range): 183680 reads/sec => 205472 reads/sec (+11.9%)

We used the “always” policy in the experiments above.
# echo "always" > /sys/kernel/mm/transparent_hugepage/enabled

On a related note, there was once some (broken) explicit, non-transparent huge page support in FDB that you might be interested in. See #909.

A solid 12% boost is very nice! It would be interesting to have some details on the cluster setup you’re using and the underlying benchmark here. It would also be interesting to see some resulting data from /proc/*/smaps or something to give a breakdown on huge page usage.

Regarding #909, the problem seems to be partly that the internal accounting doesn’t take 2MB pages into account, which wastes space, and partly unpredictable (or at least unknown) behavior and performance when a huge page allocation fails and/or “magazine” sizes are mixed. I don’t think there’s any way to quantify those things without implementing the necessary accounting and running a lot of tests.

FoundationDB currently allocates huge magazines using mmap with MAP_HUGETLB, hence the need for this accounting. It’s unclear to me how much malloc or the standard C++ allocator is actually used in critical paths, but those are separate heaps and codepaths. It would be interesting to see where most of the benefits come from. (Perhaps the upcoming BCC probe work and some bpf magic can help us one day…)
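
For context, here’s a rough, hypothetical sketch (not FDB’s actual code) of an explicit MAP_HUGETLB allocation with a fallback to regular pages; the fallback path is exactly where the accounting and performance questions above come from. The magazine size and function name are illustrative.

/* Hypothetical sketch of an explicit MAP_HUGETLB "magazine" allocation with
 * a fallback to ordinary pages. Names and sizes are illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MAGAZINE_BYTES (8UL * 1024 * 1024)   /* assumed magazine size */

static void *allocate_magazine(size_t bytes) {
    /* Try pre-reserved huge pages first (requires vm.nr_hugepages or
     * hugetlbfs to be configured by the administrator). */
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;

    /* Huge page reservation failed (e.g. pool exhausted); fall back to
     * 4KB-backed anonymous memory. This is the case where accounting and
     * performance become hard to predict. */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

int main(void) {
    void *m = allocate_magazine(MAGAZINE_BYTES);
    printf("magazine at %p\n", m);
    return m ? 0 : 1;
}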

Here’s a final thought: rather than forcing transparent huge pages across the whole system, perhaps FoundationDB could instead call madvise(MADV_HUGEPAGE) on the relevant address ranges. That still allows THP on a per-process basis without requiring any accounting, in return for relying on the kernel’s huge page threads. I don’t know exactly how khugepaged picks pages, but madvise only works on anonymous mmap(2) regions you allocate yourself, so the “always” policy may be able to do more on its own than madvise can: it will recognize qualifying mmap calls from anywhere, including the libc allocator, which FoundationDB can’t touch.
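
As a rough sketch (illustrative sizes, not FDB code), per-region THP hints would look something like this. It works with the “madvise” or “always” policy and needs no system-wide change.

/* Sketch of per-region THP hints: allocate anonymous memory and ask the
 * kernel to back it with huge pages via madvise(MADV_HUGEPAGE). */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define CACHE_BYTES (256UL * 1024 * 1024)  /* illustrative cache size */

int main(void) {
    void *cache = mmap(NULL, CACHE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cache == MAP_FAILED) return 1;

    /* Hint only: the kernel (fault path / khugepaged) decides whether and
     * when to collapse this range into 2MB pages. */
    if (madvise(cache, CACHE_BYTES, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* e.g. THP disabled or compiled out */

    printf("cache at %p, THP requested for %lu MB\n",
           cache, CACHE_BYTES >> 20);
    return 0;
}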

We’ve discussed hand-picking the segments to use THP via madvise(), but we first wanted to see how much THP would give us without any code changes. I fully agree that madvise() is preferable in general. We also need to be more careful about memory alignment in order to fully utilize THP. Here’s the smaps breakdown for a storage server process (not from the benchmark, but from our development machine). There are 170 small or unaligned segments.

170 AnonHugePages:         0 kB
  7 AnonHugePages:      2048 kB
  3 AnonHugePages:      4096 kB
  4 AnonHugePages:      6144 kB
  1 AnonHugePages:     96256 kB
  1 AnonHugePages:    223232 kB
  1 AnonHugePages:    919552 kB
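
On the alignment point, here’s a rough sketch (illustrative only, not FDB code) of one way to carve out a 2MB-aligned anonymous region by over-allocating and trimming, so the whole range is eligible for huge pages rather than leaving unaligned head/tail portions on 4KB pages.

/* Sketch: obtain a 2MB-aligned anonymous region by over-allocating and
 * trimming the unaligned head and tail, then hint THP for the range.
 * Sizes and names are illustrative. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE (2UL * 1024 * 1024)

static void *alloc_aligned_2mb(size_t bytes) {        /* bytes: multiple of 2MB */
    size_t padded = bytes + HUGE_PAGE;                 /* room to slide to alignment */
    unsigned char *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t start = ((uintptr_t)raw + HUGE_PAGE - 1) & ~(HUGE_PAGE - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = padded - head - bytes;

    if (head) munmap(raw, head);                       /* trim unaligned front */
    if (tail) munmap((void *)(start + bytes), tail);   /* trim leftover back   */

    madvise((void *)start, bytes, MADV_HUGEPAGE);      /* optional THP hint */
    return (void *)start;
}

int main(void) {
    void *p = alloc_aligned_2mb(64UL * 1024 * 1024);
    printf("aligned region at %p\n", p);
    return p ? 0 : 1;
}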

For the benchmark, we used a minimal triple-redundancy cluster with 3 tlogs and 3 storage servers on 6 separate i3 instances, driven by our in-house C benchmark program. We have a tester workload version of the same benchmark almost ready to be merged.