What knobs should we set to increase cache efficiency?

We are currently in the process of resizing our biggest FDB cluster. We will scale out (adding more hosts) and scale up (running on bigger machines).

Now the question we have is about main memory usage. Each storage process will have around 80GB of main memory at its disposal. However, I do not believe that the storage will use this memory efficiently by default.

So we plan to set the following knobs:


Does this sound reasonable? Are there other knobs we should change to make fdbserver use the memory as efficiently as possible?

Just curious: how would the OS page cache play out in this case, assuming that you did not increase FDB’s page caches explicitly?

By default it doesn’t (at least on Linux). FDB uses AIO which, IIRC, bypasses the OS page cache. So you could give 1TB of main memory to the process without seeing more caching. At least this is my understanding of the code, and this is why I am asking this question.

So we did set the two knobs I mentioned above (knob_page_cache_4k and knob_page_cache_64k) to much higher values. This turned out to be a horrible idea:

After setting the knobs we saw a cache hit rate of 99.99% and disk utilization went down dramatically. However, after a while CPU utilization went up by almost a factor of two and response times for read requests got worse by roughly a factor of two.

We investigated the cause with perf top and saw that roughly 50% of CPU time was spent in AsyncFileCached::truncate. This is one part of that code:

for (auto p = pages.begin(); p != pages.end();) {
	if (p->first >= pageOffset) {
		auto f = p->second->truncate();
		if (!f.isReady() || f.isError()) actors.push_back(f);
		// Advance before erasing so the loop iterator stays valid.
		auto last = p;
		++p;
		pages.erase(last);
	} else
		++p;
}

This means that whenever truncate is called, the storage node has to iterate through all cached pages (millions of them). Furthermore, the truncate call on each page potentially results in an allocation (as this might create another actor).
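
To put a rough number on "millions of them", here is a back-of-the-envelope sketch. The 64 GiB cache size and the assumption that it consists entirely of 4 KiB pages are made up for illustration (the actual knob values are not stated here), and `pagesInCache` is a hypothetical helper, not part of FDB:

```cpp
#include <cstdint>

// Hypothetical helper: how many fixed-size pages fit into a cache of cacheBytes.
constexpr std::int64_t pagesInCache(std::int64_t cacheBytes, std::int64_t pageBytes) {
    return cacheBytes / pageBytes;
}

constexpr std::int64_t GiB = 1024LL * 1024 * 1024;

// Assumed figures: a 64 GiB cache made up only of 4 KiB pages.
// That is roughly 16.8 million map entries visited by every full scan.
static_assert(pagesInCache(64 * GiB, 4096) == 16777216,
              "64 GiB / 4 KiB = 2^24 pages");
```

Under these assumptions every truncate call would walk about 16.8 million entries, regardless of how few pages are actually cut off.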

So for now we run with a small cache again. The conclusion is that FDB does not scale up for now. We plan to fix this ASAP, as we also can’t scale out much more in the short term.

I think @SteavedHams had noted before that the ssd2 storage engine calls truncate much more often than you’d expect as well.

Well, that’ll do it.

Is this a fix you can push back as a PR, or should we file an issue for it to be addressed separately? Even on a smaller page cache, this still sounds like quite a waste of CPU.

I agree, this has to be optimized no matter what. Even with a small (aka default-sized) cache AsyncFileCached::truncate uses between 8 and 12% of the CPU according to perf top.

I looked through the code and thought about a fix but I don’t think it will be trivial to fix (but also not super-hard). And I think we’ll be able to share that.

On the sqlite b-tree file (the file that’s going to be the largest and cause the most problems here), I think truncates are often done with a small number of pages. A sort of quick workaround may be to search the pages map for each page between your offset and the end of the file if the number of such pages is less than some reasonable threshold. If a proper, more general fix is going to be slower to put together, it may allow you to get an idea whether there’s anything else that’s going to be a problem after you resolve this.

The above approach will also avoid running through this loop when truncate is used to extend the file, which based on a quick reading of the code seems to be what it’s doing. I don’t know if truncate is used for this purpose in the ssd storage engine, but if it is then avoiding that could be an easy win.
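
A minimal sketch of this workaround, not the actual AsyncFileCached code. `Page`, `PageCache`, `lookupThreshold` and `entriesVisited` are hypothetical names, the threshold value is arbitrary, and the sketch assumes that no cached page lies beyond the previous end of the file:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a cached page; the real one may start async work.
struct Page {
    bool truncated = false;
    void truncate() { truncated = true; }
};

struct PageCache {
    std::unordered_map<std::int64_t, Page> pages; // keyed by page-aligned byte offset
    std::int64_t pageSize = 4096;                 // assumed page size
    std::int64_t fileSize = 0;                    // current logical file length
    std::size_t lookupThreshold = 1024;           // hypothetical cut-off, would need tuning
    std::size_t entriesVisited = 0;               // instrumentation for this sketch only

    // Round a byte offset up to the next page boundary.
    std::int64_t roundUp(std::int64_t n) const { return (n + pageSize - 1) / pageSize * pageSize; }

    void truncate(std::int64_t newSize) {
        assert(newSize >= 0);
        const std::int64_t oldSize = fileSize;
        fileSize = newSize;
        // Extending (or no-op): assumed that nothing is cached past the old end.
        if (newSize >= oldSize) return;

        const std::int64_t firstPage = roundUp(newSize); // first page fully past the new end
        const std::int64_t endPage = roundUp(oldSize);   // one past the last possible page
        const std::size_t candidates = static_cast<std::size_t>((endPage - firstPage) / pageSize);

        if (candidates < lookupThreshold) {
            // Few pages affected: one hash lookup per candidate offset.
            for (std::int64_t off = firstPage; off < endPage; off += pageSize) {
                ++entriesVisited;
                auto it = pages.find(off);
                if (it != pages.end()) {
                    it->second.truncate();
                    pages.erase(it);
                }
            }
        } else {
            // Many pages affected: fall back to scanning the whole map.
            for (auto it = pages.begin(); it != pages.end();) {
                ++entriesVisited;
                if (it->first >= firstPage) {
                    it->second.truncate();
                    it = pages.erase(it);
                } else {
                    ++it;
                }
            }
        }
    }
};
```

With 10,000 cached pages, cutting off the last three pages costs three lookups in this sketch instead of 10,000 visited entries, and extending the file costs none. A page that straddles the new end of the file is kept here; zeroing its tail is left out.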

As a historical note, the commented out code above the block you referenced was an implementation that used std::map instead of std::unordered_map. That allowed us to only touch the pages being truncated, but it of course made reads more expensive.
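
For comparison, a sketch of how an ordered map lets truncate touch only the affected pages. This is not the commented-out FDB code itself; `CachedPage` and `truncateOrdered` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a cached page.
struct CachedPage {};

// Drops every cached page at or beyond pageOffset and returns how many were dropped.
// lower_bound finds the first affected page in O(log n); only that tail is erased.
std::size_t truncateOrdered(std::map<std::int64_t, CachedPage>& pages, std::int64_t pageOffset) {
    assert(pageOffset >= 0);
    const std::size_t before = pages.size();
    pages.erase(pages.lower_bound(pageOffset), pages.end());
    return before - pages.size();
}
```

The trade-off is the one mentioned above: every read then pays an O(log n) tree lookup instead of an average O(1) hash lookup.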

Ah that makes sense. I’ll probably go for this simpler optimization for now - this is actually a good idea. We currently have too much on our plate to fix this properly and that way we can do this next month or so.

does not scale up how? FDB does an amazing job of scaling depending on your workload and how you tune the DB.

Markus has shown that a single fdbserver process cannot efficiently make use of more RAM, even if it is given more.

Scale up means utilizing bigger machines. Scale out means utilizing more machines. FDB currently can’t utilize a lot of memory for disk caching, therefore it can’t scale up, at least with the SSD storage engine.

FDB scales quite well in all directions on SSD - though we’ve found it’s better to run a lot more SSD procs of a reasonable memory footprint (16G or so) vs trying to have one super sized proc.

yes, what you are saying is that it works better if you scale out instead of trying to scale up - so I think we are in wild agreement here :wink:

Anyways, the simple solution proposed by @ajbeamon is implemented and currently on its way to a large test cluster. I will create a pull request as soon as we have some confidence that it actually works as intended (so far it looks good, but the cache is still cold, so we’ll wait another 12 hours or so).

Possibly, I think of it as scaling outwards as you scale up. We tend to drop incredibly large and powerful machines and then scale horizontally inside the machine to maximize the capacity and decrease the workloads per proc.

I understand, and we would like to do that as well, but we can’t. We use EBS and EBS is our main bottleneck. So we need to do two things:

  1. Increase number of IOPS per process as much as possible.
  2. Cache all hot data.

Both of these need somewhat bigger machines. There’s an upper limit on how many IOPS you get per machine, and this limit depends on both the type and the size of the EC2 instance. For caching, we need more memory (on our workload, a cache of 10GB brings down the number of IOPS consumed by about half).

Now we could use more storage processes, but there are two limitations:

  1. We need to make sure that no two VMs end up on the same physical machine. Using anti-placement and availability zones gets us only so far. Assigning machine ids is problematic as well, as it reduces the number of TLogs we can run. We could probably run in a somewhat more complex and heterogeneous setting to achieve this, but this gets complicated real quick.
  2. You can only add a certain number of processes to an FDB cluster until the cluster controller runs out of CPU. Adding clients also contributes to the CC CPU. We are starting to get close to that limit (our experiments show that you can get to around 1500 servers and clients without destabilizing the cluster). We have patches on their way to production that make this somewhat better, but we are still at a point where scaling out becomes problematic.

So now, if we pay for big machines anyways, we would like to use them as much as possible. We can’t use the CPU cores, but we want to use at least the memory. And FDB should be able to use the memory for caching. With the fix mentioned above (for which I will make a pull request hopefully today) we can scale up much better.

I hope these explanations help to understand our motivations better.

Ah, then I am sure Mike can tell you all about NVMes. :slight_smile:

IO is always the bandwidth cap. What class of machine are you using? We use NVMe in front of our EBS volumes as a Write-Through/Read-Only cache. AWS does not differentiate between reads/writes, so we can effectively hit around 99% cache hit rates (1 NVMe to 1 EBS across an LVM RAID) with around 16 procs at 16G of memory each. This opens the floodgates for writes, as all of your reads are hitting instance stores and no longer counting against your EBS volumes’ IO.

We are aware of what you are doing :wink:

We thought about using an NVMe cache through bcache as well. However, we can get a cache hit rate of more than 99.99% by using main memory (we validated this in production but had to roll back because of the truncate problem). So we currently don’t need to use ephemeral disks for that. We only need a simple code change to make this work.

Sounds reasonable: if the code fix works then it works. It may also work well to combine multiple paradigms to get the most bang for your buck, haha. There is always a CPU cost when pushing data into memory (or at least that’s what we’ve found). If you have the memory to spare (we do not), it may work quite well in your design.

@markus.pilman I saw the pull request, do you have any details about how much it helped?