What knobs should we set to increase cache efficiency

markus.pilman · September 20, 2018, 6:16pm

We are currently in the process of resizing our biggest FDB cluster. We will scale out (adding more hosts) and scale up (running on bigger machines).

Now the question we have is about main memory usage. Each storage process will have around 80GB of main memory at its disposal. However, I do not believe that the storage will use this memory efficiently by default.

So we plan to set the following knobs:

knob_page_cache_4k=50GB
knob_page_cache_64k=5GB
memory=80GB

Does this sound reasonable? Are there other knobs we should change to make fdbserver use the memory as efficiently as possible?

gaurav · September 21, 2018, 2:32am

just curious - how would OS page cache play out in this case, assuming that you did not increase fdb’s page caches explicitly?

markus.pilman · September 21, 2018, 2:35am

By default it doesn’t (at least on Linux). FDB uses AIO which, IIRC, bypasses the OS page cache. So you could give a 1TB of main memory to the process without seeing more caching. At least this is my understanding of the code and this is why I am asking this question.

markus.pilman · September 28, 2018, 5:47pm

so we did set up the two knobs I mentioned above (knob_page_cache_4k and knob_page_cache_64k) to a much higher value. This turned out to be a horrible idea:

After setting the knob we saw a cache hit rate of 99.99% and disk utilization went down dramatically. However: After a while CPU utilization went up by almost a factor of two and response time for read requests went down by roughly a factor of two.

We investigated the cause with perf top and saw that roughly 50% of CPU time was spent in AsyncFileCached::truncate. This is one part of that code:

for (auto p = pages.begin(); p != pages.end();) {
	if (p->first >= pageOffset) {
		auto f = p->second->truncate();
		if (!f.isReady() || f.isError()) actors.push_back(f);
		auto last = p;
		++p;
		pages.erase(last);
	} else
		++p;
}

This means, that whenever truncate is called, the storage node has to iterate through all cached pages (millions of them). Furthermore the truncate call to each page potentially results in an allocation (as this might create another actor).

So for now we run with a small cache again. The conclusion is, that FDB is does not scale up for now. We plan to fix this ASAP as we also can’t scale out much more in the short term.

alexmiller · September 28, 2018, 7:02pm

I think @SteavedHams had noted before that the ssd2 storage engine calls truncate much more often than you’d expect as well.

Well, that’ll do it.

Is this a fix you can push back as a PR, or should we file an issue for it to be addressed separately? Even on a smaller page cache, this still sounds like quite a waste of CPU.

markus.pilman · September 28, 2018, 9:08pm

I agree, this has to be optimized no matter what. Even with a small (aka default-sized) cache AsyncFileCached::truncate uses between 8 and 12% of the CPU according to perf top.

I looked through the code and thought about a fix but I don’t think it will be trivial to fix (but also not super-hard). And I think we’ll be able to share that.

ajbeamon · September 29, 2018, 2:30am

On the sqlite b-tree file (the file that’s going to be the largest and cause the most problems here), I think truncates are often done with a small number of pages. A sort of quick workaround may be to search the pages map for each page between your offset and the end of the file if the number of such pages is less than some reasonable threshold. If a proper, more general fix is going to be slower to put together, it may allow you to get an idea whether there’s anything else that’s going to be a problem after you resolve this.

The above approach will also avoid running through this loop when truncate is used to extend the file, which based on a quick reading of the code seems to be what it’s doing. I don’t know if truncate is used for this purpose in the ssd storage engine, but if it is then avoiding that could be an easy win.

As a historical note, the commented out code above the block you referenced was an implementation that used std::map instead of std::unordered_map. That allowed us to only touch the pages being truncated, but it of course made reads more expensive.

markus.pilman · October 2, 2018, 3:27pm

Ah that makes sense. I’ll probably go for this simpler optimization for now - this is actually a good idea. We currently have too much on our plate to fix this properly and that way we can do this next month or so.

killertypo · October 2, 2018, 10:51pm

does not scale up how? FDB does an amazing job of scaling depending on your workload and how you tune the DB.

gaurav · October 3, 2018, 2:35am

Markus has shown that a single instance of FDBServer process cannot efficiently make use of more RAM, if given.

markus.pilman · October 3, 2018, 5:48pm

Scale up means utilizing bigger machines. Scale out means utilizing more machines. FDB currently can’t utilize of a lot of memory for disk caching, therefore it can’t scale up at least with the SSD storage engine.

killertypo · October 3, 2018, 7:50pm

FDB scales quite well in all directions on SSD - though we’ve found it’s better to run a lot more SSD procs of a reasonable memory footprint (16G or so) vs trying to have one super sized proc.

markus.pilman · October 3, 2018, 11:47pm

yes, what you are saying is that it works better if you scale out instead of trying to scale up - so I think we are in wild agreement here

Anyways, the simple solution proposed by @ajbeamon is implemented and currently on it’s way to a large test-cluster. I will create a pull request as soon as we have some confidence that it actually works as intended (so far it looks good, but the cache is still cold, so we’ll wait another 12 hours or so)

killertypo · October 4, 2018, 2:53pm

Possibly, I think of it as scaling outwards as you scale up. We tend to drop incredibly large and powerful machines and then scale horizontally inside the machine to maximize the capacity and decrease the workloads per proc.

markus.pilman · October 4, 2018, 4:02pm

I understand, and we would like to do that as well, but we can’t. We use EBS and EBS is our main bottleneck. So we need to do two things:

Increase number of IOPS per process as much as possible.
Cache all hot data.

Both of these need somewhat bigger machines. There’s an upper limit of how many IOPS you get per machine and this limit depends on both, the type and the size of the EC2 instance. For caching, we need more memory (a cache of 10GB brings down the number of IOPS consumed by about half - on our workload).

Now we could use more storage processes, but there are two limitations:

We need to make sure that no two VMs end up on the same physical machine. Using anti-placement and availability zones brings us only so far. Assigning machine ids is problematic as well, as it reduces the number of TLogs we can run. We probably could run in a bit more complex and heterogeneous setting to achieve this, but this get complicated real quick.
You can only add a certain number of processes to an FDB cluster until the cluster controller runs out of CPU. Adding clients also contributes to the CC CPU. We are starting to get close to that limit (our experiments show that you can get around ~1500 servers and clients without destabilizing the cluster). We have patches on our way to production that make this somewhat better, but we are still at a point where scaling out becomes problematic.

So now, if we pay for big machines anyways, we would like to use them as much as possible. We can’t use the CPU cores, but we want to use at least the memory. And FDB should be able to use the memory for caching. With the fix mentioned above (for which I will make a pull request hopefully today) we can scale up much better.

I hope these explanations help to understand our motivations better.

panghy · October 4, 2018, 7:01pm

Ah, then I am sure Mike can tell you all about NVMes.

killertypo · October 4, 2018, 8:06pm

IO is always the bandwidth cap, what class of machine are you using? We use NVMe in front of our EBS volumes as a Write-Through/Read-Only cache. AWS does not differentiate between reads/writes so we can effectively hit around 99% cache hit rates (1NVME to 1EBS across an LVM RAID) with around 16 procs at 16G of memory per. This opens the floodgates for writes as all of your reads are hitting instance stores and no longer counting against your EBS volumes IO.

markus.pilman · October 4, 2018, 8:41pm

We are aware of what you are doing

We thought about using a NVME-cache through bcache as well. However, we can get a cache-hit rate of more than 99.99% by using main memory (we validated this in production but had to roll back because of the truncate problem). So we currently don’t need to use ephemeral disks for that. We only need a simple code change to make this work.

killertypo · October 4, 2018, 11:24pm

sounds reasonable, if the codefix works then it works - may work well to be able to utilize multiple paradigms to get the most bang for your buck as well haha. There is always a cost component on CPU when pushing it into memory (or at least we’ve found). If you have the memory to spare (we do not) it may work quite well in your design.

ajbeamon · October 5, 2018, 9:02pm

@markus.pilman I saw the pull request, do you have any details about how much it helped?

Topic		Replies	Views
Some Clarification on Storage Engine and Disk/IO Using FoundationDB	12	2313	July 23, 2019
FoundationDB 7.1.24 - the memory usage after clean startup of fdbserver process is too high Using FoundationDB	10	828	April 22, 2024
Cluster tuning cookbook Using FoundationDB	26	8859	February 1, 2019
Foundationdb 6.2 - fdbserver going out of memory Using FoundationDB	9	1028	April 23, 2020
Production optimizations Using FoundationDB	20	6440	August 15, 2018

What knobs should we set to increase cache efficiency

Related topics