Allow range reads without cache pollution

(Ryan Worl) #1

Since FDB uses async IO and has its own page cache, it could expose an API option that, for a given range read, doesn’t cache any of the disk pages read during the scan.

This is a “dangerous” feature if you don’t know what you’re doing, but it also has the potential to allow background workloads to cooperate efficiently with interactive workloads.

The reasoning is that the working set of an application will usually be in cache under normal circumstances, but if a large background range scan comes along and reads at least as many pages as the page cache can hold, you’ve thrown away your entire working set.
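To make the eviction effect concrete, here is a minimal sketch using a toy LRU cache. It is not FDB's page cache; the `LRUPageCache` class, the `cache` flag, and the page numbers are all hypothetical and only illustrate what a "don't cache this scan" hint would change.

```python
from collections import OrderedDict


class LRUPageCache:
    """Toy LRU page cache (illustration only, not FDB's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> True, oldest entry first
        self.disk_reads = 0

    def read(self, page_id, cache=True):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # cache hit: mark as recently used
            return
        self.disk_reads += 1  # cache miss: page comes from disk
        if not cache:
            return  # hypothetical hint: serve the read, leave the cache untouched
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page


def surviving_working_set(cache_scan_pages):
    """Warm a 100-page working set, run a 200-page scan, count survivors."""
    cache = LRUPageCache(capacity=100)
    working_set = range(100)
    for page in working_set:
        cache.read(page)
    for page in range(1000, 1200):  # background range scan over other pages
        cache.read(page, cache=cache_scan_pages)
    return sum(1 for page in working_set if page in cache.pages)


print(surviving_working_set(True))   # 0: the scan flushed the whole working set
print(surviving_working_set(False))  # 100: the working set is still cached
```

In this toy model the scan costs the same number of disk reads either way; the only difference is whether the interactive workload has to reload its pages afterwards.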

This is relevant for certain existing workloads like performing backups, but may be useful in more places.

Does this already exist, and if not, does it seem like a worthwhile feature to implement?

(gaurav) #2

This seems like a good control to have when implementing multiple diverse application layers on top of a shared fdb instance.

In addition to what @ryanworl asked, are pages that get written also kept in fdb’s page cache? If so, it may be useful to provide a way to not do that (at a per-transaction level).

For instance, I am planning to run two different applications on a single fdb instance - (a) a read heavy OLTP kind of layer and (b) a write heavy metric timeseries storage layer.

The metric layer will be made up of 99% writes and the OLTP layer will be made up of 99% reads; it is probably desirable that the metric layer does not disturb the fdb page cache (so that the cache can be more useful for reads from the OLTP application).

(Ryan Worl) #3

My reading of AsyncFileCached::read_write_impl makes me think yes, writes get cached.

(Alec Grieser) #4

One other consideration: I think you probably want this solution to distinguish between leaf pages and interior pages in the B-tree. I think you want your large background range scan to continue to place interior pages in the cache as (1) that scan is probably going to need those pages soon anyway and (2) so will other users. The worst case scenario (with caching disabled on interior pages) would be something like your background scan keeps making incremental progress, but it also keeps loading all of the pages it needs from disk (including, say, the second level of the B-tree) and thus consuming an outsized portion of disk resources. I suppose this wouldn’t matter on an infinitely parallelized I/O system with async I/O, but, like, that’s not most systems. In the case of a write-heavy workload blowing away the cache (as @gaurav brings up), not caching interior pages could also dramatically slow down the write bandwidth of a single storage server, which could lead to it falling behind and then a fair number of other performance issues.
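The trade-off above can be sketched with a toy three-level tree (root, interior pages, leaves) in front of a small LRU cache. Everything here is an assumption made for illustration: the fanout, the cache size, the policy names, and the `run_scan` helper do not correspond to anything in FDB's storage engine.

```python
from collections import OrderedDict

FANOUT = 10  # hypothetical number of leaves under each interior page


def path_to_leaf(leaf):
    """Pages touched to reach a leaf: root, one interior page, the leaf."""
    return [("interior", "root"), ("interior", leaf // FANOUT), ("leaf", leaf)]


def run_scan(policy, capacity=50, hot_leaves=20, scan_leaves=200):
    """Return (disk reads during the scan, hot leaves still cached afterwards).

    policy is "all", "none", or "interior_only" and applies to the scan only.
    """
    cache = OrderedDict()  # page -> True, least recently used first
    disk_reads = 0

    def read(page, may_cache):
        nonlocal disk_reads
        if page in cache:
            cache.move_to_end(page)
            return
        disk_reads += 1
        if may_cache:
            cache[page] = True
            if len(cache) > capacity:
                cache.popitem(last=False)

    # Warm the cache with the interactive workload's working set.
    for leaf in range(hot_leaves):
        for page in path_to_leaf(leaf):
            read(page, True)

    disk_reads = 0  # only count reads issued by the background scan
    for leaf in range(1000, 1000 + scan_leaves):
        for page in path_to_leaf(leaf):
            is_interior = page[0] == "interior"
            may_cache = policy == "all" or (policy == "interior_only" and is_interior)
            read(page, may_cache)

    hot_left = sum(1 for leaf in range(hot_leaves) if ("leaf", leaf) in cache)
    return disk_reads, hot_left


for policy in ("all", "none", "interior_only"):
    print(policy, run_scan(policy))
# all (220, 0)
# none (400, 20)
# interior_only (220, 20)
```

In this model, caching nothing nearly doubles the scan's disk reads because every lookup reloads its interior page, while caching everything evicts the working set. Caching only interior pages keeps the scan as cheap as the "all" policy and leaves the hot leaves in place.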

Page cache maintenance is a pretty fundamental part of database system design, so there must be papers on this somewhere. Something like, “Buffer cache maintenance in the presence of full table scans”.

(Ryan Worl) #5

The search terms I’ve found relevant are “sequential flooding” and “priority hints”.

(Ryan Worl) #6

Also, I think it would be reasonable for the storage engine to take the client request of “don’t cache this range read” and convert that to “in my b-tree don’t cache leaf pages” unless I’ve gotten my layering wrong. Same for write caching.

It would require changing the interface for IKeyValueStore and IAsyncFile though.