O_DIRECT and what to do on filesystems that don't support it

(Alex Miller) #1

#274 and #842 are variations of the same issue of FDB requiring O_DIRECT, and some filesystems not supporting the option.

I saw mention that Wavefront has already had to work around this. @killertypo or @mrz, would you happen to be able to clarify what you ended up running internally to work around this?

(Clement Pang) #2

@alexmiller we just did the following:

--- flow/
+++ flow/
@@ -35,11 +35,11 @@
 			open_filename = filename + ".part";
-		int fd = ::open( open_filename.c_str(), openFlags(flags) | O_DIRECT, mode );
+		int fd = ::open( open_filename.c_str(), openFlags(flags) | O_SYNC, mode );
 		if (fd<0) {
 			Error e = errno==ENOENT ? file_not_found() : io_error();
 			TraceEvent("AsyncFileKAIOOpenFailed").detail("Filename", filename).detailf("Flags", "%x", flags)
-				.detailf("OSFlags", "%x", openFlags(flags) | O_DIRECT).detailf("mode", "0%o", mode).error(e).GetLastError();
+				.detailf("OSFlags", "%x", openFlags(flags) | O_SYNC).detailf("mode", "0%o", mode).error(e).GetLastError();
 			return e;
 		} else {

That seems to make it happy; this may not be the right thing to do, of course. :slight_smile:

(Clement Pang) #3

I should stress that we tested it and it works on ZFS but we don’t have any production workloads on ZFS yet.

(David Scherer) #4

O_SYNC really doesn’t make any sense, unless the filesystem is doing something strange with that flag. We call fsync() when we want data to be flushed to disk.

There are two reasons we use O_DIRECT:

  1. Most Linux filesystems don’t properly support kernel async I/O without O_DIRECT. For example, when a read can be satisfied from the page cache they will block in io_submit() while copying the data. This prevents keeping multiple I/O requests outstanding and absolutely kills performance. If ZFS supports async I/O properly without O_DIRECT (this seems to imply that it does, and furthermore supports async fsync which would also probably be worth enabling) then this one is a non-issue for you.

  2. It’s a waste of memory and memory bandwidth to have two levels of page caching (FDB’s internal cache and the operating system page cache). This could maybe be mitigated to some extent by decreasing the size of one or the other.

There’s also some reason to think that copy-on-write filesystems are not optimal underneath B-trees (it’s another case of several layers doing the same thing: the B-tree pager, the filesystem, and the SSD firmware all expose update-in-place interfaces while doing copy-on-write underneath, to the detriment of performance and flash lifetime). But this is probably a question for empirical benchmarking.

(Clement Pang) #5

Yeah, O_SYNC on ext4 would be bad, but I am not sure we saw a huge performance degradation with O_SYNC on ZFS (it seems ZFS with O_SYNC would still flush on every write, though, so I would assume it would be a lot worse than O_DIRECT + write_cache + occasional fsync).

(A.J. Beamon) #6

It looks like an implementation of O_DIRECT support has recently been merged in ZFS, so we may have something to look forward to there.


As mentioned in my issue, requiring O_DIRECT is problematic for FUSE volumes (which includes Docker for Mac, possibly Docker for Desktop on Windows). This is in a development environment where performance is not a priority.

(Alex Miller) #8

If there were an alternative whose only penalty was cache pollution, it would open the door to automatically falling back when O_DIRECT is not supported and logging a loud warning in some fashion. That appears not to be the case, so I think you’ll get your flag.

(Clement Pang) #9

Yeah, internally that’s what we have been saying (ZFS 0.8 will have O_DIRECT).