O_DIRECT and what to do on filesystems that don't support it

#274 and #842 are variations of the same issue of FDB requiring O_DIRECT, and some filesystems not supporting the option.

I saw mention that Wavefront has already had to work around this. @killertypo or @mrz, would you happen to be able to clarify what you ended up running internally to work around this?

@alexmiller we just did the following:

===================================================================
--- flow/AsyncFileKAIO.actor.h
+++ flow/AsyncFileKAIO.actor.h
@@ -35,11 +35,11 @@
 			open_filename = filename + ".part";
 		}
 
-		int fd = ::open( open_filename.c_str(), openFlags(flags) | O_DIRECT, mode );
+		int fd = ::open( open_filename.c_str(), openFlags(flags) | O_SYNC, mode );
 		if (fd<0) {
 			Error e = errno==ENOENT ? file_not_found() : io_error();
 			TraceEvent("AsyncFileKAIOOpenFailed").detail("Filename", filename).detailf("Flags", "%x", flags)
-				.detailf("OSFlags", "%x", openFlags(flags) | O_DIRECT).detailf("mode", "0%o", mode).error(e).GetLastError();
+				.detailf("OSFlags", "%x", openFlags(flags) | O_SYNC).detailf("mode", "0%o", mode).error(e).GetLastError();
 			return e;
 		} else {
 			TraceEvent("AsyncFileKAIOOpen")

That seems to make it happy, though it may not be the right thing to do, of course. :slight_smile:

I should stress that we tested it and it works on ZFS but we don’t have any production workloads on ZFS yet.

O_SYNC really doesn’t make any sense, unless the filesystem is doing something strange with that flag. We call fsync() when we want data to be flushed to disk.
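
For illustration, here is a minimal sketch (not FDB code; the helper name and mode are made up) of the buffered-write-plus-fsync pattern described above: writes land in the OS page cache, and a single fsync() at the commit point makes the whole batch durable, whereas O_SYNC would force every individual write to reach stable storage before returning.

    // Sketch only, not FDB code: buffered writes with one fsync() at the
    // commit point. With O_SYNC in the open() flags, every write() below
    // would instead block until the data reached stable storage.
    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    int write_batch_then_commit(const char* path, const char* const* pages,
                                const size_t* lens, int n) {
        int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return -1;
        for (int i = 0; i < n; i++) {
            // Buffered by the page cache; returns without waiting for the disk.
            if (::write(fd, pages[i], lens[i]) < 0) { ::close(fd); return -1; }
        }
        int rc = ::fsync(fd); // single durability barrier for the whole batch
        ::close(fd);
        return rc;
    }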

There are two reasons we use O_DIRECT:

  1. Most Linux filesystems don’t properly support kernel async I/O without O_DIRECT. For example, when a read can be satisfied from the page cache they will block in io_submit() while copying the data. This prevents keeping multiple I/O requests outstanding and absolutely kills performance. If ZFS supports async I/O properly without O_DIRECT (this seems to imply that it does, and furthermore supports async fsync, which would also probably be worth enabling) then this one is a non-issue for you. (A minimal AIO sketch follows this list.)

  2. It’s a waste of memory and memory bandwidth to have two levels of page caching (FDB’s internal cache and the operating system page cache). This could maybe be mitigated to some extent by decreasing the size of one or the other.
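
To make point 1 concrete, here is a minimal sketch (not FDB code; the filename, queue depth, and 4 KiB alignment are assumptions) of a single kernel AIO read through libaio. With O_DIRECT, io_submit() queues the aligned request and returns quickly, so the caller can keep many requests in flight; without it, a cache-hit read is copied synchronously inside io_submit() on most filesystems.

    // Sketch only, not FDB code: one kernel AIO read with O_DIRECT via libaio.
    // Build with -laio. "data.bin" and the 4 KiB sizes are assumptions.
    #define _GNU_SOURCE // for O_DIRECT on glibc
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        // O_DIRECT requires the buffer, offset, and length to be block-aligned.
        void* buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        io_context_t ctx = 0;
        if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

        // Queue one read. With O_DIRECT this submits to the device and returns;
        // the caller is free to submit more requests before any complete.
        struct iocb cb;
        struct iocb* cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

        // Reap the completion; a real engine would poll this alongside other work.
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }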

There’s also some reason to think that copy-on-write filesystems are not optimal under btrees (it’s another case of several layers doing the same thing, with the btree pager, the filesystem, and the SSD firmware all exposing update-in-place interfaces while doing copy on write underneath, to the detriment of performance and flash lifetime). But this is probably a question for empirical benchmarking.

Yeah, O_SYNC on ext4 would be bad, but I am not sure we saw a huge degradation in performance with O_SYNC on ZFS (it seems like ZFS with O_SYNC would still flush on every write, though, so I would assume it would be a lot worse than O_DIRECT + write cache + occasional fsync).

It looks like an implementation for O_DIRECT support has been merged recently in ZFS, so we may have something to look forward to there:

https://github.com/zfsonlinux/zfs/issues/224

As mentioned in my issue, requiring O_DIRECT is problematic for FUSE volumes (which includes Docker for Mac, possibly Docker for Desktop on Windows). This is in a development environment where performance is not a priority.

If an alternative existed where the only penalty was cache pollution, it would open the door to automatically falling back when O_DIRECT is not supported and logging a loud warning in some fashion. That appears not to be the case, so I think you’ll get your flag.
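
For reference, such a fallback at open() time could look roughly like the sketch below (not the actual FDB code path; the EINVAL check and the warning text are assumptions). As the reasons above explain, dropping O_DIRECT also reintroduces the io_submit() blocking problem, so a fallback like this would only ever be acceptable for development setups such as the FUSE case.

    // Sketch only, not the actual FDB change: prefer O_DIRECT, and fall back
    // to cached I/O with a loud warning if the filesystem rejects the flag
    // (typically with EINVAL).
    #define _GNU_SOURCE // for O_DIRECT on glibc
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>

    int open_preferring_direct(const char* path, int flags, mode_t mode) {
        int fd = open(path, flags | O_DIRECT, mode);
        if (fd < 0 && errno == EINVAL) {
            fprintf(stderr, "WARNING: %s does not support O_DIRECT; "
                            "falling back to cached I/O\n", path);
            fd = open(path, flags, mode);
        }
        return fd;
    }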


Yeah, internally that’s what we have been saying (ZFS 0.8 will have O_DIRECT).

Are there any plans for supporting block volumes instead of filesystems? They may improve performance and reliability and could be an alternative to file I/O with O_DIRECT.

Not in the near future. Also, I don’t think anyone is going to implement this for the current ssd storage engine.

However, I believe @SteavedHams was talking about that for Redwood at one time. I don’t know whether there are specific plans.

From what some of us have measured empirically, compared to ext4 and ZFS there would be a performance improvement in using a block device directly, if for no other reason than that io_submit() tends to block briefly most of the time. This is bad for FDB since it blocks the main thread. Using AIO on a block device directly has the least chance of blocking.

Redwood stores its entire state in a single file, tracks its own free space internally, and reads/writes using a configurable block size, so it should be able to use a block device directly.
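
As a rough illustration of why that is plausible (not Redwood code; the device path is just an example), a block device can be treated as a fixed-size, update-in-place file whose only metadata is its size, which can be queried with the BLKGETSIZE64 ioctl:

    // Sketch only, not Redwood code: a block device as a fixed-size file.
    // An engine that manages its own free space and block size needs only
    // the device size in bytes from the "filesystem".
    #define _GNU_SOURCE // for O_DIRECT on glibc
    #include <fcntl.h>
    #include <linux/fs.h> // BLKGETSIZE64
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT); // example device path
        if (fd < 0) { perror("open"); return 1; }

        uint64_t size_bytes = 0;
        if (ioctl(fd, BLKGETSIZE64, &size_bytes) < 0) { perror("ioctl"); return 1; }
        printf("device provides %llu bytes of update-in-place storage\n",
               (unsigned long long)size_bytes);

        close(fd);
        return 0;
    }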

While it wouldn’t be too difficult to try this out with some test code, having FDB storage servers use block devices is going to take a lot more effort. Off the top of my head, the FDB worker would no longer be able to:

  • get a list of storage engines and their storage server IDs (currently in the filename) on a host
  • instantiate more than one storage engine at a time during storage engine migrations (unless it has multiple block devices available to use, but that would be wasteful in the normal case)

There may be other solutions for storing IDs, e.g. storing a small amount of metadata in the filesystem while keeping the key-value storage itself on a block device.

It is not a big problem to have several block devices during maintenance. LVM can be used for this purpose: it allows allocating new logical block volumes and deleting them. LVM is also useful for RAID, striping, snapshots, and so on.

I think the main advantage of block devices is not wasting extra memory on the filesystem cache. OS behavior becomes unpredictable when a large volume of IOPS flows through that cache.

Yes, that is one way of solving this issue.

LVM itself comes with quite a bit of overhead if you use those features (with snapshots, writes will be at least twice as expensive), so you’d probably rather pay for the overhead of having a filesystem.

This is what O_DIRECT gives us: with O_DIRECT the filesystem cache is bypassed. Typically the main benefit of using a block device is that you get more fine-grained control (and can therefore control things like fragmentation better). In the days of modern SSDs, I am not convinced the benefits of using a block device warrant the engineering effort required to make this work. Another problem is that using block devices makes virtualization and containerization harder (by how much I don’t know). So in a world where people want to run FDB in the cloud, this is a serious drawback.