Block storage: Full fault tolerance for "legacy" applications

So you have an application that can run in a container or virtual machine. It’s stateful - maybe it has a Postgres or MySQL database embedded in it. Or both, probably. You got it from a vendor or it was built by someone who left your company years ago. It works fine, and you don’t anticipate needing to scale it beyond the resources you can give a single VM. You certainly aren’t interested in porting the application to use a different database. But you don’t want to worry about operating it and you want it to keep working despite machine or even datacenter failures.

So run it in a virtual machine which mounts a network block device (e.g. using Linux’s NBD protocol). Connect it to a layer like https://github.com/spullara/nbd that implements NBD on top of FoundationDB. And then connect it to a FoundationDB cluster with the fault tolerance properties you want.
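The core mapping such a layer implements is simple: carve the virtual disk into fixed-size blocks and store each block as one key-value pair. Here is a toy sketch of that idea (a plain dict stands in for the FoundationDB keyspace, and the class and key layout are illustrative, not taken from either linked project):

```python
BLOCK_SIZE = 4096  # block layers typically store fixed-size 4 KiB blocks

class BlockStore:
    """Toy stand-in for an NBD-on-FoundationDB layer: each 4 KiB block is
    stored under a key derived from the volume name and block index. A real
    layer would perform these reads and writes inside FDB transactions."""

    def __init__(self, volume):
        self.volume = volume
        self.kv = {}  # stand-in for the FoundationDB keyspace

    def _key(self, block_index):
        return (self.volume, block_index)

    def write(self, offset, data):
        # NBD requests are assumed here to be block-aligned for simplicity;
        # a real layer must handle partial-block reads/writes too.
        assert offset % BLOCK_SIZE == 0 and len(data) % BLOCK_SIZE == 0
        for i in range(len(data) // BLOCK_SIZE):
            block = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            self.kv[self._key(offset // BLOCK_SIZE + i)] = block

    def read(self, offset, length):
        assert offset % BLOCK_SIZE == 0 and length % BLOCK_SIZE == 0
        out = b""
        for i in range(length // BLOCK_SIZE):
            # Unwritten blocks read back as zeros, like a sparse disk.
            out += self.kv.get(self._key(offset // BLOCK_SIZE + i),
                               b"\x00" * BLOCK_SIZE)
        return out
```

Because each block write becomes an ordinary transactional KV write, the volume inherits whatever durability and replication the underlying FoundationDB cluster is configured with.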

There are a few things needed to fully realize this vision:

  1. The NBD layer needs to be made rock solid. For example, it should validate its lease in every transaction.

  2. I think the best leasing strategy is that the last attempt to mount a volume always succeeds (while ensuring that any subsequent write attempts by the previous owner will block forever or terminate its VM). The idea is that if the application could survive being rebooted (it uses fsync() correctly where necessary), it can survive being replaced this way. The health of the application is then monitored externally by orchestration software; if it fails, a new instance is started, mounts the volume, and proceeds.

  2b. An alternative design is that an attempt to mount a volume gets in line and waits for any existing mounter to either exit or fail to update a lease key in a timely way. You still ensure that in the latter case the original mounter is prevented from making writes. Orchestration software always keeps an extra instance around trying to take over, and when it detects a health problem with the existing instance it tries to shut it down. I think this is a less reliable approach.

  3. FoundationDB should support satellite replication (coming in 6.0?) so that you can do practical, performant multi-region fault tolerance. Then failover of such legacy applications can be done automatically and safely even in most region-failure scenarios. This is where the benefit of this strategy over existing cloud block stores fully pays off.

  4. Maybe FoundationDB’s storage engine (or a storage engine) could do a more efficient job of storing lots of exactly 4 KiB values.

  5. This solution as a whole should be packaged up and integrated with cluster management software, so that it is easy to install a legacy application and get this extremely high level of fault tolerance automatically.

  6. It should be possible to extend this vision to “clusters” of legacy virtual machines as well.
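The last-mount-wins leasing described in (2) can be sketched as follows. This is a toy model, not real FDB code: a dict and an integer stand in for the database and the lease key, and the names (`Volume`, `epoch`, `StaleLease`) are hypothetical. The point it illustrates is that every write validates the mounter's lease, so a superseded owner's writes fail after a takeover:

```python
class StaleLease(Exception):
    """Raised when a write arrives from an instance that has been superseded."""
    pass

class Volume:
    """Toy model of last-mount-wins leasing: mounting bumps an epoch, and
    every write checks the writer's epoch against the current one. In a real
    layer, the check and the block write happen in one FDB transaction, so
    the conflict is detected atomically at commit time."""

    def __init__(self):
        self.epoch = 0   # stand-in for a lease key in FoundationDB
        self.blocks = {}

    def mount(self):
        # The latest mount attempt always wins by incrementing the epoch;
        # the returned value is the new owner's lease token.
        self.epoch += 1
        return self.epoch

    def write(self, owner_epoch, block_index, data):
        if owner_epoch != self.epoch:
            raise StaleLease("a newer instance has mounted this volume")
        self.blocks[block_index] = data

vol = Volume()
old = vol.mount()
vol.write(old, 0, b"from old owner")   # succeeds while the lease is current
new = vol.mount()                      # orchestration replaces a failed instance
vol.write(new, 0, b"from new owner")   # new owner proceeds normally
```

After the second mount, any `vol.write(old, ...)` raises `StaleLease`, which is the property that makes the takeover safe: the old instance can never corrupt the volume, only fail.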

Thoughts?


I’ll take some of this commentary as feature requests.

Ten years ago I took SIMH (an emulator for old minicomputers like the PDP-11 and VAX) and, using .NET, had its storage backend use Microsoft SQL Server for block access. It was great because I could then magically have a highly available legacy environment with SQL replication.

But with things being far more open these days, this is the perfect answer to block-level replication of VMs and other block devices.

Now Qemu+KVM and the recent FoundationDB release make this kind of thing far more practical than it was back then. I should have written more about it 10+ years ago, but seeing this shows how awesome it is!

I’ve just released a prototype of a FoundationDB block device written in Go: https://github.com/meln1k/foundationdb-block-device

@spullara your nbd project inspired me to implement it as a block device in user space 🙂


Super cool! Even more fault tolerant, since you don’t have a single network connection to the NBD server. Well done.