Configure fdb to write to disk only once every 10 mins

I am trying to set up an FDB cluster in such a way that it can handle 1M small writes per second.

Is it possible to configure the cluster in such a way so that:

  1. writes, by default, go only into memory, and not into SSD
  2. every 10 mins or so, all transactions get flushed out to SSD

So I’m okay with a world where a crash results in losing 10 minutes’ worth of transactions, as long as each transaction is all-or-nothing. I am willing to make this tradeoff in exchange for more performance.

There is a “memory” storage engine where FDB persists all writes to an on-disk WAL, but the actual database is kept in memory. This may be faster for writes, though I’m not sure.
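For reference, switching to the memory engine is done with `fdbcli`’s `configure` command. The redundancy mode below (`single`) is just an example; use whatever your cluster already runs.

```shell
# Inside fdbcli, against a running cluster:
# switch the storage engine to the in-memory engine
# (writes still go through the on-disk WAL for durability).
configure single memory
```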

In theory (though I’m not sure whether it’s possible in practice with fdb), I think we can go a step further.

Within a 10-minute block, only write to memory and don’t touch the on-disk WAL. Every 10 minutes, compact all the transactions/diffs within the block and write them out to the on-disk WAL. (So if a particular hot key is written to 100 times in the 10-minute block, we only write the last entry.)
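The compaction idea above isn’t an FDB feature, but the last-write-wins part can be sketched on the client side: buffer writes in memory and flush only the final value per key once per window. This is a minimal illustration, not FDB code; the class name and the `sink` callable are assumptions.

```python
import time

class CoalescingBuffer:
    """Buffer writes in memory; flush only the last value per key."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.pending = {}                  # key -> last value in this window
        self.last_flush = time.monotonic()

    def write(self, key, value):
        # Later writes to the same key overwrite earlier ones, so a key
        # written 100 times in a window is flushed only once.
        self.pending[key] = value

    def maybe_flush(self, sink):
        if time.monotonic() - self.last_flush >= self.window:
            self.flush(sink)

    def flush(self, sink):
        # sink is any callable taking the final key/value pairs, e.g. a
        # function that commits them all in one FDB transaction.
        sink(dict(self.pending))
        self.pending.clear()
        self.last_flush = time.monotonic()

buf = CoalescingBuffer(window_seconds=600)
for i in range(100):
    buf.write(b"hot-key", str(i).encode())
buf.flush(lambda batch: print(len(batch)))  # only one entry survives
```

Note this only coalesces writes; it does not give the all-or-nothing guarantee per original transaction unless the flush itself is a single transaction.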

Is there some way to do to this in fdb? Where I’m explicitly saying: it’s okay to lose the last 10 minutes of writes, as long as each transactions are all-or-nothing.

You might be able to achieve this at the filesystem level: how about putting the log and storage process data directories on a ZFS dataset with sync=disabled, and raising ZFS’s flush timeout from the default 5 seconds to 10 minutes? This would be completely transparent to fdb: when fdb calls fsync it returns immediately, but in reality the data is only in memory. Not sure how it will behave after a crash, though.
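Concretely, that would look something like the following (the dataset name `tank/fdb` is a placeholder; the module parameter path applies to OpenZFS on Linux):

```shell
# Make fsync a no-op on this dataset: fdb's fsync calls return
# immediately while data sits in the in-memory transaction group.
zfs set sync=disabled tank/fdb

# Stretch the transaction-group flush interval from the default
# 5 seconds to 10 minutes (value is in seconds).
echo 600 > /sys/module/zfs/parameters/zfs_txg_timeout
```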


Short answer: I am almost certain you can’t do that without code changes.

Longer answer:

There are two systems involved in the write path: the transaction log and the storage servers, and you have to look at them separately. The storage servers only write to disk every ~500ms IIRC. This is beneficial, since writing to a B-tree is usually faster in batches, and the MVCC window is in memory anyway. You could probably set this interval to 10 minutes and it would just work (assuming you have enough memory), though it will increase pressure on the transaction log, which will result in longer recovery times (but it sounds like you’re fine with that).
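If you want to experiment with that interval, server behavior like this is controlled through knobs, which can be set per process in `foundationdb.conf`. A sketch, with the caveat that the knob name (`storage_commit_interval`) and its ~0.5s default should be verified against the `ServerKnobs` source of your FDB version:

```shell
# /etc/foundationdb/foundationdb.conf (excerpt)
# Knob names and defaults vary by version -- check ServerKnobs first.
[fdbserver]
knob_storage_commit_interval = 600
```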

The bigger problem is the transaction log: FDB will simply refuse to acknowledge a commit before it has been written to all transaction logs and an fsync has returned successfully. Acknowledging earlier would violate the ACID guarantees, and such a feature simply doesn’t exist.

I am not sure whether the filesystem trick will work, but keep in mind that we use O_DIRECT; I’m not sure exactly how the filesystem implementation will handle that situation.

This being said: btrfs and ZFS probably don’t perform well for FDB (or any B-tree). So my guess is that you’ll get worse performance even if it works. (I am not criticizing these filesystems; they’re amazing and I am a big fan, but they’re not built for B-trees. For a storage engine you want something simple, and ZFS is basically its own storage engine.)
