FoundationDB tuning advice

I have created a tuning advice post on my blog based on my (limited) experience, advice from this forum, and reading the source code: https://semisol.dev/blog/fdb-tuning/

Please share your thoughts, and any experiences at scale that I could add.

6 Likes

Thank you for the information. I got some useful tips, but I was also left confused by a number of points because the recommendations didn’t always come with the reasoning behind them.

  1. Why are non-local disks OK for larger clusters, but bad for small clusters?
  2. I’m unsure about the reasoning behind splitting services between nodes. Typically my nodes can take 8 or 12 NVMe drives and have 24 or 48 cores, so I have plenty of cores to run stateless services in addition to SS and TLog. Will I not save bandwidth by having multiple services on one node? And how should I quantify my bandwidth needs relative to my raw read/write rates? How much bandwidth do I need in a 12-node cluster to sustain 100 MB/s writes and 100 MB/s reads?
  3. For the no-RAID suggestion, is this based on cost per GB stored, performance, reliability, or something else? The TLogs seem like they could benefit from RAID 0, or is this unlikely to be a limitation on the write path?
  4. The remarks about stateless services being CPU intensive sound like I should choose high-frequency cores over high core count. What is a good minimum core count for a stateless node?
  5. 8 GB per stateless process: do you see usage actually getting that high, and what are the causes?
  6. The number of SS processes per disk: what is the reasoning here? The drives can sustain 100 parallel commands, so I was under the impression that I could have many SS processes per disk. If the TLog takes 1 process and I have 47 cores remaining and 6-7 disks, why not run 8 SS processes per disk?
  7. Redwood storage: it looks like upcoming changes will give clusters with RocksDB more capabilities, and Redwood isn’t seeing active development. Why isn’t RocksDB the recommended choice?
2 Likes

In a small cluster you are usually sharing TLog and SS on the same disk, which makes any latency impact from a non-local disk worse. Non-local disks are still not recommended for larger clusters, though.

I cannot comment on writes, but I can push over 300 MB/s of reads on a 3-node cluster with 4 SS per disk (2 disks), plus 2 stateless processes and 1 TLog per disk.

You should easily be able to achieve those write rates with 3-5 nodes and a dedicated TLog disk.
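For reference, a rough sketch of what that kind of per-node layout looks like in foundationdb.conf; ports, paths, and the exact counts here are illustrative, not taken from the cluster above:

```ini
# One node: 2 NVMe disks, 4 storage servers + 1 TLog per disk, 2 stateless.
[fdbserver]
command = /usr/sbin/fdbserver
logdir = /var/log/foundationdb

# Disk 1 storage servers (repeat the pattern for ports 4501-4503)
[fdbserver.4500]
class = storage
datadir = /mnt/nvme0/fdb/4500

# Disk 2 storage servers (repeat the pattern for ports 4505-4507)
[fdbserver.4504]
class = storage
datadir = /mnt/nvme1/fdb/4504

# One TLog per disk; "transaction"-class processes preferentially run TLogs
[fdbserver.4508]
class = transaction
datadir = /mnt/nvme0/fdb/4508

[fdbserver.4509]
class = transaction
datadir = /mnt/nvme1/fdb/4509

# Stateless roles (commit proxies, GRV proxies, resolvers, ...)
[fdbserver.4510]
class = stateless
datadir = /mnt/nvme0/fdb/4510

[fdbserver.4511]
class = stateless
datadir = /mnt/nvme1/fdb/4511
```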

The reason to split is that after a certain point you can no longer keep a good chunk of the roles on a tight cluster of nodes, and at that point it is more performant to split them out.

It is all three: performance, cost, and reliability. You can compensate for the lack of RAID with a higher replication factor; this is also better because the probability of failure roughly ranks as: single node, single disk > single node, all disks > multiple nodes, one disk each.

Many things like PSU/power failures, kernel state corruption, or hardware issues could mean your disk is dead or otherwise corrupted with junk.

Cost is also better: instead of double replication on top of RAID 1 (4 disks), you can run triple replication with no RAID (3 disks) at a much lower cost.

Performance is better as well: if one disk experiences degraded performance, it only impacts the SSes on that disk instead of all the SSes sharing a RAID array.
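As a concrete example, moving to triple replication (so durability comes from copies on three separate machines rather than from per-node RAID) is a single fdbcli command; shown here with the `ssd` storage engine, substitute your engine of choice:

```bash
# Set triple replication; data is kept on three separate machines,
# so per-node RAID is not needed for durability.
fdbcli --exec 'configure triple ssd'
```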

TLogs cannot really benefit from RAID 0: their workload is primarily sequential writes (and not very many of them), with fsyncs being the main bottleneck. They benefit more from enterprise drives, which have good fsync performance thanks to power-loss protection (PLP).
RAID 0 could even bottleneck them more, because the fsync time on a RAID 0 array is the maximum of its member disks.
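If you want to check whether fsync latency is the limiting factor on a given drive, one rough way is a small fio job that issues fdatasync after every write (the same style of test commonly used to qualify disks for etcd); the file location and sizes below are arbitrary:

```bash
# Rough fsync-latency check for a candidate TLog disk: sequential 4 KiB
# writes with an fdatasync after each one; look at the reported sync
# latency percentiles in the output.
fio --name=tlog-fsync-test --directory=/mnt/nvme0 \
    --rw=write --ioengine=sync --bs=4k --size=256m --fdatasync=1
```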

A stateless process needs 1 core; that is the most it can use. Do not count HT “cores” when sizing, though keep HT on, since the kernel and some background processes also need CPU time.

Yes. This is influenced by the following (see the config sketch after the list):

  • cache size
  • your storage engine (Redwood keeps a bunch of data in memory)
  • your KV data size (the byte sample grows with it)
  • your write throughput: the storage server has to keep the last 5 seconds of writes in memory to allow reads at older versions
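The config sketch mentioned above: the per-process memory knobs live in foundationdb.conf, and the values below are (as far as I know) the shipped defaults; check the docs for your FDB version:

```ini
[fdbserver]
memory = 8GiB        # per-process memory limit; the process is killed if it exceeds this
cache_memory = 2GiB  # page cache size, the main storage server caching knob
```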

You should not share TLog disks with SS processes unless you have a low write-to-read ratio and a small cluster (which is why I do it).

FDB already batches I/O to some extent, so it is best to tune the number of SSes based on your load.
If you see very high run loop utilization on your SSes, check whether you can get more performance by adding another SS to a disk: if the disk can process more overall, good; if it can’t, you have found the saturation point.
For many newer enterprise NVMes with Redwood, 4 per disk works pretty well.
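Run loop utilization is visible in `status json`; in recent FDB versions each process reports a `run_loop_busy` fraction, so something like this (assuming jq is available) gives a quick per-process view:

```bash
# Print address, process class, and run loop busyness (0.0-1.0) per process;
# storage servers sitting near 1.0 are CPU-saturated.
fdbcli --exec 'status json' | \
  jq -r '.cluster.processes[] | "\(.address)\t\(.class_type)\t\(.run_loop_busy)"'
```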

Redwood as a storage engine is not currently under active development, but it is stable. It can achieve very high range-clear throughput and the highest read performance, along with being reasonably good at compressing common key prefixes.

RocksDB, being an LSM-based design, is not as good at range reads. Its main benefit is storage space efficiency, if you want to squeeze out every last GB.
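If you want to try either engine, the switch is a `configure` in fdbcli and triggers a gradual data migration. The exact engine names depend on your FDB version (for example, Redwood was `ssd-redwood-1-experimental` in 7.1 and dropped the suffix later), so treat these as illustrative:

```bash
# Switch the cluster's storage engine (data migrates in the background).
fdbcli --exec 'configure ssd-redwood-1'     # Redwood
fdbcli --exec 'configure ssd-rocksdb-v1'    # RocksDB
```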

1 Like