FoundationDB tuning advice

I have created a tuning advice post on my blog based on my (limited) experience, advice from this forum, and reading the source code: https://semisol.dev/blog/fdb-tuning/

Please share your thoughts, and any experiences at scale that I could add.

6 Likes

Thank you for the information. I got some useful tips, but I was also left confused by a number of points because the recommendations didn’t always come with the reasoning behind them.

  1. Why are non-local disks OK for larger clusters, but bad for small clusters?
  2. I’m unsure about the reasoning behind splitting services between nodes. Typically my nodes can take 8 or 12 NVMe drives and have 24 or 48 cores, so I have plenty of cores to run stateless services in addition to SS and TLog. Will I not save bandwidth by having multiple services on one node? And how should I quantify my bandwidth needs relative to my raw read/write rates? How much bandwidth do I need in a 12-node cluster to sustain 100 MB/s writes and 100 MB/s reads?
  3. For the no-RAID suggestion, is this based on cost per GB stored, performance, reliability, or something else? The TLogs seem like they could benefit from RAID 0, or is this unlikely to be a limitation on the write path?
  4. The remarks about stateless services being CPU intensive sound like I should choose high-frequency cores over high core count. What is a good minimum core count for a stateless node?
  5. 8 GB per stateless process: do you see usage actually getting that high, and what are the causes?
  6. The number of SS processes per disk: what is the reasoning here? The drives can sustain 100 parallel commands, so I was under the impression that I could have many SS processes per disk. If the TLog takes 1 process and I have 47 cores remaining and 6-7 disks, why not run 8 SS processes per disk?
  7. Redwood storage: it looks like upcoming changes will give clusters with RocksDB more capabilities, and Redwood isn’t seeing active development. Why isn’t RocksDB the recommended choice?
2 Likes

In a small cluster you are usually sharing TLog and SS on the same disk, which makes any latency impact from a non-local disk worse. Non-local disks are still not recommended for larger clusters, though.

I cannot comment on writes, but I can push over 300 MB/s of reads on a 3-node cluster with 4 SS per disk (2 disks), plus 2 stateless processes and 1 TLog per disk.

You should easily be able to achieve those write rates with 3-5 nodes and a dedicated TLog disk.
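For reference, a rough sketch of what that kind of per-node layout looks like in foundationdb.conf; ports, paths, and the exact counts here are illustrative, not taken from the cluster above:

```ini
# One node: 2 NVMe disks, 4 storage servers + 1 TLog per disk, 2 stateless.
[fdbserver]
command = /usr/sbin/fdbserver
logdir = /var/log/foundationdb

# Disk 1 storage servers (repeat the pattern for ports 4501-4503)
[fdbserver.4500]
class = storage
datadir = /mnt/nvme0/fdb/4500

# Disk 2 storage servers (repeat the pattern for ports 4505-4507)
[fdbserver.4504]
class = storage
datadir = /mnt/nvme1/fdb/4504

# One TLog per disk; "transaction"-class processes preferentially run TLogs
[fdbserver.4508]
class = transaction
datadir = /mnt/nvme0/fdb/4508

[fdbserver.4509]
class = transaction
datadir = /mnt/nvme1/fdb/4509

# Stateless roles (commit proxies, GRV proxies, resolvers, ...)
[fdbserver.4510]
class = stateless
datadir = /mnt/nvme0/fdb/4510

[fdbserver.4511]
class = stateless
datadir = /mnt/nvme1/fdb/4511
```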

The reason to split is that after a certain point you can no longer keep a good chunk of the roles on a tight cluster of nodes, and at that point it is more performant to split them out.

It is all three: performance, cost, and reliability. You can compensate for the lack of RAID with a higher replication factor; this is also better because the probability of failure roughly ranks as: single node, single disk > single node, all disks > multiple nodes, one disk each.

Many things like PSU/power failures, kernel state corruption, or hardware issues could mean your disk is dead or otherwise corrupted with junk.

Cost is also better: instead of double replication on top of RAID 1 (4 disks), you can run triple replication with no RAID (3 disks) at a much lower cost.

Performance is better as well: if one disk experiences degraded performance, it only impacts the SSes on that disk instead of all the SSes sharing a RAID array.
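As a concrete example, moving to triple replication (so durability comes from copies on three separate machines rather than from per-node RAID) is a single fdbcli command; shown here with the `ssd` storage engine, substitute your engine of choice:

```bash
# Set triple replication; data is kept on three separate machines,
# so per-node RAID is not needed for durability.
fdbcli --exec 'configure triple ssd'
```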

TLogs cannot really benefit from RAID 0: their workload is primarily sequential writes (and not very many of them), with fsyncs being the main bottleneck. They benefit more from enterprise drives, which have good fsync performance thanks to power-loss protection (PLP).
RAID 0 could even bottleneck them more, because the fsync time on a RAID 0 array is the maximum of its member disks.
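If you want to check whether fsync latency is the limiting factor on a given drive, one rough way is a small fio job that issues fdatasync after every write (the same style of test commonly used to qualify disks for etcd); the file location and sizes below are arbitrary:

```bash
# Rough fsync-latency check for a candidate TLog disk: sequential 4 KiB
# writes with an fdatasync after each one; look at the reported sync
# latency percentiles in the output.
fio --name=tlog-fsync-test --directory=/mnt/nvme0 \
    --rw=write --ioengine=sync --bs=4k --size=256m --fdatasync=1
```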

A stateless process needs 1 core; that is the most it can use. Do not count HT “cores” when sizing, though keep HT on, since the kernel and some background processes also need CPU time.

Yes. This is influenced by the following (see the config sketch after the list):

  • cache size
  • your storage engine (Redwood keeps a bunch of data in memory)
  • your KV data size (the byte sample grows with it)
  • your write throughput: the storage server has to keep the last 5 seconds of writes in memory to allow reads at older versions
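The config sketch mentioned above: the per-process memory knobs live in foundationdb.conf, and the values below are (as far as I know) the shipped defaults; check the docs for your FDB version:

```ini
[fdbserver]
memory = 8GiB        # per-process memory limit; the process is killed if it exceeds this
cache_memory = 2GiB  # page cache size, the main storage server caching knob
```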

You should not share TLog disks with SS processes unless you have a low write-to-read ratio and a small cluster (which is why I do it).

FDB already batches I/O to some extent, so it is best to tune the number of SSes based on your load.
If you see very high run loop utilization on your SSes, check whether you can get more performance by adding another SS to a disk: if the disk can process more overall, good; if it can’t, you have found the saturation point.
For many newer enterprise NVMes with Redwood, 4 per disk works pretty well.
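Run loop utilization is visible in `status json`; in recent FDB versions each process reports a `run_loop_busy` fraction, so something like this (assuming jq is available) gives a quick per-process view:

```bash
# Print address, process class, and run loop busyness (0.0-1.0) per process;
# storage servers sitting near 1.0 are CPU-saturated.
fdbcli --exec 'status json' | \
  jq -r '.cluster.processes[] | "\(.address)\t\(.class_type)\t\(.run_loop_busy)"'
```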

Redwood as a storage engine is not currently under active development, but it is stable. It can achieve very high range-clear throughput and the highest read performance, along with being reasonably good at compressing common key prefixes.

RocksDB, being an LSM-based design, is not as good at range reads. Its main benefit is storage space efficiency, if you want to squeeze out every last GB.
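If you want to try either engine, the switch is a `configure` in fdbcli and triggers a gradual data migration. The exact engine names depend on your FDB version (for example, Redwood was `ssd-redwood-1-experimental` in 7.1 and dropped the suffix later), so treat these as illustrative:

```bash
# Switch the cluster's storage engine (data migrates in the background).
fdbcli --exec 'configure ssd-redwood-1'     # Redwood
fdbcli --exec 'configure ssd-rocksdb-v1'    # RocksDB
```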

1 Like