I have written a tuning advice post on my blog, based on my (limited) experience, advice from this forum, and reading the source code: https://semisol.dev/blog/fdb-tuning/
Please share your thoughts, and any experiences at scale that I could add.
Thank you for the information. I picked up some useful tips, but I was also left confused on a number of points because the recommendations didn't always come with the reasoning behind them.
You are usually sharing the TLog and SSes on the same disk, which makes any latency impact worse. Non-local disks are still not recommended for larger clusters.
I cannot comment on writes, but I can push over 300 MB/s of reads on a 3-node cluster with 4 SSes per disk (2 disks each), plus 2 stateless processes and 1 TLog per disk.
You should easily be able to achieve those writes with 3-5 nodes and a dedicated TLog disk.
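For concreteness, here is roughly what a dedicated TLog disk looks like in `foundationdb.conf`. This is only a sketch: the ports and mount points are made up, and you would repeat the storage sections once per SS:

```ini
[fdbserver]
command = /usr/sbin/fdbserver
logdir = /var/log/foundationdb

# SS processes sharing a data disk (two shown; add a section per SS)
[fdbserver.4500]
class = storage
datadir = /mnt/data1/fdb/4500

[fdbserver.4501]
class = storage
datadir = /mnt/data1/fdb/4501

# TLog on its own disk, so SS traffic never queues in front of its fsyncs
[fdbserver.4510]
class = transaction
datadir = /mnt/tlog1/fdb/4510
```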
The reason to split roles is that after a certain point you can no longer keep a good chunk of the roles on a tight cluster of nodes, and it becomes more performant to split them out.
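Mechanically, the split is just class assignments like the conf sketch above, plus telling the cluster how many of each role to recruit. Something like the following (the counts are purely illustrative, not a recommendation; `commit_proxies`/`grv_proxies` are the 7.0+ names, older versions use `proxies`):

```sh
fdbcli --exec 'configure logs=8 commit_proxies=4 grv_proxies=2 resolvers=2'
```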
This is about performance, cost, and reliability. You can compensate for the lack of RAID with a higher replication factor; this is also better because the failure probabilities are roughly ordered: single node, single disk > single node, all disks > multiple nodes, one disk each.
Many things like PSU/power failures, kernel state corruption, or hardware issues could mean your disk is dead or otherwise corrupted with junk.
Cost is obviously better: instead of double replication with RAID 1 (4 disks), you can run triple replication with no RAID (3 disks) for much cheaper.
Performance is better as well: if one disk degrades, it only impacts the SSes on that disk instead of all SSes sharing a RAID array.
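In case it is useful: the "replication instead of RAID" choice is a single `configure` call, e.g. (the engine name here is just an example, pick yours):

```sh
fdbcli --exec 'configure triple ssd'
```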
TLogs cannot benefit from RAID 0: the workload is primarily sequential writes (and not many of them), with fsyncs as the main bottleneck. They benefit more from enterprise drives, which have good fsync performance due to power loss protection (PLP).
RAID 0 could even bottleneck it further, as the fsync time of a RAID 0 array is the maximum across its member disks.
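If you want to sanity-check a drive's fsync behavior before putting a TLog on it, a quick fio run approximates the pattern. A rough sketch (the target path is hypothetical; size and runtime are arbitrary):

```sh
# Sequential 4k writes with an fdatasync after every write,
# which roughly mimics a TLog's commit path
fio --name=tlog-sim --filename=/mnt/tlog1/fio.test --rw=write --bs=4k \
    --fdatasync=1 --size=1G --runtime=30 --time_based --ioengine=psync
```

Drives with PLP typically show sync latencies in the tens of microseconds here; consumer drives are often in the milliseconds.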
You need 1 core per stateless process; that is the most it can use. Do not count HT "cores", though keep HT on, since the kernel and some background processes need to run too.
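In conf terms that is just one `[fdbserver.<port>]` section per budgeted physical core (ports hypothetical):

```ini
# Two stateless processes -> budget two physical cores for them
[fdbserver.4520]
class = stateless

[fdbserver.4521]
class = stateless
```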
Yes. This is influenced by a few things. You should not share TLog disks with SSes unless you have a low write-to-read ratio and a small cluster (this is why I do it). FDB already batches I/O to some extent; it is best to tune the number of SSes depending on your load.
If you see very high run loop utilization on your SSes, try adding another SS to the disk and check whether overall throughput goes up. If it does, good; if it does not, you have found the saturation point.
For many newer enterprise NVMe drives with Redwood, 4 SSes per disk works pretty well.
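For the run loop check, `status json` reports a per-process `run_loop_busy` fraction (0 to 1). A quick way to eyeball it, assuming you have jq installed and the field name matches your FDB version:

```sh
fdbcli --exec 'status json' \
  | jq '.cluster.processes[] | {addr: .address, run_loop_busy}'
```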
Redwood as a storage engine is currently unmaintained, but it is stable. It achieves very high range-clear throughput and the highest read performance, and it is fairly good at compressing common prefixes.
RocksDB, being an LSM-based design, is not as good at range reads. Its benefits are mostly in storage efficiency, if you want to squeeze out every last GB.
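Engine selection is also just a `configure` call. Note that the Redwood string has changed across releases (it was `ssd-redwood-1-experimental` before being promoted, so check what your version accepts), and switching engines triggers a gradual rewrite of all data:

```sh
# Redwood
fdbcli --exec 'configure ssd-redwood-1'

# RocksDB (experimental as of 7.1)
fdbcli --exec 'configure ssd-rocksdb-v1'
```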