I have written a tuning advice post on my blog based on my (limited) experience, advice from this forum, and reading the source code: https://semisol.dev/blog/fdb-tuning/
Please share your thoughts, as well as any experiences at scale that I could add.
Thank you for the information. I picked up some useful tips, but I was also left confused by a number of points because the recommendations didn’t always come with the reasoning behind them.
Why are non-local disks ok for larger clusters, but bad for small clusters?
I’m unsure about the reasoning behind splitting services between nodes. My nodes typically take 8 or 12 NVMe drives and have 24 or 48 cores, so I have plenty of cores to run stateless services in addition to SS and TLog. Won’t I save bandwidth by having multiple services on one node? Alternatively, how should I quantify my bandwidth needs relative to my raw read/write rates? How much bandwidth do I need in a 12-node cluster to sustain 100 MB/s of writes and 100 MB/s of reads?
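One way to frame the bandwidth question is a back-of-the-envelope model. The sketch below is hypothetical: it assumes each written byte goes from a client to a commit proxy, then to every TLog replica, then to every storage replica (3 each for triple replication), that read bytes cross the network once, that protocol overhead, data distribution, and recovery traffic can be ignored, and that sender and receiver share a node with probability 1/nodes. The function names are made up for illustration.

```python
# Rough network-bandwidth estimate for a FoundationDB cluster.
# Hypothetical model (assumptions, not measured FDB behaviour):
#   * each logical write byte travels client -> commit proxy once, then to
#     every TLog replica, then to every storage-server replica;
#   * each logical read byte travels storage server -> client once;
#   * protocol overhead, data distribution, and recovery traffic are ignored;
#   * sender and receiver land on the same node with probability 1/nodes.


def total_traffic_mb_s(write_mb_s: float, read_mb_s: float,
                       log_replicas: int = 3, storage_replicas: int = 3) -> float:
    """Total MB/s moved between processes (local or over the network)."""
    write_hops = 1 + log_replicas + storage_replicas
    return write_mb_s * write_hops + read_mb_s


def per_node_egress_mb_s(total_mb_s: float, nodes: int) -> float:
    """Average MB/s each node sends over the network (ingress averages the same)."""
    cross_node = total_mb_s * (1 - 1 / nodes)  # share that leaves its host
    return cross_node / nodes


def mb_s_to_gbit_s(mb_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mb_s * 8 / 1000


if __name__ == "__main__":
    total = total_traffic_mb_s(write_mb_s=100, read_mb_s=100)
    egress = per_node_egress_mb_s(total, nodes=12)
    print(f"total {total:.0f} MB/s, per node {egress:.1f} MB/s "
          f"(~{mb_s_to_gbit_s(egress):.2f} Gbit/s each way)")
```

With the numbers above (12 nodes, 100 MB/s of writes and reads), this model gives 800 MB/s of total traffic and roughly 61 MB/s, about 0.5 Gbit/s, per node in each direction, before any headroom. Under the same assumptions, colocating roles keeps only about 1/12 of that traffic on-host.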
For the no-RAID suggestion, is this based on cost per GB stored, performance, reliability, or something else? The TLogs seem like they could benefit from RAID 0, or is that unlikely to be a limitation on the write path?
The remarks about stateless services being CPU-intensive make it sound like I should choose high-frequency cores over a high core count. What is a good minimum core count for a stateless node?
8 GB per stateless process: do you see usage getting that high, and if so, what causes it?
The number of SS processes per disk: what is the reasoning here? The drives can sustain 100 parallel commands, so I was under the impression that I could run many SS processes per disk. If the TLog takes 1 process and I have 47 cores remaining and 6 or 7 disks, why not run 8 SS processes per disk?
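A quick check of the core budget implied by this layout, assuming roughly one core per fdbserver process since each process runs its main loop on a single thread. The helper names are hypothetical, and stateless processes default to zero.

```python
# Core-budget check for a per-node process layout.
# Assumption: each fdbserver process needs about one core (single main
# thread); auxiliary threads such as disk I/O helpers are ignored.


def processes_needed(disks: int, ss_per_disk: int, tlogs: int = 1,
                     stateless: int = 0) -> int:
    """Total fdbserver processes for the given per-node layout."""
    return disks * ss_per_disk + tlogs + stateless


def fits(cores: int, disks: int, ss_per_disk: int, tlogs: int = 1,
         stateless: int = 0) -> bool:
    """True if the layout needs no more processes than there are cores."""
    return processes_needed(disks, ss_per_disk, tlogs, stateless) <= cores


if __name__ == "__main__":
    for disks in (6, 7):
        need = processes_needed(disks, ss_per_disk=8)
        verdict = "fits" if fits(48, disks, 8) else "oversubscribed"
        print(f"{disks} disks x 8 SS + 1 TLog = {need} processes on 48 cores: {verdict}")
```

With 48 cores, 8 SS processes per disk already needs 49 processes for 6 disks and 57 for 7 disks, before any stateless processes are added, so under this assumption the layout oversubscribes the CPU regardless of how many parallel commands the drives accept.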
Redwood storage: it looks like upcoming changes will give clusters using RocksDB more capabilities, and Redwood isn’t seeing active development. Why isn’t RocksDB the recommended choice?