Production optimizations

The 1 log : 8 ssd storage server ratio is only a rough figure from what I’ve seen. I’m being specific about ssd because the memory storage engine can apply mutations to durable storage faster, so the ratio changes; I think it’s closer to 1:2, but it’s not something I’ve benchmarked as often.
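
As a back-of-envelope illustration of what those rules of thumb imply for provisioning (the 40 storage server fleet size is just an example, and the 1:8 and 1:2 figures are the rough ratios above, not measured constants):

```python
# Translate the rough log:storage ratios into log counts for an example fleet.
import math

storage_servers = 40  # example fleet size, not a recommendation

print(f"ssd engine:    ~{math.ceil(storage_servers / 8)} logs")  # 1 log : 8 storage servers
print(f"memory engine: ~{math.ceil(storage_servers / 2)} logs")  # closer to 1 log : 2 storage servers
```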

Ideally, you’d run a single ssd proxy=5 log=5 cluster with one storage server, run a write-heavy benchmark, and look at the trace files from that storage server to determine what rate of mutations one storage server can apply on your hardware. Then do the same with a single ssd proxy=5 log=1 cluster with 10 storage servers, and see what rate of mutations one tlog can support.
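
For the write-heavy benchmark part, a minimal sketch using the FoundationDB Python bindings might look like the following. The key prefix, batch size, value size, and duration are arbitrary placeholders, and you’d want several of these running in parallel to actually saturate the cluster:

```python
# Minimal write-heavy load generator, meant only to saturate the write path
# while you watch the storage servers' trace files.
import os
import time

import fdb

fdb.api_version(600)  # use the API version matching your installed client
db = fdb.open()       # uses the default cluster file


@fdb.transactional
def write_batch(tr, keys, value):
    # All writes in one batch land in a single transaction / single commit.
    for key in keys:
        tr[key] = value


def run(duration_s=60, batch_size=100, value_size=512):
    value = os.urandom(value_size)
    deadline = time.time() + duration_s
    written = 0
    while time.time() < deadline:
        keys = [b"bench/" + os.urandom(16) for _ in range(batch_size)]
        write_batch(db, keys, value)
        written += batch_size
    print(f"wrote {written} keys in {duration_s}s")


if __name__ == "__main__":
    run()
```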

These two figures then give you your ratio. Each additional tlog and proxy you add won’t give you quite as much benefit as the last, but the goal is to have enough tlogs to feed your storage servers at their full rate, plus a little extra headroom so that you aren’t running your cluster at 100% and getting poor latency as a result.
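
As a sketch of turning those two measurements into a provisioning estimate (the rates below are made-up placeholders for whatever your trace files actually showed):

```python
# Turn the two benchmark figures into a tlog count estimate, with headroom.
import math

ss_apply_rate = 150_000     # mutations/sec one storage server sustained (measured)
tlog_feed_rate = 1_200_000  # mutations/sec one tlog sustained (measured)
storage_servers = 40        # size of the cluster you intend to run
headroom = 0.8              # don't plan to run the tlogs at 100% utilization

# How many storage servers one tlog can keep fed at full write throughput.
storage_per_log = tlog_feed_rate / ss_apply_rate

# Number of tlogs needed to feed the whole storage fleet, with headroom.
logs_needed = math.ceil(storage_servers / (storage_per_log * headroom))

print(f"~1 log : {storage_per_log:.1f} storage servers")
print(f"{logs_needed} tlogs for {storage_servers} storage servers")
```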

This is, of course, assuming you’re targeting full write throughput. If you have a lot of cold data and care more about commit latency, then running with fewer proxies or logs than that maximum would be a better option.

I filed “Storage Server recruitment should consider existing recruited roles” (#552) a while ago, as explicitly assigning process classes is mostly about keeping storage servers away from other latency-critical parts of the system. I don’t really have any ideas on how we would do better auto-configuration of proxies and logs, as changing them invokes a recovery, and a recovery means O(hundreds of milliseconds) of write downtime / latency spike.