Production optimizations

Probably 80-90 of the 100 servers are just going to be running storage servers. It's probably slightly better not to assign them any process class, rather than assigning them the storage process class, so that in the case of failures you have processes to fall back on that can be recruited for other roles.
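As a sketch, in `foundationdb.conf` that would mean leaving most processes class-less and only pinning the few dedicated ones (the ports and the exact mix here are made up for illustration):

```ini
; foundationdb.conf (hypothetical ports)
[fdbserver.4500]
; no "class =" line: unset class, recruitable for any role, including storage

[fdbserver.4501]
class = transaction   ; dedicated transaction log process

[fdbserver.4502]
class = stateless     ; proxies, resolvers, cluster controller, etc.
```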

Mixing drive sizes doesn't really matter, depending on your workload. Data distribution aims to keep the percentage of used space roughly equal across storage servers, so nothing bad will happen if you run a cluster with half 1TB drives and half 10TB drives.

You will, however, have created a cluster where some data gets 10x more disk time than other data. If you're reading keys uniformly at random from your database, the 1TB storage servers are 10x more likely to have the requested data in memory, and when they don't, their drives see 10x fewer requests. If your data is largely cold, this won't matter. If you're latency sensitive, you'll want to size your drives equally, and preferably to match their IOPS as well.
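To make the skew concrete, here's a back-of-the-envelope sketch (the 50/50 split of a 100-server cluster is a made-up example): since data distribution balances percent-used, bytes stored scale with drive size, and uniform random reads land on servers in proportion to the bytes they hold.

```python
# Hypothetical cluster: 50 storage servers with 1TB drives, 50 with 10TB drives.
# Data distribution keeps %-used equal, so data held scales with drive size.
drives_tb = [1] * 50 + [10] * 50

total_tb = sum(drives_tb)
# Share of uniformly random reads landing on one server of each size:
reads_per_1tb_server = 1 / total_tb
reads_per_10tb_server = 10 / total_tb

print(reads_per_10tb_server / reads_per_1tb_server)  # -> 10.0
```

So each 10TB server absorbs 10x the read traffic of a 1TB server, while having the same amount of RAM to cache 10x as much data.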

Think of coordinators as a ZooKeeper cluster. You'll want to run 2n+1 of them, where n is the number of coordinator failures you can tolerate and keep running. If you're running triple replication, then n=2. If you're running double replication, then n=1.

If you're running across multiple DCs, it's best to adjust this math a little, so that you have 2n+1 coordinators per DC, as always having a local coordinator will save you WAN traffic.
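The 2n+1 math above, including the per-DC adjustment, can be written out as a tiny helper (the function name and the per-DC multiplication are my own framing of the rule, not anything FDB-specific):

```python
def coordinators_needed(tolerated_failures: int, datacenters: int = 1) -> int:
    """2n+1 coordinators per DC, where n is the number of
    coordinator failures to tolerate while staying available."""
    return (2 * tolerated_failures + 1) * datacenters

print(coordinators_needed(2))     # triple replication, single DC -> 5
print(coordinators_needed(1, 3))  # double replication, three DCs -> 9
```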

The exact best number of logs and proxies is going to vary strongly based on your workload. You essentially want the minimum number of them that can still support your load with some overhead left for safety.

My very rough guess would be that, when running with the ssd storage engine, you'll probably want about 1 log for every 8 storage servers, and around 1 proxy per log. But I would strongly suggest running a representative performance test, and varying the number of proxies and logs until you find the optimal spot.
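As a starting point only, that rule of thumb looks like this (the function and the 8:1 default are just restating my rough guess above, not a tuned recommendation):

```python
def suggested_process_counts(storage_servers: int, storage_per_log: int = 8):
    """Rough starting point for the ssd storage engine:
    ~1 log per 8 storage servers, ~1 proxy per log.
    Tune against a representative benchmark from here."""
    logs = max(1, round(storage_servers / storage_per_log))
    proxies = logs
    return logs, proxies

print(suggested_process_counts(80))  # -> (10, 10)
```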

Quite possibly, yes. The extra cores could be useful to run stateless roles, but they have their own demands. Proxies and logs are both network bandwidth hungry, as all commits in the system flow through them, so it’s good to plan out their network bandwidth allocations as well.

I’m used to thinking about this in a bare metal world, and not one in which your disk (and network?) scales with the number of cores you have. FDB, overall, should be more disk and network bottlenecked than CPU, as its primary job is to do IO.

If you're writing any layers to run on top of FDB, then extra cores alongside your storage servers are great places to run your layer processes – both from a latency and a binpacking perspective.
