Proxies and packet loss/retransmits

I have three machines running on a shared datacenter network that sees some packet loss:

[screenshot]

When I switch from running one proxy on each machine to a single proxy on one machine, the worst-case latency improves considerably:

[screenshot]

The current single proxy server is running at about 20% CPU.

The documentation advises against reducing the default number of proxy servers. Is this because a single proxy server may be a bottleneck at a higher throughput, or are there other reasons why there should be more than one even at smaller workloads? If a single proxy is a bad idea, I saw one forum post where a user had multiple proxy processes running on a single server - would that be a sensible alternative?

Getting a read version requires all the proxies to talk to each other. With one proxy, there is no such communication. If you run multiple proxies on one host, they only do local TCP delivery, which is also faster. Given the network flakiness that you have, the same packet/connection drop and retransmit tail latency that affects client requests will also happen between proxies, leading to doubly high GRV latency. At commit time, proxies talk to all the resolvers and all the TLogs, leading to potentially triply high latency.
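A rough way to see why extra hops hurt the tail is to estimate the chance that at least one message on the critical path gets dropped. The sketch below is only a back-of-the-envelope model: the 1% loss rate, the message counts, and the function name `p_any_retransmit` are all assumptions for illustration, and it treats losses as independent, which a real network may not honor.

```python
def p_any_retransmit(per_message_loss: float, messages: int) -> float:
    """Chance that at least one of `messages` independent messages is lost."""
    if not 0.0 <= per_message_loss <= 1.0:
        raise ValueError("per_message_loss must be between 0 and 1")
    if messages < 0:
        raise ValueError("messages must be non-negative")
    # All messages survive with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - per_message_loss) ** messages


# Assumed numbers, not measurements from this cluster.
ASSUMED_LOSS = 0.01

# Assumed message counts for one read-version request:
#   single proxy: client request + reply
#   three proxies on separate hosts: the above, plus a request + reply
#   to each of the two other proxies
scenarios = {
    "single proxy": 2,
    "three proxies on separate hosts": 6,
}

for name, messages in scenarios.items():
    chance = p_any_retransmit(ASSUMED_LOSS, messages)
    print(f"{name}: {chance:.2%} chance of hitting a retransmit")
```

Since a retransmit typically costs far more than a normal round trip, even a small rise in this probability tends to show up directly in the worst-case latency numbers.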

Running proxies on separate machines is recommended because commits are first sent to the proxy. Running them all on one machine means that all commits to your database go through one machine, and thus the write bandwidth for your database is limited by what one NIC can achieve. If write bandwidth isn’t a concern, and latency is, then running multiple proxies on one machine is fine. Whether to run one proxy or several in that situation would depend purely on CPU usage. If your benchmarks of how you expect your application to use FDB show that with one proxy your latency is fine and the CPU/memory of that one proxy is fine, then it sounds like proxies=1 is a perfectly fine configuration for you. I’d recommend placing the resolver on the same machine in that case as well.
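For reference, the proxies=1 setting mentioned above is applied through `fdbcli`'s `configure` command. The sketch below only builds and prints that command rather than running it. The helper name `build_configure_command` is hypothetical, and the `proxies` option name is taken from this thread, so check that it matches the FDB version you are running before using it.

```python
import shlex


def build_configure_command(proxies: int) -> list[str]:
    """Build (but do not run) an fdbcli invocation that sets the proxy count."""
    if proxies < 1:
        raise ValueError("proxies must be at least 1")
    # `--exec` runs a single fdbcli command non-interactively.
    return ["fdbcli", "--exec", f"configure proxies={proxies}"]


command = build_configure_command(1)
# Print a shell-safe rendering so it can be reviewed before running by hand.
print(shlex.join(command))
```

Printing the command first keeps the change deliberate; it can then be pasted into a shell on a machine that has access to the cluster file.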

Appreciate you taking the time to reply, Alex. Moving the resolver seems to have helped a little as well.

The datacenter provides the option of an additional private network, but it requires all machines on the network to be in the same rack. I thought this would mean I’d have to keep all the log and storage servers in one spot if I wanted to improve the latency, so I’m happy to see that simply reducing the proxies is an option for now. And if the workload outgrows what a single machine can handle, I’m guessing one option would be to run a small set of machines on a private network to handle client+proxy+resolver traffic, while keeping the log and storage servers in separate locations. I’m impressed at the flexibility that fdb’s separate server roles afford.