Transaction/operation throughput

I’m attempting to benchmark FoundationDB for a write-heavy workload*, and I’m struggling to max out the storage engine(s). I’m wondering if there are known limitations to transaction/operation throughput per cluster, and/or whether there are places I could start looking into to potentially contribute changes that improve that throughput.

I’ve followed a number of the performance-related threads here, so here’s a fair amount of detail on what I’ve come up with so far.

I’ve been comparing the memory storage engine on r5d.2xls to the Redwood engine on i3.2xls. The performance gap between the two seems pretty small, maybe 10-20% more throughput for the in-memory engine. For both the in-memory and the ssd runs, the rate limiter doesn’t seem to report the storage engines getting backed up at all, though I do see issues listed about initiating batch priority transactions. With the i3s, I’m seeing roughly 20-30k iops. The AWS documentation indicates those instance types can do up to 180k iops, so there should be a good deal of headroom. The write queue shows a lot more variance, ranging from ~5 to ~2k.

Drawing inspiration from this topic, I’ve been explicitly laying out the processes. Each box runs four FDB processes. Storage boxes have four storage processes, and the log boxes have one log process, one proxy process and two resolver processes. The cluster_controller and master are added as a fifth process on two random log boxes. After trying a bunch of different configurations, I ended up with 128 storage boxes (512 processes), 32 logs, 32 proxies and 64 resolvers. I’m writing to the cluster from 32 different client processes.
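For concreteness, the layout is done with process classes in foundationdb.conf; the sketch below shows roughly what a storage box and a log box look like. The ports are placeholders and the class names are from memory, so double-check them against the configuration docs.

```ini
## foundationdb.conf on each storage box: four storage-class processes
[fdbserver.4500]
class = storage
[fdbserver.4501]
class = storage
[fdbserver.4502]
class = storage
[fdbserver.4503]
class = storage

## foundationdb.conf on each log box: one tlog, one proxy, two resolvers
[fdbserver.4500]
class = transaction   ; tlog process
[fdbserver.4501]
class = proxy
[fdbserver.4502]
class = resolution    ; resolver
[fdbserver.4503]
class = resolution
```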

Data seems reasonably well distributed, and I don’t see any outliers on any system metric. Network traffic stays under 80 MiB/s, so I don’t think we’re getting limited there. CPU/load is pretty low for the log servers. For the storage servers, load ranges from 0.3 to 0.6 and CPU usage is between 20% and 40%.

The keys average ~100 bytes, ranging between about 50 and 230 bytes. Each key is made up of one of two unique 5-byte prefixes, followed by one of a few thousand 8-byte integers (sequential IDs from an RDB), a few (usually) English words and a random 16-byte suffix. I’ve varied the number of keys written per batch and gotten the best performance around 200.
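In case the shape of the writes matters, here’s a rough sketch of the writer loop. The prefixes and word list are placeholders and the real code differs, but the key layout and the ~200 writes per transaction match what I described:

```python
import os
import random
import struct

import fdb

fdb.api_version(620)  # use whatever API version matches the installed client
db = fdb.open()

PREFIXES = [b"prfxA", b"prfxB"]                      # stand-ins for the two unique 5-byte prefixes
WORDS = [b"alpha", b"bravo", b"charlie", b"delta"]   # stand-in for the English words


def make_key(seq_id):
    """5-byte prefix + 8-byte integer + a few words + random 16-byte suffix."""
    prefix = random.choice(PREFIXES)
    body = struct.pack(">Q", seq_id)                 # 8-byte big-endian integer from the RDB
    words = b"/".join(random.sample(WORDS, 2))
    return prefix + body + b"/" + words + b"/" + os.urandom(16)


@fdb.transactional
def write_batch(tr, keys):
    # ~200 keys per transaction gave the best throughput for me
    for k in keys:
        tr[k] = b""                                  # keys only, no values


def run(seq_ids, batch_size=200):
    batch = []
    for seq_id in seq_ids:
        batch.append(make_key(seq_id))
        if len(batch) >= batch_size:
            write_batch(db, batch)
            batch = []
    if batch:
        write_batch(db, batch)
```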

* I’m aware that FoundationDB isn’t the best fit for some write-heavy workloads, but I have yet to see another open source ordered key-value store work more efficiently. I’m also optimistic about the idea of adding a RocksDB-backed storage backend, assuming we can get the transaction throughput we need.

I would start by turning down the number of logs, proxies, and resolvers by a lot.

Start with one resolver and something like 3 logs and 3 proxies. That probably won’t be your final configuration, but it’s a good place to start. You probably won’t need more than one resolver; just watch its CPU usage, and if it’s really high, that will let you know.
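In fdbcli, that starting point is just:

```
fdb> configure proxies=3 logs=3 resolvers=1
```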

The tlogs, proxies, and resolvers should only be scaled up when you observe a specific bottleneck in one of those components.

I would also suggest measuring the max iops and network throughput those machines can actually do, just in case.
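Something along these lines gives you a ceiling to compare against: fio against a scratch directory on the data volume, and iperf3 between two of the boxes. Options are from memory, so check the man pages.

```bash
# random 4k writes, roughly the pattern the ssd storage engine produces
fio --name=randwrite --directory=/mnt/fdb-scratch --size=4G --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting

# network ceiling between two boxes
iperf3 -s                        # on box A
iperf3 -c <box-A-address> -P 5   # on box B, 5 parallel streams
```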

Thanks for your help! I’m not done with the various testing I have planned in response to this, but I wanted to give some updates.

With the i3.2xls and the Redwood driver, I was able to get to the point where storage CPU was clearly limiting throughput. The rate limiter was not reporting the storage engines as slow, but a number of storage nodes had CPU usage > 90% in fdbtop and normalized load > 1. With the i3s and the older ssd engine, I ended up with similar throughput but no clear bottleneck: the storage instances all had CPU < 85% and load < 1, with only a handful of nodes having load above 0.9. Write iops were still only at about 30k, so I don’t think that was the limit.

Do you happen to have an estimate of what performance impact you would expect from using ~100 byte keys and no values vs the key/value sizes used in the benchmark? I’m seeing about 1.6 M ops/sec vs a projected > 10 M ops/sec from that page.

Thanks for this suggestion. Going through this in a more principled manner, I ended up with ~15 logs, 15 resolvers and 8 proxies on the i3.2xls with the Redwood engine, and got the same throughput as with the configuration above. At that point, performance was limited by storage engine CPU.

Unfortunately, looking for specific bottlenecks in the fdbtop output / system metrics for the logs/proxies/resolvers was not a very reliable guide. For example, with something like 6 logs, 3 proxies and 1 resolver, everything looked fine (almost all CPU below 85% in fdbtop, and network/memory/iops were never a problem for the log/stateless servers) except that the cluster was doing an order of magnitude less throughput than I had previously observed. Some of the time I would see CPU becoming a bottleneck, but for the most part I kept adding processes until that stopped improving performance, using CPU as a guide to which type of process to add.

I haven’t managed to measure iops just yet. We have tested the network on these boxes before and seen close to 5 Gb/s from 1 thread and 10 Gb/s with 5 threads, which is about what is advertised. I haven’t seen any host go above 1.5 Gb/s during these tests, so I can’t imagine network is the limiting factor at any point.

You should probably repeat your experiment with ssd as well. Redwood is the future, and I greatly appreciate you testing it, but it currently comes with no promise of stability and is very much still in development. Its CPU usage, in particular, is something that will continue to be worked on, so at minimum be aware that you’re benchmarking a moving target.

That page is exceedingly old, and I think the knowledge of what was even run and how has been lost. There’s been discussion of just removing the page. I’d honestly put more trust in the numbers that you’ve seen in other threads, or if Snowflake or Wavefront folk could share their numbers as to how a well-tuned FDB-on-cloud should perform.

Use one resolver. Always use one resolver. If you are saturated on resolver CPU, then use two resolvers. I’ve honestly never seen a cluster/workload yet that requires more than two resolvers. Enough people have hit this that I’m starting to be tempted to make configure resolvers=3 or higher return an error (or at least a stern warning).

Yep, I’ve been running with both. On the r5ds, the lower iops for Redwood made a difference. On the i3s, I seem to be hitting a bottleneck somewhere upstream of the storage engine.

This was very helpful. I think the issue I’m running into is that the load balancing across resolvers isn’t working properly for my use case. I ran the test with the in-memory storage engine* and only scaled up the number of resolvers when one or more resolvers had CPU > 95% for more than a couple of minutes, and ended up with 8 resolvers. I believe I saw a throughput increase from most (if not all) of those additions. Many of them involved one resolver being completely pinned (CPU > 99%) while the average resolver CPU was between 70 and 80%. The first ~9 bytes of my keys are made up of one of two prefixes, and one of the prefixes receives about 25x more writes than the other. While there is substantially more entropy in the keys beyond those prefixes, could it be that all of those writes are getting sent to one resolver and overwhelming it? It wouldn’t be hard to put a hash in the first byte to distribute the data more evenly. If that’s likely not the issue, do you have any pointers on where I should look to figure out where this imbalance is happening?

* I saw ~the same throughput with the in-memory engine as the ssd engine, which seems to confirm that the bottleneck is upstream of there.
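To be concrete, the hash change I mentioned would be roughly this (just a sketch of the idea, not the actual key schema):

```python
import hashlib


def spread_key(key):
    """Prepend one byte derived from a hash of the key so that writes are spread
    across 256 buckets of the keyspace instead of clustering under two prefixes.
    The obvious cost is that prefix-ordered range reads would then have to fan
    out across all 256 buckets."""
    bucket = hashlib.blake2b(key, digest_size=1).digest()
    return bucket + key
```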

The assignment of key ranges to resolvers can be shuffled and re-sharded to try to adapt to incoming workload characteristics. Given that I’m not aware of anyone running with >2 resolvers in production, it would be totally plausible to me that distributing ranges across resolvers isn’t a well-tuned thing.

Checking a large cluster that we have for saturation tests, which is also ~500 processes, it is configured with proxies=30 logs=30 resolvers=2. So I’m kind of surprised that your workload, on a similarly sized cluster also at saturation, requires 4x the amount of resolver CPU power. It’d be good to double-check your recruitment, and make sure you’re not accidentally sharing processes between resolvers and some other role.
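A quick way to eyeball recruitment is to pull status json and print the roles on each process, something like the sketch below. This assumes `fdbcli --exec` emits just the JSON, and the field names are from memory, so adjust as needed.

```python
import json
import subprocess

# dump machine-readable status from fdbcli
raw = subprocess.run(
    ["fdbcli", "--exec", "status json"],
    capture_output=True, check=True,
).stdout
status = json.loads(raw)

# one line per process: address, configured class, recruited roles
for proc in status["cluster"]["processes"].values():
    roles = ",".join(r["role"] for r in proc.get("roles", []))
    print(proc["address"], proc.get("class_type", "?"), roles)
```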