Transaction/operation throughput

I’m attempting to benchmark FoundationDB for a write heavy workload*, and I’m struggling to max out the storage engine(s). I’m wondering if there are known limitations to transaction / operation throughput per cluster and / or if there are places I could start looking into to potentially contribute changes to improve that throughput.

I’ve followed a number of the performance related threads here, so here is a lot of detail of what I’ve come up with thus far.

I’ve been comparing the memory storage engine on r5d.2xls to the redwood engine on i3.2xls. The performance gap between the two seems pretty small, maybe 10-20% more throughput for the in-memory engine. For both the in-memory and the ssd, the rate limiter doesn’t seem to report the storage engines getting backed up at all. I do see issues with initiating batch priority transactions listed. With the i3s, I’m able to see a range of about 20-30k iops. The AWS documentation indicates those instance types can get up to 180k iops, so there should be a good deal of headroom. The write queue shows a lot more variance going from ~5 to ~2k.

Drawing inspiration from this topic, I’ve explicitly been laying out the processes. Each box has four FDB processes. Storage boxes have four storage processes, and the logs boxes have one log process, one proxy process and two resolver processes. The cluster_controller and master are added as a fifth process to two random logs boxes. After a bunch of different configurations, I ended up with 128 storage boxes (512 processes), 32 logs, 32 proxies and 64 resolvers. I’m writing to the cluster from 32 different processes.

Data seems reasonably well distributed. I don’t see any outliers on any system metric. Network traffic stays under 80 MiB/s, so I don’t think we’re getting limited there. CPU/load is pretty low for the log servers. For the storage servers, load ranges from 0.3 to 0.6 for the storage servers, and CPU usage is between 20% and 40%.

The keys are on average ~100 bytes, with a range between about 50 and 230 bytes. The keys are made up of one of two unique 5 byte prefixes, followed by a few thousand 8 byte integers (sequential from an RDB), a few (usually) English words and a random 16 byte suffix. I’ve varied the number of keys written per batch and gotten the best performance around 200.

* I’m aware that FoundationDB isn’t the best fit for some write heavy workloads, but I have yet to see another open source ordered key-value store work more efficiently. I am also optimistic by the idea of adding a RocksDB backed storage backend assuming we can get the transaction throughput we need.

I would start with turning down the number of logs, proxies, and resolvers by a lot.

Start with one resolver and try something like 3 logs and 3 proxies. That probably won’t be your final number, but that’s a good place to start. The resolver probably won’t need more than one, but just look at the CPU usage and if it is really high that will let you know.

The tlogs, proxies, and resolvers should only be scaled up when you observe a specific bottleneck in one of those components.

I would suggest measuring max iops and network throughput those machines can do, just in case

Thanks for your help! I’m not done with the various testing I have planned in response to this, but I wanted to give some updates.

With i3.2xls and the Redwood driver, I was able to get to the point where storage CPU was clearly limiting the throughput. The rate limiter was not reporting the storage engines being slow, but a number of storage nodes had CPU usage > 90 on fdbtop and normalized load > 1. With i3xls and the older SSD engine, I ended up with similar throughput but no clear bottleneck. The storage instances all had CPU < 85 and load < 1 with only a handful of nodes having load above 0.9. Write iops were still only at about 30k, so I don’t think that was the limit.

Do you happen to have an estimate of what performance impact you would expect from using ~100 byte keys and no values vs the key/value sizes used in the benchmark? I’m seeing about 1.6 M ops/sec vs a projected > 10 M ops/sec from that page.

Thanks for this suggestion. Going through this in a more principled manner, I was able to get the same performance out of the i3.2xls with the Redwood engine with ~15 logs, 15 resolvers and 8 proxies and ended up with the same throughput as with the configuration above. At that point, the performance was limited by storage engine CPU.

Unfortunately, looking for specific bottlenecks in the logs/proxies/resolvers fdbtop output / system metrics was not a very valuable guide. For example, with something like 6 logs, 3 proxies and 1 resolver, everything looked fine (almost all CPU below 85% on fdbtop, network/memory/iops were never a problem for log/stateless servers) except for the fact that the cluster was doing an order of magnitude less throughput than I had previously observed. Some of the time, I would see CPU becoming a bottleneck, but for the most part, I kept adding instances until that stopped improving performance, using CPU as a guide of which instance type to add.

I haven’t managed to measure iops just yet. We have tested the network on these boxes before and seen close to 5 Gb/s from 1 thread and 10 Gb/s with 5 thread, which is about what is advertised. I haven’t managed to get above 1.5 Gb/s on any host, so I can’t imagine network is the limiting factor at any point in time.

You should probably repeat your experiment with ssd as well. Though redwood is the future, and I greatly appreciate you testing it, it currently has no promise of stability and is very much still in-development. Its CPU usage is, in particular, a thing that will continue to be worked on, so at minimum be aware that you’re benchmarking a moving target.

That page is exceedingly old, and I think the knowledge of what was even run and how has been lost. There’s been discussion of just removing the page. I’d honestly put more trust in the numbers that you’ve seen in other threads, or if Snowflake or Wavefront folk could share their numbers as to how a well-tuned FDB-on-cloud should perform.

Use one resolver. Always use one resolver. If you are saturated on resolver CPU, then use two resolvers. I’ve honestly never seen a cluster/workload yet that requires more than two resolvers. Enough people have hit this that I’m honestly starting to be tempted to make configure resolvers=3 or higher return an error (or at least a stern warning).

Yep, I’ve been running with both. On the r5ds, the lower iops for Redwood made a difference. On the i3s, I seem to be hitting a bottleneck somewhere upstream of the storage engine.

This was very helpful. I think the issue I’m running into is that the load balancing across resolvers isn’t working properly for my use case. I ran the test with the in-memory storage engine* and only scaled up the number of resolvers when there was one or more resolvers with CPU > 95% for more than a couple minutes and ended up with 8 resolvers. I believe I saw a throughput increase for most (if not all) or those increases. Many of those involved one resolver being completely pinned (CPU > 99%) while the average resolver CPU was between 70 and 80%. The first ~9 bytes of my keys are made up of one of two prefixes and one of the prefixes receives about 25x more writes than the other. While there is substantially more entropy in the keys beyond those prefixes, could it be that all those writes are getting sent to one resolver and overwhelming it? It wouldn’t be hard to put a hash in the first byte to distribute data more evenly. If that’s likely not the issue, do you have any pointers of where I should be looking to figure out where this imbalance is happening?

* I saw ~the same throughput with the in-memory engine as the ssd engine, which seems to confirm that the bottleneck is upstream of there.

The assignments of key ranges to resolvers can be shuffled and re-sharded to try to adapt to incoming workload characteristics. Given that I’m not aware aware of anyone running with >2 resolvers in production, it would be totally plausible to me that distributing ranges across resolvers isn’t a well tuned thing.

Checking a large cluster that we have for saturation tests that’s also ~500 processes, it is configured to proxies=30 logs=30 resolvers=2. So I’m kind of surprised that your workload on a similar sized cluster also at saturation requires 4x the amount of resolver CPU power. It’d be good to double check your recruitment, and that you’re not accidentally sharing processes between resolvers and some other role?

I don’t recall ever seeing a second role present along with the resolver role. I looked out for that in my tests this morning, and I did not see the resolver process having a second role.

I tried the 30 logs, 30 proxies and 2 resolvers setup as a starting point this morning. The initial test had a ton of conflicts and was just falling apart. I reduced the number of clients (each with 4 threads creating transactions) from 64 to 16, and the FDB cluster was able to properly handle the load.

The initial workload was substantially below what I had seen in my previous tests:

Workload:
  Read rate              - 662 Hz
  Write rate             - 477558 Hz
  Transactions started   - 2477 Hz
  Transactions committed - 2393 Hz
  Conflict rate          - 2 Hz

Adding a third resolver showed an almost immediate impact:

Workload:
  Read rate              - 1345 Hz
  Write rate             - 568973 Hz
  Transactions started   - 2956 Hz
  Transactions committed - 2854 Hz
  Conflict rate          - 2 Hz

I kept adding resolvers until I no longer saw a jump in transaction throughput:

and ended up with 7 resolvers. My final workload was 2x the throughput with two resolvers but still much lower than what I had seen with fewer logs/proxies:

Workload:
  Read rate              - 732 Hz
  Write rate             - 855012 Hz
  Transactions started   - 4556 Hz
  Transactions committed - 4276 Hz
  Conflict rate          - 0 Hz

I’m wondering if I’ve hit some scaling limitation on the write side. Since I don’t know what else to try, I’m going to try randomizing the prefix to see if that helps at all. Any advice on where I could start investigating on the FDB side would be greatly appreciated.

The random keys did not help at all. My next thought is to profile the resolver for my use case, but I’m more than open to other suggestions / ideas you might have.

Well, congrats on running a more resolver-intensive workload than I’ve seen before. I think basically what you’re doing of hill climbing/newton’s method’ing your way to your optimal number of logs/proxies/resolvers is better than whatever advice I’ll give. As you’re approaching 1mil writes per second, it is totally possible that you’re hitting a current throughput limitation. You’re doing ~4k transactions per second, each of doing 200 writes of 100 byte keys and 1000 byte values? That adds up to something a bit under 1GB/s of writes, which I think is about the upper bound of anything I’ve heard someone report of doing with FDB so far.

If there’s not an obvious case of one class of role being CPU/network/disk saturated, then the debugging here rapidly escalates in technical complexity. If I were to start to debug this, I’d go look at the CommitDebug trace events, to piece together what part of the system is being the slowest for you, over time, and then go dig into why. Ideally, if you look at all the messages with the same debugID, there will be some obvious piece that is consistently taking longer. It could either turn out that it’s a role (all resolvers are taking ~3s) to respond, or a single process (transaction log 0 is responding 3s after all other logs, consistently). And then we branch out based on what role seems suspicious.

1 Like

One thing to look out for is that the IDs for a single request do change at various points in the commit process due to things like batching. You can piece together the sequence of ID relations using CommitAttachID events, but if you aren’t explicitly enabling transaction tracing on any of your transactions, then the internally traced transactions are 10 seconds apart and you can distinguish different sequences by when they ran.

In other words, assuming commits are taking somewhat less than 10 seconds, you can look for all of the CommitDebug messages that are close together in time with these two starting and ending Location fields:

Location="NativeAPI.commit.Before"
Location="NativeAPI.commit.After"