Benchmarking FoundationDB on AWS

The FoundationDB scaling benchmark shows impressive performance on an AWS deployment using an EC2 c3.8xlarge cluster. I’m trying to replicate this setup, but so far I have failed to get similar levels of performance out of AWS.

Would it be possible to share more details about the setup that would allow achieving similar levels of performance on AWS?

  • choice of storage engine
  • replication factor
  • distribution of nodes between the availability zones
  • changes in FDB configs
  • node storage (local instance store, EBS (gp2/io1), EnhanceIO, etc.)
  • workload setup - number of machines, level of concurrency and the driver language
  • network performance tuning for FDB

Any hints would be appreciated!

Yes, I’ve also benchmarked FoundationDB with go-ycsb on our private cloud, with both “single” and “triple” replication and the memory storage engine, but I only get ~38K QPS there versus ~69K QPS on my local laptop. The following is my config:

1. This is my go-ycsb workload (100% read):

recordcount=10000000
operationcount=10000000
workload=core
readallfields=false
writeallfields=true
readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0
requestdistribution=zipfian
fieldcount=1
fieldlength=20

2. These are my results. I used two machines to run go-ycsb independently and got almost the same result on each.

./bin/go-ycsb run foundationdb -P workloads/workloadc -p fdb.cluster=./fdb.cluster -p threadcount=1000
READ - Takes(s): 10.0, Count: 391855, QPS: 39154.3, Avg(us): 25486, Min(us): 6377, Max(us): 88940, 95th(us): 58000, 99th(us): 69000
READ - Takes(s): 20.0, Count: 787808, QPS: 39374.7, Avg(us): 25376, Min(us): 6377, Max(us): 88940, 95th(us): 60000, 99th(us): 68000
READ - Takes(s): 30.0, Count: 1153112, QPS: 38426.9, Avg(us): 25980, Min(us): 6377, Max(us): 94534, 95th(us): 61000, 99th(us): 72000
READ - Takes(s): 40.0, Count: 1528051, QPS: 38193.7, Avg(us): 26169, Min(us): 6377, Max(us): 94534, 95th(us): 61000, 99th(us): 73000
READ - Takes(s): 50.0, Count: 1891968, QPS: 37833.4, Avg(us): 26422, Min(us): 3928, Max(us): 94534, 95th(us): 62000, 99th(us): 74000
READ - Takes(s): 60.0, Count: 2262222, QPS: 37698.7, Avg(us): 26508, Min(us): 3928, Max(us): 94534, 95th(us): 62000, 99th(us): 73000
READ - Takes(s): 70.0, Count: 2638403, QPS: 37687.1, Avg(us): 26520, Min(us): 3928, Max(us): 94534, 95th(us): 62000, 99th(us): 74000
READ - Takes(s): 80.0, Count: 2985904, QPS: 37320.1, Avg(us): 26785, Min(us): 3928, Max(us): 97336, 95th(us): 62000, 99th(us): 75000
READ - Takes(s): 90.0, Count: 3337194, QPS: 37076.7, Avg(us): 26956, Min(us): 3928, Max(us): 97336, 95th(us): 63000, 99th(us): 75000
READ - Takes(s): 100.0, Count: 3680602, QPS: 36803.1, Avg(us): 27162, Min(us): 3928, Max(us): 97336, 95th(us): 63000, 99th(us): 76000
READ - Takes(s): 110.0, Count: 4023143, QPS: 36571.4, Avg(us): 27327, Min(us): 3928, Max(us): 262296, 95th(us): 63000, 99th(us): 76000
READ - Takes(s): 120.0, Count: 4387477, QPS: 36559.9, Avg(us): 27344, Min(us): 3928, Max(us): 262296, 95th(us): 63000, 99th(us): 76000
READ - Takes(s): 130.0, Count: 4751617, QPS: 36548.7, Avg(us): 27353, Min(us): 3928, Max(us): 262296, 95th(us): 63000, 99th(us): 76000
READ - Takes(s): 140.0, Count: 5110191, QPS: 36499.3, Avg(us): 27385, Min(us): 3928, Max(us): 262296, 95th(us): 64000, 99th(us): 76000
READ - Takes(s): 150.0, Count: 5457514, QPS: 36381.5, Avg(us): 27479, Min(us): 3928, Max(us): 262296, 95th(us): 64000, 99th(us): 76000
READ - Takes(s): 160.0, Count: 5801829, QPS: 36259.6, Avg(us): 27570, Min(us): 3928, Max(us): 262296, 95th(us): 64000, 99th(us): 76000
READ - Takes(s): 170.0, Count: 6151706, QPS: 36184.8, Avg(us): 27627, Min(us): 3928, Max(us): 262296, 95th(us): 64000, 99th(us): 76000
READ - Takes(s): 180.0, Count: 6507109, QPS: 36149.0, Avg(us): 27651, Min(us): 3928, Max(us): 262296, 95th(us): 64000, 99th(us): 76000
READ - Takes(s): 190.0, Count: 6832262, QPS: 35957.8, Avg(us): 27803, Min(us): 3928, Max(us): 262296, 95th(us): 65000, 99th(us): 77000
READ - Takes(s): 200.0, Count: 7173362, QPS: 35865.4, Avg(us): 27868, Min(us): 3928, Max(us): 262296, 95th(us): 65000, 99th(us): 77000
READ - Takes(s): 210.0, Count: 7531922, QPS: 35864.9, Avg(us): 27875, Min(us): 3928, Max(us): 262296, 95th(us): 65000, 99th(us): 77000
READ - Takes(s): 220.0, Count: 7882459, QPS: 35828.1, Avg(us): 27904, Min(us): 3928, Max(us): 262296, 95th(us): 65000, 99th(us): 77000
READ - Takes(s): 230.0, Count: 8226143, QPS: 35764.6, Avg(us): 27952, Min(us): 3928, Max(us): 262296, 95th(us): 66000, 99th(us): 77000
READ - Takes(s): 240.0, Count: 8575813, QPS: 35731.4, Avg(us): 27980, Min(us): 3928, Max(us): 262296, 95th(us): 66000, 99th(us): 77000
READ - Takes(s): 250.0, Count: 8960553, QPS: 35841.1, Avg(us): 27894, Min(us): 3928, Max(us): 262296, 95th(us): 66000, 99th(us): 77000
READ - Takes(s): 260.0, Count: 9311212, QPS: 35811.3, Avg(us): 27918, Min(us): 3928, Max(us): 262296, 95th(us): 66000, 99th(us): 77000
READ - Takes(s): 270.0, Count: 9662951, QPS: 35787.7, Avg(us): 27935, Min(us): 3928, Max(us): 262296, 95th(us): 66000, 99th(us): 78000
Run finished, takes 4m39.745710336s
READ - Takes(s): 279.7, Count: 10000000, QPS: 35749.8, Avg(us): 27952, Min(us): 1045, Max(us): 262296, 95th(us): 66000, 99th(us): 78000

3. This is the detailed status of the cluster and each FDB instance:

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
Redundancy mode - triple
Storage engine - memory
Coordinators - 3

Cluster:
FoundationDB processes - 3
Machines - 3
Memory availability - 7.6 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - 0 machines (1 without data loss)
Server time - 05/14/18 18:22:56

Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 226 MB
Disk space used - 2.727 GB

Operating space:
Storage server - 493.6 GB free on most full server
Log server - 493.6 GB free on most full server

Workload:
Read rate - 81502 Hz
Write rate - 1 Hz
Transactions started - 80006 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
10.5.0.23:4500 ( 40% cpu; 5% machine; 0.027 Gbps; 1% disk IO; 0.8 GB / 7.6 GB RAM )
10.5.0.52:4500 ( 41% cpu; 5% machine; 0.028 Gbps; 1% disk IO; 0.8 GB / 7.6 GB RAM )
10.5.0.55:4500 ( 42% cpu; 5% machine; 0.026 Gbps; 1% disk IO; 0.9 GB / 7.6 GB RAM )

Coordination servers:
10.5.0.23:4500 (reachable)
10.5.0.52:4500 (reachable)
10.5.0.55:4500 (reachable)

Client time: 05/14/18 18:22:56

4. I used all three FDB instances as coordinators.

I haven’t figured out where the bottleneck is (the FDB Go client or go-ycsb).

@cih.y2k thank you for the reference to go-ycsb. I didn’t know there was a load generator supporting FoundationDB.

For reference, this is the kind of performance I’m getting out of FDB (a transaction is 1 key lookup and 2 writes). It is probably bottlenecked at the load tester at this point.

It doesn’t look like the cluster is saturated in this test, so, as you indicate, your limit is likely somewhere on the client side. To benchmark the throughput of the cluster, you would need to drive sufficient load, probably with several clients (see Developer Guide — FoundationDB 7.1).

In a FoundationDB client, there is currently a single thread (called the network thread) which serializes the client operations, does the FDB related client-side processing, and handles the network communication. It’s possible for your client to saturate this thread, at which point you’ve reached the maximum throughput of an individual client. It’s also possible that if you don’t have enough concurrency on the client, you could instead be latency bound. In that case, you could try turning up the concurrency within your clients as well.
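
To illustrate the concurrency point, here is a minimal sketch in Go (assuming the official fdb Go binding; the API version, key layout, and worker counts are made up for the example) of fanning reads out over goroutines that share one database handle. All of these goroutines still funnel through the single network thread, so past some point adding goroutines stops helping and you need additional client processes.

package main

import (
	"fmt"
	"sync"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	fdb.MustAPIVersion(520)     // assumption: match this to your client library version
	db := fdb.MustOpenDefault() // reads the default fdb.cluster file

	const workers = 200 // concurrent goroutines, all sharing one network thread
	const opsPerWorker = 1000

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < opsPerWorker; i++ {
				// hypothetical key layout: bench/<worker>/<op>
				key := fdb.Key(fmt.Sprintf("bench/%d/%d", id, i))
				_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
					return tr.Get(key).Get() // one single-key read per transaction
				})
				if err != nil {
					fmt.Println("read failed:", err)
				}
			}
		}(w)
	}
	wg.Wait()
}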

I don’t know the exact details of the test that was run for this, though someone else here may. I can say, though, that the test used our built-in performance testing capabilities. Evan posted a summary of how to use them in this forum post. The test that was run in the scaling benchmark was probably the RandomReadWrite test linked to by Evan, but with some different parameters. If you go this route, you’ll want to make sure to have sufficient clients (i.e. the test class processes) to saturate the cluster if you’re trying to get a sense of cluster throughput.
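
For a rough idea of what such a test spec looks like, here is a sketch modeled on the ReadWrite specs shipped in the FoundationDB source tree; treat the parameter names and values below as assumptions to be checked against Evan’s post and the tests/ directory, not as a canonical file:

testTitle=RandomReadWriteTest
    testName=ReadWrite
    testDuration=60.0
    transactionsPerSecond=100000
    readsPerTransactionA=8
    writesPerTransactionA=2
    alpha=0
    nodeCount=10000000
    valueBytes=100

The file is then fed to fdbserver processes running in the tester role, as described in Evan’s post, with enough of them running to saturate the cluster.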

To emphasize this point, do note that go-ycsb currently only uses a single client, so you’re limited to how much work one thread can generate. This unfortunately means that in its current state, benchmarking real clusters with go-ycsb will likely be more limited by the client than the database, as AJ points out.

I had a large offline reply to siddon about the go-ycsb benchmarking, which I should probably go clean up and post publicly. And hopefully he’ll have some more time available sometime to continue to improve go-ycsb and its ability to benchmark FoundationDB. :slight_smile:

By the way, does go-ycsb deal with coordinated omission?

Today I discovered a pretty nasty side effect of having conflicts at high throughput, after being reminded about CO and switching the load tester to generate transactions at a fixed frequency. The bottleneck (which looked like a tiny latency spike before) blew up like this in the benchmark at 20 kHz:

If you look at the run loop for the go-ycsb worker, it appears to run an operation every $n$ ms, or immediately if one took longer than $n$ ms. That would imply that it is afflicted by coordinated omission. It’d hopefully be a relatively easy change to instead just fork off another goroutine for a request every $n$ ms, if one isn’t already available, which would solve the issue.
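
Something along these lines is what I have in mind (a sketch only, not go-ycsb’s actual worker; the operation function and rates are placeholders): fire each operation in its own goroutine on a fixed schedule, so a slow request never delays the requests behind it, and measure latency from the scheduled start time.

package main

import (
	"fmt"
	"time"
)

// issueAtFixedRate launches doOp in a new goroutine on every tick, so a slow
// operation never holds back the issue schedule (avoiding coordinated
// omission). Latency is measured from the tick that scheduled the operation.
func issueAtFixedRate(interval time.Duration, total int, doOp func() error, latencies chan<- time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < total; i++ {
		tick := <-ticker.C // the scheduled start time for this operation
		go func(start time.Time) {
			_ = doOp()
			latencies <- time.Since(start)
		}(tick)
	}
}

func main() {
	const total = 10000
	latencies := make(chan time.Duration, 1024)

	go issueAtFixedRate(time.Millisecond, total, func() error {
		time.Sleep(2 * time.Millisecond) // stand-in for a real FDB transaction
		return nil
	}, latencies)

	for i := 0; i < total; i++ {
		<-latencies // in a real tester, feed these into a latency histogram
	}
	fmt.Println("done")
}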

I’m sure siddon would appreciate an issue filed against go-ycsb, if you don’t mind filing one. :slight_smile:

@alexmiller, @ajbeamon I didn’t realize that the FDB Go client uses only one network thread. So maybe it makes sense to use multiple clients in one application.

Oh, maybe I will update my go-ycsb to support multiple clients for better benchmarking now.

@alexmiller, it appears that go-ycsb is a port of the original Java project, which has had a discussion about coordinated omission in progress since 2016: https://github.com/brianfrankcooper/YCSB/issues/731

They are discussing correcting for CO rather than avoiding it, though (which is probably a trickier approach than simply launching goroutines on a schedule).

As a side note, this is a property of our native client (libfdb_c), so it actually applies to all of our bindings.

I was able to push a cluster of 15x i3.large machines to handle 35k event store append transactions per second.

Lessons learned:

  1. Model the layer to inject keys sequentially. This significantly reduces the load on a FoundationDB cluster that is bottlenecked on IO (e.g. one that uses EBS io1 volumes); see the sketch after this list.
  2. Use storage-optimized instances with a fast NVMe SSD and let FoundationDB handle fault tolerance.
  3. Launching a second FDB process on a VM with 2 cores might increase throughput a bit more (especially if the cluster is bottlenecked on CPU).
  4. Running the load tester without coordinated omission is more brutal and honest to the database: if it goes down under the load, it would probably not come back. This speeds up benchmarking iterations.
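
For point 1, one way to get sequential key injection is with versionstamped keys; here is a minimal sketch using the Go binding (the "events" prefix, payload, and API version are made up for the example, and it assumes a binding version with tuple versionstamp support). Because the versionstamp is derived from the monotonically increasing commit version, appends always land at the tail of the keyspace instead of scattering writes across it:

package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func main() {
	fdb.MustAPIVersion(520) // assumption: match this to your client library version
	db := fdb.MustOpenDefault()

	// Append one event. The versionstamp placeholder is filled in at commit
	// time, so successive commits produce strictly increasing keys under the
	// "events" prefix rather than random inserts across the keyspace.
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		key, err := tuple.Tuple{"events", tuple.IncompleteVersionstamp(0)}.PackWithVersionstamp(nil)
		if err != nil {
			return nil, err
		}
		tr.SetVersionstampedKey(fdb.Key(key), []byte("event payload"))
		return nil, nil
	})
	if err != nil {
		log.Fatal(err)
	}
}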

Here is the performance chart from one of the load agents (50% of load):

This is similar to what we’re using in production (though more often with i3.16xlarge) with some form of caching to use the NVMe instance stores to back EBS block devices.

It’s about 50k read/writes a second.

What is your workload like in terms of keys read/written per transaction and bytes read/written? Also, are you reading/writing a lot of keys in small ranges in each transaction, or all over the keyspace?

Thanks, @alexmiller. Would you mind sharing the progress on fixing go-ycsb? I use go-ycsb for benchmarking and also get a similar performance level of about 40K QPS.

@TommyLike, looking through the commit history, I don’t see any changes to go-ycsb that would add multiple-process support. However, you’d be best off getting in touch with the go-ycsb developers for questions like this.

@alexmiller, many thanks for your guidance. I checked the commit history myself and had no luck either, but I found another tool, fdb-test-cluster, which can easily benchmark a cluster with the native tests via Terraform and Packer on AWS. Thanks!

For what it’s worth, we recently moved off of EBS volumes due to their volatile nature. We were having serious problems with flushing to disk, which caused FDB to eventually fall over once memory was exhausted.

We recently moved over to c5d.2xlarge instances, which provide a dedicated NVMe drive, and performance has been a whole lot better and more consistent.

Our write load is 4-6k TPS and our disks are consistently at 0-5% utilization.