Scripts to deploy, benchmark, and tinker with 1M operations/sec FoundationDB cluster on AWS

There’s a lack of documentation on how to reproduce the results of official benchmark from the docs and I see that many developers are getting only a fraction of that performance. I think it is a good idea to make a reproducible reference cluster deployment that produces good benchmark results, it can be a good starting point for your custom cluster and is a great tool to understand how to tune fdb performance.

https://github.com/poma/fdb-test-cluster

These terraform and packer scripts deploy a high performance fdb cluster with benchmarking and monitoring tools. By default the cluster consists of 8x m5d.2xlarge (8 cores, NVMe SSD) fdb instances with double ssd engine, plus 4x tester instances of the same size. Scripts work by creating a base AMI and then deploying a cluster of them with custom config files. It allows to quickly deploy, banchmark, tune, and tear down a cluster so you only pay for a short uptime, the default cluster should cost around $6/hr. For detailed deployment info see readme in the repo. Scripts are based on this thread.

Performance is tested using built-in fdb testing tools, to reduce possibility of inaccurate results due to inefficiencies in the test software. To get the best performance out of your cluster it is better to treat fdb as a set of microservices that each have their own resource requirements and need a separate monitoring and scaling. Optimal role ratios and placement depend on your workload, machines, and topology so there is no golden rule, you need to actually test it and figure out bottlenecks. You can use included fdbtop tool monitor performance of each process.

Here are some specific features of the cluster:

  • all process roles are specified explicitly
  • log and storage roles are placed on different machines so that they don’t compete for the same disk so that logs have low latency
  • the write subsystem processess (master, proxies, resolvers, logs) are placed close together for faster network communication. If your cluster is limited by network bandwidth, it might be better to do the opposite and spread out proxies and logs so that you saturate more network links.
  • AWS NVMe disks are fast enough (~57k write iops for m5d.2xlarge) that they require multiple storage processes to saturate them even with SSD storage engine, especially on a read heavy workload. In addition, on AWS disk performance scales with core count. If you have slower or network disks you might want to reduce the amount of storage processes per disk down to 1-2.
  • master role uses a separate process to make sure it has enough resources to generate commit versions with low latency
  • although overall throughput can be increased by adding storage processes to log instances (their disk is not saturated), it may degrade latency in some workloads

I’m still in the process of figuring out how to better configure the cluster so any feedback or improvements to my configs are greatly appreciated.

7 Likes

Nice work and thanks for contributing that. I’ll definitely check it out this evening and looking forward to toying with it.

Hi Roman,

I wanted to play with it since coming across it a couple days. I got a brand new AWS account to run this against (you’re never too careful ;-)) and got everything up and working as expected.
I’m happy to contribute with a tiny PR to make the AWS Region configurable (you have it hard-coded to us-west-1).
It’s very cool how you used packer and terraform to assist with all the tedious setup. I’ve used it in the past but stopped using it in favour of AWS Launch Template / CF and GCP Template Manager.
I was a bit surprised to see how you organised all your stateless and transaction processes onto 2 instances, and then had 6 instances with 8 cores running storage. You say it’s pretty rare to max-out the NVME disks - even with 8 storage processes!
Is that configuration something you would run in PROD or more a starting point?
I’m waiting for some EC2 limits to be raised in my account to run some real tests… but i can see how useful that project can be.
Thank you Roman for your contribution! I’ll be in touch I’m sure.
Cheers

8 storages are used because the default test is read heavy (9r+1w). With more writes it quickly becomes enough to use only 2 storage processes.

Hey @poma,
Thanks for your repo for rapid cluster deployment, I built the same foundationDB cluster with the default config, and tried several RandomReadWrite tests, here are several test config and results.

100 % Read

Config

# FDB test definition
testTitle=RandomReadWriteTest
testName=ReadWrite
# you generally want to run the test at least 2 minutes
# to avoid edge effects when log servers are not yet
# continuously transferring data to storage servers
# ("burst" performance on short tests will be higher
# than "sustained" performance)
testDuration=300
# set max transactions lower if you want to measure
# latencies on a non-saturated cluster
transactionsPerSecond=500000
writesPerTransactionA=0
readsPerTransactionA=10
writesPerTransactionB=0
readsPerTransactionB=10
# transactions A/B
alpha=0.1
# key count and value min/max sizes
nodeCount=10000000
valueBytes=1000
minValueBytes=200
# misc
discardEdgeMeasurements=false
warmingDelay=5.0
timeout=300000.0
databasePingDelay=300000.0

Result (result has been sumed or averaged)

There are 32 tests in total
Measured Duration: 300.0
Operations/sec: 1724545.467 
Transactions/sec: 172454.547 
90% Latency (ms, averaged): 839.544
98% Latency (ms, averaged): 916.317
Max Latency (ms, averaged): 1070.031
Max Commit Latency (ms, averaged): 0.000
Bytes read/sec: 1062320007.467
Bytes written/sec: 0.000
Retries: 0.0
Mean Latency (ms): 722.394.

95%Read&5%Write

Config

# FDB test definition
testTitle=RandomReadWriteTest
testName=ReadWrite
# you generally want to run the test at least 2 minutes
# to avoid edge effects when log servers are not yet
# continuously transferring data to storage servers
# ("burst" performance on short tests will be higher
# than "sustained" performance)
testDuration=300
# set max transactions lower if you want to measure
# latencies on a non-saturated cluster
transactionsPerSecond=500000
writesPerTransactionA=0
readsPerTransactionA=10
writesPerTransactionB=10
readsPerTransactionB=0
# transactions A/B
alpha=0.05
# key count and value min/max sizes
nodeCount=10000000
valueBytes=1000
minValueBytes=200
# misc
discardEdgeMeasurements=false
warmingDelay=5.0
timeout=300000.0
databasePingDelay=300000.0

Result (result has been sumed or averaged)

There are 32 tests in total
Measured Duration: 300.0
Operations/sec: 869872.667 
Transactions/sec: 86987.267 
90% Latency (ms, averaged): 2015.794
98% Latency (ms, averaged): 2569.838
Max Latency (ms, averaged): 6028.235
Max Commit Latency (ms, averaged): 558.016
Bytes read/sec: 509086752.533
Bytes written/sec: 26754810.133
Retries: 0.0
Mean Latency (ms): 1428.943.

50%Read & 50%Write

Config

# FDB test definition
testTitle=RandomReadWriteTest
testName=ReadWrite
# you generally want to run the test at least 2 minutes
# to avoid edge effects when log servers are not yet
# continuously transferring data to storage servers
# ("burst" performance on short tests will be higher
# than "sustained" performance)
testDuration=300
# set max transactions lower if you want to measure
# latencies on a non-saturated cluster
transactionsPerSecond=500000
writesPerTransactionA=10
readsPerTransactionA=0
writesPerTransactionB=0
readsPerTransactionB=10
# transactions A/B
alpha=0.5
# key count and value min/max sizes
nodeCount=10000000
valueBytes=1000
minValueBytes=200
# misc
discardEdgeMeasurements=false
warmingDelay=5.0
timeout=300000.0
databasePingDelay=300000.0

Result (result has been sumed or averaged)

There are 32 tests in total
Measured Duration: 300.0
Operations/sec: 97697.533 
Transactions/sec: 9769.753 
90% Latency (ms, averaged): 18885.821
98% Latency (ms, averaged): 22428.299
Max Latency (ms, averaged): 30240.803
Max Commit Latency (ms, averaged): 1682.325
Bytes read/sec: 30104392.267
Bytes written/sec: 30077288.267
Retries: 0.0
Mean Latency (ms): 12450.510.

My question is why the latency in these results are quite different to the official [performance] (https://apple.github.io/foundationdb/performance.html), it’s about 1000
times slower, Is there something wrong in the tests?

The one thing that stands out to me is that your tests are specifying a transaction rate that they aren’t achieving, which suggests that they’re saturating the cluster. To measure meaningful latencies, you’ll want to run at a rate lower than the max throughout of the cluster. For example, you could try choosing a transaction rate that is half the throughput your saturating test achieved and measure latencies there. Or even better, you could measure latencies at different rates to get a sense for how latency responds to load up to the saturation point.

Thanks @ajbeamon, so if I want to get the correct latency, what’s the correct transaction rate I need? According to the Throughput(per core) part on performance blog, assume my cluster have 48 storage processes, I use SSD storage engine and when test there are 10 operations in one transaction, the transactionsPerSecond should be 50k*48/10=24000?

The output of the test includes a metric for the transaction throughput. For example, in 50/50 test, it achieved ~9770 transactions per second:

Transactions/sec: 9769.753

You could start by taking this number as the saturation point and then choose your target rate relative to that. There’s not a particular correct answer here, as you could choose various percentages of that saturation number. Or if you are targeting a particular throughput, you could adjust the size and configuration of your cluster until it’s capable of handling that load while producing the latencies you want.

Got it, thanks :smile:

Hey @poma, here again I met another issue when using the fdb-test-cluster, here below is my environment,
Cluster Instance: 11 * md5,xlarge (4vCPU, 16GiB and 500 SSD, 8 are purely used as storage class) and others are used as:

ip            port    cpu%  mem%  iops  net  class                 roles
------------  ------  ----  ----  ----  ---  --------------------  --------------------
 10.0.1.101    4500    5     3     -     3    cluster_controller    cluster_controller
               4501    1     13    8     0    transaction           log
               4502    0     4     -     0    transaction
               4503    1     3     -     0    resolution            resolver
------------  ------  ----  ----  ----  ---  --------------------  --------------------
 10.0.1.102    4500    3     3     -     1    master                master
               4501    0     3     -     0    transaction
               4502    1     16    8     0    transaction           log
               4503    0     2     -     0    resolution            resolver
------------  ------  ----  ----  ----  ---  --------------------  --------------------
 10.0.1.103    4500    1     2     -     0    proxy                 proxy
               4501    1     2     -     0    proxy                 proxy
               4502    1     2     -     0    proxy                 proxy
               4503    1     2     -     0    proxy                 proxy

Tester Instance: 4 * md5,xlarge (4vCPU, 16GiB and 500 SSD)
and when I try to run the RandomReadWrite tests (make multitest) with following configuration:

# FDB test definition
testTitle=RandomReadWriteTest
testName=ReadWrite
# you generally want to run the test at least 2 minutes
# to avoid edge effects when log servers are not yet
# continuously transferring data to storage servers
# ("burst" performance on short tests will be higher
# than "sustained" performance)
testDuration=200
# set max transactions lower if you want to measure
# latencies on a non-saturated cluster
transactionsPerSecond=240000
writesPerTransactionA=0
readsPerTransactionA=2
writesPerTransactionB=0
readsPerTransactionB=2
# transactions A/B
alpha=0.5
# key count and value min/max sizes
keyBytes=32
nodeCount=10000000
valueBytes=1000
minValueBytes=1000
# misc
discardEdgeMeasurements=false
warmingDelay=5.0
timeout=300000.0
databasePingDelay=300000.0
setup=true

it seems the output will get stuck at this even after about 80 minutes:

terraform init

Initializing provider plugins...

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
setting up test (RandomReadWriteTest)...

the set up process should be finished since I tried to show the fdb status before I cancel the operation and it all of the cpus and iops in the cluster are very low and they status is:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 4
  Desired Resolvers      - 2
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 60
  Machines               - 15
  Memory availability    - 3.8 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 1 machine
  Server time            - 11/03/18 02:39:12

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 10.348 GB
  Disk space used        - 28.539 GB

Operating space:
  Storage server         - 488.9 GB free on most full server
  Log server             - 1.0 GB free on most full server

Workload:
  Read rate              - 20 Hz
  Write rate             - 1 Hz
  Transactions started   - 7 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

so, is this because of something wrong of ssh or just because it really needs a lot of time to prepare the database?

Your dataset should take around 10Gb and be created in reasonable time (under 1 min I think). ssh is used only to post command to fdb, it should perform data initialization and tests without ssh.

Maybe it is some weird problem like running out of disk space on log servers or something. Try to tweak around parameters and also look at request rate by running watch fdbcli --exec status while data is initializing and notice when the load drops.

@poma, yeah, there could be a warning in the status command:

 Performance limited by process: Log server running out of space (approaching 100MB limit).
  Most limiting process: 10.0.1.102:4502

but when I log in the instance and show the disk usage:

Filesystem      Size  Used Avail Use% Mounted on
udev            7.7G     0  7.7G   0% /dev
tmpfs           1.6G  8.8M  1.6G   1% /run
/dev/nvme0n1p1  485G  3.9G  481G   1% /
tmpfs           7.7G  4.0K  7.7G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/loop0       88M   88M     0 100% /snap/core/5548
/dev/loop1       17M   17M     0 100% /snap/amazon-ssm-agent/734
/dev/loop2       88M   88M     0 100% /snap/core/5742
/dev/loop3       17M   17M     0 100% /snap/amazon-ssm-agent/784
tmpfs           1.6G     0  1.6G   0% /run/user/1000

how come the warning coming out?

The fact that it says there is 1 GB free on your most full log server and that you are approaching the 100MB limit rather than the 5% limit appears suspiciously like it is using the memory storage engine. Status is reporting that you are using the ssd storage engine, though, so that’s a bit strange.

Assuming you are certain that the disk being used by the log server has more than 1 GB free, maybe you could try restarting all the processes in your cluster to see if that helps (you can do this by running kill and then kill all in fdbcli). Ideally, you should see the operating space reported for the log servers go up to reflect how much space is available on disk.

If we are hitting some sort of bug here, it’d be helpful if we could reproduce it ourselves. Can you provide some details about how you are setting up the cluster that we could try to follow? Do you know if this is something you can reliably reproduce from scratch?

I’ve reproduced this myself and am looking into it. See Help me understand this status output, where I’ll provide any updates.