The cluster consists of 35 nodes, each with 2 CPU cores and 8 GB of memory, and each node runs a single fdbserver process. Of these, 18 are storage processes, 4 are transaction (tlog) processes, 1 is a grv_proxy, 3 are commit_proxies, and 3 are coordinators; the remaining roles are set to one process each. The database was initialized with triple replication and the ssd-redwood-1 storage engine, and each node's configuration file was initialized as follows:
[fdbmonitor]
user = foundationdb
group = foundationdb
[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster
[fdbserver]
command = /usr/sbin/fdbserver
public_address = ${SELF_IP}:${FDB_PORT}
listen_address = 0.0.0.0:${FDB_PORT}
datadir = /data0/foundationdb/data/${FDB_PORT}
logdir = /data0/foundationdb/log
memory = 7GiB
cache-memory = 4GiB
logsize = 10MiB
maxlogssize = 100MiB
memory-vsize = 0
locality-zoneid = ${CNP_ZONE}-${SELF_IP}
locality-data-hall = ${CNP_REGION}
io-trust-seconds = 20
[fdbserver.${FDB_PORT}]
class = ${FDB_CLASS_DEFAULT}
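For reference, a rough sketch of the fdbcli commands that produce a configuration like this (the exact engine name and syntax can vary by FoundationDB version; the counts simply mirror the description above):

fdb> configure new triple ssd-redwood-1
fdb> configure grv_proxies=1 commit_proxies=3 logs=4 resolvers=1
fdb> coordinators auto

The storage and stateless counts are not set here; they follow from how many processes carry those classes.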
During the YCSB benchmark, the key size was set to 128 and the value size to 1024. There were 3 clients, and each client ran the following workload:
The benchmark results are shown below. Why does neither adding more storage processes nor adding more grv_proxies achieve the linear scaling shown in the official documentation? How should I configure the system to improve throughput, and which role should I scale out for further testing?
Feel free to hit us up on Discord (where I believe you'll get help faster). It seems like your range reads are scaling linearly but your writes are not, so the question is whether you also scaled up your tlogs (linearly) during the benchmark. You can also see what's being saturated if you separate the roles by pinning them onto processes: running a bank of storage-only processes, tlog-only processes, and stateless-only processes, and then setting the counts of each role (which then get mapped onto them), is critical for understanding where the bottleneck is during a benchmark. That, and of course dumping the fdbcli status output as JSON regularly during the run.
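As a minimal sketch of what that looks like (the ports, class layout, and 10-second interval here are placeholders, not a recommendation): pin each fdbserver section to a single class in foundationdb.conf,

[fdbserver.4500]
class = storage

[fdbserver.4510]
class = transaction

[fdbserver.4520]
class = stateless

then set the role counts in fdbcli to match and snapshot the JSON status periodically while the workload runs:

fdb> configure logs=4 commit_proxies=3 grv_proxies=1 resolvers=1
$ while true; do fdbcli --exec 'status json' > status_$(date +%s).json; sleep 10; done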
We specified process classes in the .conf file so that each process assumes a distinct role, preventing a single node from taking on multiple roles, and then increased the counts of specific roles for benchmarking. However, we ran into a problem. After scaling the storage count to 1,000, the transaction count to 50, the resolution count to 10, the grv_proxy count to 50, the commit_proxy count to 50, and the stateless count to 10, status showed the following output for ten hours without changing, while the disk space used kept increasing:
fdb> status
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 5
  Desired Commit Proxies - 50
  Desired GRV Proxies    - 50
  Desired Resolvers      - 10
  Desired Logs           - 50
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1170
  Zones                  - 1170
  Machines               - 1170
  Memory availability    - 7.0 GB per process on machine with least available
  Fault Tolerance        - 2 zones
  Server time            - 09/25/25 09:56:14

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 1.317 TB
After checking the status json output, I found that the data_distributor role frequently moves between stateless processes, and CPU usage is almost always above 90%. What is causing this, and how can I fix it and restore the database to a healthy state?
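For reference, a sketch of how the data_distributor placement can be watched from the status JSON (assuming jq, and that the role is reported under cluster.processes[].roles):

$ fdbcli --exec 'status json' \
    | jq -r '.cluster.processes[] | select(any(.roles[]?; .role == "data_distributor")) | "\(.address)  cpu_cores=\(.cpu.usage_cores)"'

Running this every few seconds shows the role hopping between processes.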
That is a lot of fdbserver processes: you said 35 nodes in the first message, and now you're up to 1k. Was it scaling linearly before you went to 1k storage nodes? The numbers for commit proxies, GRV proxies, resolvers, and logs seem arbitrary as well. My suggestion would be to check for saturation of each of the roles as you scale up your qps (note that fdb can absorb spikes until the storage servers start lagging, so you want to test sustained throughput). Once you understand how the YCSB workload (in terms of write qps) maps onto each role, you should see where the scaling becomes sub-linear as you increase the number of processes relative to write qps.
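Something along these lines (a rough sketch, assuming jq and the machine-readable status layout) will list each process with its roles and CPU so you can watch which role saturates first:

$ fdbcli --exec 'status json' \
    | jq -r '.cluster.processes[] | [.address, ((.roles // []) | map(.role) | join("+")), .cpu.usage_cores] | @tsv' \
    | sort -t$'\t' -k3,3 -rn \
    | head -20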
Also, with that number of processes, Linux/kernel tuning becomes a thing. You might want to check your syslog, and definitely make sure you're using the latest network drivers (even if you're running in the cloud).