Scaling issues with FDB for write throughput

Hi,

We are not able to scale throughput as we scale out FDB. We are running a 14-node cluster: 3 i3.xl and 11 i3.4xl, where the i3.xl nodes are used only to run transaction processes. Our workload is currently write heavy. When we hit 100K RPS in the cluster, storage lag increases and the cluster goes into an unhealthy state.

Adding some more info.

Added 2 more storage nodes, now we have a 16 node FDB cluster.
3 i3.xl (Tx nodes) + 13 i3.4xl (SP nodes)

FoundationDB 6.2 (v6.2.15)

Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6
Desired Resolvers - 6
Desired Logs - 6

Cluster:
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Retransmissions rate - 3 Hz
Fault Tolerance - 1 machine
Server time - 09/03/20 04:43:55

Data:
Replication health - Healthy (Repartitioning.)
Moving data - 0.549 GB
Sum of key-value sizes - 574.755 GB
Disk space used - 7.380 TB

Operating space:
Storage server - 1182.3 GB free on most full server
Log server - 824.9 GB free on most full server

Workload:
Read rate - 43579 Hz
Write rate - 95600 Hz
Transactions started - 277 Hz
Transactions committed - 260 Hz
Conflict rate - 3 Hz
Performance limited by process: Storage server performance (storage queue).

Each i3.4xl runs 12 storage processes (6 per disk); the remaining 4 are stateless processes (3 of these nodes each run 1 coordinator).
All 3 i3.xl nodes run 2 Tx processes each (they also each run 1 coordinator + 1 stateless process).

Even at 40-50K RPS + 50-80K WPS, we see the cluster going into an unhealthy state. The FDB performance-limiting reason reported is 'storage_server_write_queue_size'.
Disk busy is under 60% (fdb_cluster_processes_disk_busy). Disk reads: 3K-6K, disk writes: 10K-30K.

We do see the storage process queue going up to 1GB randomly on different SPs. It does not look like it is always hitting the same SP or the same host.
(fdb_cluster_processes_roles_storage_input_bytes_counter - fdb_cluster_processes_roles_storage_durable_bytes_counter)

We see SP CPU hitting 100%.

Each write transaction has ~200 records, and total Tx size is kept below 1MB.
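To give a rough idea of how a batch is committed, here is a simplified sketch with the Python binding (the flat key/value layout and helper name are illustrative, not our actual schema):

import fdb

fdb.api_version(620)
db = fdb.open()

@fdb.transactional
def write_batch(tr, records):
    # records: ~200 (key, value) byte-string pairs, sized so the whole
    # transaction stays under ~1MB.
    for key, value in records:
        tr[key] = value

# write_batch(db, records) is retried automatically on retryable errors
# (including conflicts) by the @fdb.transactional decorator.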

We also see a fair number of Tx conflicts.

Initially the cluster size was 6 nodes (3 Tx + 3 SP); we have not seen linear performance gains as we scaled up to 13 SP nodes. Trying to figure out where the current bottleneck is. Any help is much appreciated.

Can you describe the workload a bit more?

The information given here implies each record is around 5KB (~1MB transactions / ~200 records), which would put your total write bandwidth at 80,000 * 5,000 bytes, or 400MB/s (800MB/s with double replication).

It looks like you’ve got 156 storage processes in this cluster.

If the workload was perfectly evenly distributed, you’re writing over 5MB/s per storage process, which is outside the range I would consider sustainable on the existing SSD storage engine.
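Spelling that arithmetic out (the 5KB figure follows from ~1MB transactions holding ~200 records):

# Back-of-the-envelope estimate from the numbers above.
record_size = 1_000_000 / 200     # ~5KB per record (~1MB tx / ~200 records)
writes_per_sec = 80_000
replication = 2                   # double redundancy
storage_processes = 156

total_mb_s = writes_per_sec * record_size * replication / 1e6   # ~800 MB/s cluster-wide
per_process_mb_s = total_mb_s / storage_processes               # ~5.1 MB/s per storage process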

If your workload is not evenly distributed, which the mostly non-overlapping spikes in storage queue size suggest (in my opinion), you may be hot-spotting some storage servers far above that 5MB/s.

Thank you @ryanworl!

Yes, we have 156 storage processes running.
Looking at the actual payload (tcpdump), each record varies between 100-400 bytes.
The key comprises many fields and starts with the IP of the machine sending the event.
We also have a secondary index based on the current system timestamp. We are currently trying to see if this is causing hot-range issues.

Also, could you please explain more about:

"you're writing over 5MB/s per storage process, which is outside the range I would consider sustainable on the existing SSD storage engine"?

What is recommended?

Thanks for the update.

(100-400 bytes * 80,000 WPS * 2x replication) / 156 processes = ~100-400 KB/s per process

I would say that is sustainable all other things being equal. I would target 1MB/s, so if you’re doing a delete to expire that data after a period of time, your sustainable write rate is approximately 500KB/s because deleting data requires re-writing those pages.

Thanks for giving more details about the workload.

We also have secondary index which is based on current system timestamp. Currently we are trying to see if this is causing hotrange issues.

I would guess this is the problem. If your index key looks something like this:

[TIMESTAMP, IP:PORT, ..., EventID] => ''

You’ll be targeting approximately one shard at any given time, and your cluster will spend tons of IO on splitting the shard and copying over and over again.

RocksDB is coming as a storage engine option in the future, which may make this pattern sustainable at a higher write rate, but it will never be ideal because shard splits will still take place. I wouldn't base your decision on whether to use FDB for this workload on the possibility that RocksDB becomes available in the future, though.

Instead, you can add a prefix to that key, perhaps by hashing the IP address and modding it by some constant to make a prefix you can still scan if you know the IP address you need to query. The constant would be large enough to create enough shards to sustain your write rate. You'll still have splits occurring, just at a slower rate.
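Here is a rough sketch of that idea with the Python binding; the subspace name, bucket count, and key layout are assumptions for illustration, not a prescription:

import hashlib
import fdb

fdb.api_version(620)
db = fdb.open()

NUM_BUCKETS = 64  # choose large enough to spread writes over enough shards
time_index = fdb.Subspace(('idx_by_time',))  # hypothetical index subspace

def bucket_for(ip):
    # Deterministic hash of the IP, modded into a small fixed prefix.
    return int(hashlib.sha1(ip.encode()).hexdigest(), 16) % NUM_BUCKETS

@fdb.transactional
def write_index_entry(tr, ip, timestamp, event_id):
    # The bucket prefix spreads otherwise-monotonic timestamps across
    # NUM_BUCKETS key ranges instead of one hot shard at the tail.
    tr[time_index.pack((bucket_for(ip), timestamp, ip, event_id))] = b''

@fdb.transactional
def read_index_for_ip(tr, ip, start_ts, end_ts):
    # A query that knows the IP still only scans a single bucket prefix.
    b = bucket_for(ip)
    rng = tr[time_index.pack((b, start_ts)):time_index.pack((b, end_ts))]
    return [time_index.unpack(kv.key) for kv in rng]

Note that a query across all IPs for a time range would then have to scan all NUM_BUCKETS prefixes and merge the results.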

Please see this thread for some discussion on possible disk saturation and how to reason about it.

Thank you @gaurav for the useful info.
However, in our case we found that the secondary index was causing hot ranges. Thank you @ryanworl for pinpointing the issue.