Process Classes: Initial vs Production Setup

Hi,

We are deploying a FoundationDB cluster on AWS. Due to a recent outage we have to recreate our cluster. FDB is used as a store of key-value pairs, with a Redis Cluster in front serving as a cache. Whenever the Redis Cluster reaches its memory utilization limit, we dump some of the older keys into FDB to free up space. We also retrieve keys from FDB if they are requested and missing from the cache.

We had used a kind of default (lazy) setup (process classes were not set explicitly) and it worked well for the normal case.
However, I’m now facing issues when trying to bulk load the data from a Redis Cluster (I’ve recreated all the keys in a larger cluster, ~780M keys). The issue I discovered is disk saturation on all servers (the tlog and storage classes are combined into a single process).

Here I’d like to ask for advice on how the setup for an initial bulk load of data should differ from a regular setup (where reads and writes are equally likely and arrive at a much lower rate).

One simple improvement is setting the redundancy mode to ‘single’ (with the triple redundancy setting the cluster constantly enters the “Healthy (Repartitioning)” state during the bulk load).

I’ve read that increasing the number of tlog processes (along with disks and CPU cores) could help, although I didn’t really understand why.

Any other advice you could suggest for the initial write-intensive setup?

My goal is to achieve write throughput above 20K keys/s. The key-value pairs are written in transactions (4096 pairs per transaction).
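For concreteness, the batching looks roughly like this (a minimal sketch using the Python bindings; `write_batch`, `bulk_load`, and the `items` iterable are illustrative names, not our actual loader code):

```python
import fdb

fdb.api_version(630)
db = fdb.open()  # uses the default cluster file

BATCH_SIZE = 4096  # pairs per transaction, as in the load test


@fdb.transactional
def write_batch(tr, pairs):
    # All pairs in the batch commit (or retry) as one transaction.
    for key, value in pairs:
        tr[key] = value


def bulk_load(db, items):
    """items: iterable of (key, value) byte-string pairs from the Redis dump."""
    batch = []
    for kv in items:
        batch.append(kv)
        if len(batch) == BATCH_SIZE:
            write_batch(db, batch)
            batch = []
    if batch:
        write_batch(db, batch)
```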


To begin with, FDB is not best known for its bulk loading capabilities. I think it mainly boils down to two reasons:

  • Every write has to go through transaction subsystem. No way to write directly to storage servers even when you donā€™t care about ACID properties
  • SLQLite storage engine is very sequential in its disk IO. Even if you are running on storage like NVMe or EBS, which gives you high IOPS, it would not do more than 2 to 5 disk I/O in parallel. Community is trying to address this with RocksDB and RedWood storage engines.

Having said that, there are ways to get the best out of the current configuration. A couple of things worked for us in the past; I am sure there are other tricks I am not aware of:

  • Reduced replication
    • As you mentioned single replication would help, but probably not worth the risk. Before you get to change the replication back to desired, if you loose a disk, you have to reload all of your data. I think double replication is good compromise.
  • Transaction size
    • Not sure what is your key/value size. FDB maximum transaction size is 10MB. Although having larger transactions are good way to get bandwidth, in my experience having too big would makes things worse. In the past, I managed to get sweet spot between 512KB to 1MB transaction sizes.
  • Conflict ranges
    • If you are doing too many sets in your transaction, by default FDB creates a range for each set. Instead, I suggest setting conflict ranges manually for each transaction to one range (or as few as you can). That reduces serialization work and conflict resolution.
  • Memory Storage engine
    • Depending on your working set size, you could consider memory storage engine. It gives you at least 3x bandwidth compared to SQLLite and yet persists your data. You can later convert the storage engine
  • Run more servers
    • As explained above FDB can not utilize disk IOPS very well. So, TLog and storage server disk throughput is bound by disk latency. I suggest running more servers. For example, instead of using 1 server for 1TB EBS disk, run two storage servers at 512GB each. If you are running on local disk, run multiple servers on same disk.
    • NOTE: You have to be careful not scaling the cluster too big, too many storage servers increases work for tlogs. Too many tlogs increase work for proxies. I suggest keep playing with different sizes.
  • Disable Backup
    • Disable FDB backup if you are using it, temporarily during your initial bulk load. As FDB does dual writes to preserve mutation logs, your write bandwidth would double if you disable backup.

Thanks for the detailed answer @mbhaskar

Our transaction size is definitely under 2MB, as each item is well below 512B. I’m not sure what you mean by “conflict ranges”, but during the initial load each key in our case is written exactly once.

I’ve tried re-running the load test (on 9 FDB servers) with 3 storage + 6 tlog nodes and with 3 tlog + 6 storage nodes. Write performance is about 16K keys per second in each case. If, however, a server runs storage + tlog on the same disk, performance degrades (from 16K to 10K and probably lower; I didn’t run it for long). Also, a higher replication factor slows things down due to repartitioning while the test is running.
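In case it helps anyone else, this is roughly how the explicit process classes are set now (a sketch of one host’s foundationdb.conf; the ports, paths, and process counts are illustrative, not my exact config):

```
## foundationdb.conf (per-host sketch; ports and paths are illustrative)
[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb

## dedicated transaction-log process on its own disk
[fdbserver.4500]
class = transaction
datadir = /mnt/tlog/4500

## storage processes, each with its own data directory
[fdbserver.4501]
class = storage

[fdbserver.4502]
class = storage

## stateless process for proxies / resolvers / cluster controller
[fdbserver.4503]
class = stateless
```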