Data distribution and rebalancing

I have read the official doc about data distribution and some in-depth discussion on the topic:
https://github.com/apple/foundationdb/blob/master/design/data-distributor-internals.md
https://forums.foundationdb.org/t/keyspace-partitions-performance/168

Based on the doc and discussion, I have a general idea of how sharding is done within FDB. What is not clear to me is how to optimize performance for my use case: an initial batch load into an empty FDB cluster.

We have about 32 clients loading data concurrently into an FDB cluster with 9 storage servers, 3 proxies, and 3 log processes. The data set is split fairly evenly across the clients, but we noticed performance degrading during the run: the first batch of tasks took about 20 to 30 seconds each, while later batches took about 2 minutes, even though the data volume per task is about the same (450K entries). We commit every 30K entries (roughly 1 MB of key/value data). We noticed a large “Moving data” figure in status during the run, which likely comes from repartitioning since the FDB cluster is empty at the beginning. Here is an example:

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 4.209 GB
  Sum of key-value sizes - 3.365 GB
  Disk space used        - 11.688 GB

So the question is: how can we avoid or optimize the repartitioning during the initial batch load, when writes dominate (almost no reads) and there is no pre-existing range allocation across servers because the database starts out empty? We tried hashed keys instead of the original sequential keys, but that didn’t seem to change much. Since we write into a subspace created through the directory layer, would the common prefix limit the effect of hashing, even though the prefix is short?
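To make the setup concrete, each client does roughly the following (a simplified sketch using the Python bindings; my_dataset and source_records are placeholder names):

    import hashlib
    import fdb

    fdb.api_version(710)          # assuming a 7.1-series client; match your cluster
    db = fdb.open()

    # Placeholder subspace; the directory layer maps it to a short allocated prefix.
    dataset = fdb.directory.create_or_open(db, ('my_dataset',))

    @fdb.transactional
    def write_batch(tr, batch):
        for seq_key, value in batch:
            hashed = hashlib.sha256(seq_key).digest()[:16]   # hash of the original sequential key
            tr[dataset.pack((hashed,))] = value

    def source_records():
        # placeholder for this client's slice of the data set
        for i in range(450_000):
            yield b'seq-%012d' % i, b'x' * 32

    BATCH_SIZE = 30_000           # commit every 30K entries (~1 MB of key/value data)
    batch = []
    for record in source_records():
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            write_batch(db, batch)
            batch = []
    if batch:
        write_batch(db, batch)

The directory layer assigns the subspace a short allocated prefix, so every key above still shares those leading bytes, and the hashing only spreads keys within that one contiguous range.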

Thanks,

Randy

I am interested in this too. Since this question was asked two years ago, would someone be able to answer it?

You can try temporarily disabling DD (https://github.com/apple/foundationdb/blob/7.1.43/fdbcli/DataDistributionCommand.actor.cpp#L89):

Usage: datadistribution <on|off|disable <ssfailure|rebalance>|enable <ssfailure|rebalance>>
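For example (a sketch using the commands from the usage above, run from fdbcli), you could turn off just the rebalancing movement before the load starts and turn it back on afterward:

    fdb> datadistribution disable rebalance
    ... run the bulk load ...
    fdb> datadistribution enable rebalance

Note that “datadistribution off” stops data distribution entirely, including the movement that restores replication after a storage server failure, so the narrower “disable rebalance” form is usually the safer one to leave off during a long load.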

Loading data into FDB has the problem you encountered. The reason is that an empty database starts with 3 seed storage servers as a single “team”, and all data is written to this one seed team. As these 3 storage servers accumulate data, DD starts to repartition, i.e., load balance, the data onto other storage servers (by recruiting new storage teams). The repartitioning process repeatedly finds a split point within a shard and moves a key range out.
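If you want to observe this while loading, a small sketch along these lines (Python bindings; the key and range below are placeholders) prints the current shard boundaries and the storage servers responsible for a given key, so you can watch the single seed team give way to more shards and teams as DD splits and moves data:

    import fdb

    fdb.api_version(710)   # assuming a 7.1-series client
    db = fdb.open()

    # Each key returned by get_boundary_keys starts a shard, so the number of
    # boundaries in a range approximates the shard count there.
    boundaries = list(fdb.locality.get_boundary_keys(db, b'', b'\xff'))
    print('shards in the user keyspace:', len(boundaries))

    @fdb.transactional
    def addresses_for(tr, key):
        # storage servers currently responsible for this key
        return fdb.locality.get_addresses_for_key(tr, key).wait()

    print(addresses_for(db, b'some-key'))   # placeholder key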

So there are at least a couple of problems here.

  1. Initially, not all storage servers are utilized, because the DB starts with just one team. It would be nice if DD could react quickly, recruit many teams, and distribute load by assigning key ranges to different teams. In recent releases we have added changes that allow manually specifying shard boundary keys, which can help to some degree, but we haven’t tested their effectiveness yet. This is on our TODO list.
  2. If the load writes data sequentially, then it’s better for DD to figure that out and split the shard at the last key written, i.e., make all of the new data form a separate shard on a different team. This becomes more complicated if the workload uses hashed keys.

None of the above is great news. However, in 7.3 we are introducing a “sharded RocksDB” storage engine, which will allow loading data into FDB by copying RocksDB files directly. Again, this is work in progress and requires a storage engine change.