We automatically split shards that are receiving too much write bandwidth. We cannot split a single key into multiple shards, but the storage servers responsible for that hot key will hand responsibility for their other shards to different servers to balance the load.
Read load balancing happens on the client. In triple replication every key is held by three different storage servers. The client keeps track of the latency for each of those servers and sends the request to the least loaded server (the actual algorithm is slightly different than what I just described, but accomplishes the same result). If one key is really hot, the storage servers that are responsible for that key will get heavily loaded. Clients will know those servers are heavily loaded, so any read requests that have the option to avoid those servers will go somewhere else. In practice this does an amazing job of keeping all the servers utilized.
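Here is a minimal sketch of that idea (not the actual FDB client code, and the names are hypothetical): track a smoothed latency per replica and prefer the least loaded one when a read can go to any of them.

```python
import random


class ReplicaSelector:
    """Toy client-side load balancer: prefer the replica with the lowest
    recently observed latency. Purely illustrative, not the FDB algorithm."""

    def __init__(self, replicas, alpha=0.2):
        self.alpha = alpha                          # weight for the moving average
        self.latency = {r: 0.0 for r in replicas}   # smoothed latency per replica

    def pick(self):
        # Send the read to the replica with the lowest smoothed latency;
        # break ties randomly so traffic does not herd onto one server.
        best = min(self.latency.values())
        candidates = [r for r, lat in self.latency.items() if lat == best]
        return random.choice(candidates)

    def record(self, replica, observed_latency):
        # Update the moving average after each response comes back.
        old = self.latency[replica]
        self.latency[replica] = (1 - self.alpha) * old + self.alpha * observed_latency
```

A replica serving a hot key accumulates higher observed latency, so subsequent reads that have a choice drift toward the other two copies.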
There are definitely use cases where you may want to hash keys preemptively. In fact, our backup and DR work by storing a copy of the mutation log in the system keyspace. The key is the sequentially increasing version of the database, so to avoid overloading one team of storage servers with half the write bandwidth of the cluster, we hash the version and put that in the key like you described. In our case we still want to do range reads, so we actually hash version/1e6, but the concept is the same.
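A hedged sketch of that key scheme (the real backup code differs in its details, and the bucket size and hash here are assumptions for illustration): prefixing the key with a hash of version // 1e6 spreads sequential writes across the keyspace, while keeping roughly a million consecutive versions adjacent so range reads within a bucket still work.

```python
import hashlib
import struct

BUCKET_SIZE = 1_000_000  # versions per hash bucket (illustrative value)


def mutation_log_key(version: int) -> bytes:
    """Build a key that scatters write bandwidth but preserves local ordering."""
    bucket = version // BUCKET_SIZE
    # Short, stable hash of the bucket number; the exact hash is an assumption.
    prefix = hashlib.sha256(struct.pack(">Q", bucket)).digest()[:4]
    # Nearby versions share a prefix, so they sort together and can be
    # range-read; different buckets get pseudo-random prefixes, so the
    # write load lands on many storage teams instead of one.
    return prefix + struct.pack(">Q", version)
```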
FDB attempts to divide the database into shards of roughly equal size (both in terms of bytes and write bandwidth). If it notices that one range of keys has too much data, it will split it into smaller ranges, and each range will get assigned to a new set of storage servers. For each range, the algorithm starts with 4 random candidate teams of storage servers, and then assigns the range to the one that has the least amount of data. This algorithm converges to an optimal distribution while avoiding the herding effects of always attempting to move data to the least loaded servers.
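A minimal sketch of that candidate selection, with hypothetical team names and sizes: sample a handful of random teams and pick the least loaded of those, rather than always picking the globally emptiest team.

```python
import random


def assign_shard(team_bytes: dict, candidates: int = 4) -> str:
    """team_bytes maps team id -> bytes currently stored on that team.
    Draw a few random candidate teams and return the least loaded one."""
    sampled = random.sample(list(team_bytes), k=min(candidates, len(team_bytes)))
    return min(sampled, key=lambda team: team_bytes[team])


# Example: four candidate teams are drawn at random and the new shard goes
# to whichever of those four currently holds the least data.
teams = {f"team-{i}": random.randint(0, 10**9) for i in range(20)}
print(assign_shard(teams))
```

Because each shard only competes among a few random candidates, an empty team fills up gradually instead of receiving every new shard at once.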
Our data movement algorithm does not react instantly to changes in the workload. If all of your writes are directed at a new subspace every 30 seconds, you will run into performance problems, and you should consider splitting the writes yourself. If the writes are changing subspaces every hour, it is probably okay.