Scaling Record Layer for Bulk Writes

i know this has been asked before , but this post covers a slightly different issue as compared to the other one.

We are using Record Layer to model our data and dealing with volumes in the range of ~10TB.
Below captures the essence of our Data model -

message Parent {
    int primary_id = 1; //Primary Key
    Nested1 nested1 = 2;
    Nested2 nested2 = 3;

message Nested1 {
    int foo = 1;
    int bar = 2;

message Nested2 {
    string baz = 1;
    string bat = 2;

We have a legacy Relational Database , from which we read records and attempt to first hydrate or bulk load the data into a record layer fronted FDB Cluster ( 6.2.19 ).
We have repeatedly run into bottlenecks on the write speed we can see on the cluster.

Cluster shape -

Number of machines - 16
Config - 
     RAM - 600 GB
     DISK - 5.5 TB ( NVMe SSDs )
     CORES - 40 
Resolvers - 4
Proxies - 4
Storage Processes - 84 ( 12 dedicated machines as storage servers )
Tlogs - 4 ( separate from Storage )

We use a SINGLE record store , since model-wise that makes sense for us.
Essentially, our keyspace looks as follows -


The prefix is constant for all records , varying only at the level of the primary_key, which is the identifier.

Our bulk loading processes run in parallel, to make more FDB network threads available to scale the write throughput, but we notice that invariably, after some time we see one of the storage processes running at 100% CPU, and as soon as that happens the cluster write throughput comes to a crawl.
I’m guessing RateKeeper kicks in here and limits the throughput.

Record Layer Specifics -
This is how we get the Record store keyspace -

private static final KEY_SPACE =
                        new DirectoryLayerDirectory("some_parent")
                                .addSubdirectory(new DirectoryLayerDirectory("some_static_id")
                                        .addSubdirectory(new DirectoryLayerDirectory("enviroment_value"))));

Finally ,

  • Is there a way we can scale the write throughput with the above model in place ?
  • We currently see write rate range around ~800000 hz, but the rate of increase of key-val sizes is ~3.5GB/min. Which is almost the same when the write rate were ~400000Hz. Is there a way to have a better control on the shard assignment? since i guess our prefix for keyspace is not allowing better partitioning of the data
1 Like

Hey @alloc , Tagging you in case you miss this. :slight_smile:

Are the parallel workers writing disjoint key ranges? The best case for FDB is when the writes are spread across the key space, so breaking the primary key space up into contiguous ranges, and giving each load process one range to process should help.

If you are already doing that, it’s likely that the writes will initially hit a single shard. FDB should detect this, and split the shard, allowing the writes to hit multiple machines.

When you see 100% load on one storage server, how busy are the other storage servers?

1 Like

Besides what @sears’s valuable points,

Empirically, the tLog to storage server ratio is 1:14. (It may change based on hardware types.)

For a better write throughput, it is, empirically, better to have more tLogs.

There are a range of keys which are written by each parallel process.
eg - Let’s suppose we have to write 100000 Ids ( Keys ) , and there are 10 processes , the distribution would be , P1 - 0-10000, P2 - 10001-20000 and so on.

When you see 100% load on one storage server, how busy are the other storage servers?
That is the interesting part, when one of the storage processes hits 100% , the rest are usually < 10%.
One fact which i probably did not mention is the updation of record layer indexes also happens in each transaction. We have around 30 indexes defined.

The distribution of keys looks OK.

For the sake of debugging, can you get rid of the indexes? If it fixes the load imbalance, then perhaps one of the indexes has a problematic access pattern. Add them back in a few at a time until the problem reproduces.

Your workload seems simple enough that adjusting it to narrow down the issue may be the quickest way.

But an alternative to consider is turning on client request sampling with

profile client set <RATE> <SIZE>

in fdbcli and then using DatabaseClientLogEvents and KeySpaceCountTree in Record Layer to analyze them.

Specifically, this allows you to get a hierarchical breakdown of sampled read/write requests by storage server by key range, decoded according to your key-space paths. It understands subspaces inside the record store keyspace like particular indexes.

We use it as part of some higher level tooling, but the unit tests should be enough to understand the API.