I know this has been asked before, but this post covers a slightly different issue from the other one.
We are using the Record Layer to model our data and are dealing with data volumes in the range of ~10 TB.
Below is the essence of our data model -
message Parent {
    int64 primary_id = 1; // Primary key
    Nested1 nested1 = 2;
    Nested2 nested2 = 3;
}
message Nested1 {
    int64 foo = 1;
    int64 bar = 2;
}
message Nested2 {
    string baz = 1;
    string bat = 2;
}
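For completeness, the primary key on Parent is declared in the record metadata rather than in the .proto itself; a minimal sketch of that setup follows (MyRecordsProto stands in for our generated protobuf outer class, and the usual RecordTypeUnion message is assumed to be defined in the .proto):

import com.apple.foundationdb.record.RecordMetaData;
import com.apple.foundationdb.record.RecordMetaDataBuilder;
import com.apple.foundationdb.record.metadata.Key;

// Minimal sketch of the metadata setup. MyRecordsProto stands in for the generated
// protobuf outer class containing Parent, Nested1, and Nested2; the .proto is assumed
// to also define the usual RecordTypeUnion message.
RecordMetaDataBuilder metaDataBuilder = RecordMetaData.newBuilder()
        .setRecords(MyRecordsProto.getDescriptor());
metaDataBuilder.getRecordType("Parent")
        .setPrimaryKey(Key.Expressions.field("primary_id"));
RecordMetaData metaData = metaDataBuilder.getRecordMetaData();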
We have a legacy relational database from which we read records and bulk load (hydrate) the data into a Record Layer-fronted FDB cluster (6.2.19).
We have repeatedly run into bottlenecks on the write throughput we can achieve on the cluster.
Cluster shape -
- Number of machines - 16
- Per-machine config - RAM: 600 GB, disk: 5.5 TB (NVMe SSDs), cores: 40
- Resolvers - 4
- Proxies - 4
- Storage processes - 84 (12 dedicated machines as storage servers)
- TLogs - 4 (separate from storage)
We use a SINGLE record store, since model-wise that makes sense for us.
Essentially, our keyspace looks as follows -
/<some_parent>/<some_static_id>/<enviroment_value>/<record-store-name>/<message_type>/<primary_key>
The prefix is constant for all records, varying only at the level of the primary_key, which is the record identifier.
Our bulk-loading processes run in parallel so that more FDB network threads are available to scale the write throughput. However, we invariably notice that after some time one of the storage processes hits 100% CPU, and as soon as that happens the cluster write throughput slows to a crawl. I'm guessing Ratekeeper kicks in at that point and limits the throughput.
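For context, each loader behaves roughly like the simplified sketch below (BATCH_SIZE and the work partitioning are illustrative, not our exact code); metaData is the record metadata sketched above, and storePath is the keyspace path shown further down -

import com.apple.foundationdb.record.RecordMetaData;
import com.apple.foundationdb.record.provider.foundationdb.FDBDatabase;
import com.apple.foundationdb.record.provider.foundationdb.FDBDatabaseFactory;
import com.apple.foundationdb.record.provider.foundationdb.FDBRecordStore;
import com.apple.foundationdb.record.provider.foundationdb.keyspace.KeySpacePath;
import com.google.protobuf.Message;
import java.util.List;

// Simplified bulk-load worker: writes its share of records in fixed-size batches,
// one Record Layer transaction per batch. BATCH_SIZE and the way work is partitioned
// are placeholders, not our exact code.
public class BulkLoadWorker implements Runnable {
    private static final int BATCH_SIZE = 100;

    private final FDBDatabase database = FDBDatabaseFactory.instance().getDatabase();
    private final RecordMetaData metaData;     // metadata built from the Parent descriptor
    private final KeySpacePath storePath;      // resolved keyspace path (see below)
    private final List<Message> recordsToLoad; // records read from the legacy RDBMS

    public BulkLoadWorker(RecordMetaData metaData, KeySpacePath storePath, List<Message> recordsToLoad) {
        this.metaData = metaData;
        this.storePath = storePath;
        this.recordsToLoad = recordsToLoad;
    }

    @Override
    public void run() {
        for (int start = 0; start < recordsToLoad.size(); start += BATCH_SIZE) {
            final List<Message> batch =
                    recordsToLoad.subList(start, Math.min(start + BATCH_SIZE, recordsToLoad.size()));
            // Each batch is saved in its own transaction via FDBDatabase.run.
            database.run(context -> {
                FDBRecordStore store = FDBRecordStore.newBuilder()
                        .setContext(context)
                        .setKeySpacePath(storePath)
                        .setMetaDataProvider(metaData)
                        .createOrOpen();
                batch.forEach(store::saveRecord);
                return null;
            });
        }
    }
}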
Record Layer specifics -
This is how we define the record store KeySpace -
import com.apple.foundationdb.record.provider.foundationdb.keyspace.DirectoryLayerDirectory;
import com.apple.foundationdb.record.provider.foundationdb.keyspace.KeySpace;

private static final KeySpace KEY_SPACE = new KeySpace(
        new DirectoryLayerDirectory("some_parent")
                .addSubdirectory(new DirectoryLayerDirectory("some_static_id")
                        .addSubdirectory(new DirectoryLayerDirectory("enviroment_value"))));
Finally -
- Is there a way we can scale the write throughput with the above model in place?
- We currently see a write rate of around ~800,000 Hz, but the rate of increase of key-value size is ~3.5 GB/min, which is almost the same as it was when the write rate was ~400,000 Hz. Is there a way to get better control over shard assignment? I suspect our keyspace prefix is preventing the data from being partitioned more evenly.