Getting a split of ranges to read while honoring chunking

I have a question about how to split a job into a bunch of small jobs, basically implementing something like a Spark reader for FDB. To do that I started down the path of using LocalityUtil.getBoundaryKeys, but I ran into an issue with the way my keys are laid out. Since I use a chunk-splitting strategy, I can't really have the boundaries land in the middle of a chunk, so I figured it's probably best to adjust the boundaries based on my key scheme. As an example, here is my key layout: the logical key is (subspace, string, versionstamp(userVersion)), with (subspace, string, versionstamp(userVersion), number of chunks), (subspace, string, versionstamp(userVersion), part0), (subspace, string, versionstamp(userVersion), part1), and so on representing the chunking. Is there a strategy folks use to break on schema-dependent rules? I noticed a function called getBlobGranuleRanges, but it doesn't look useful judging by its arguments, and it's also fairly undocumented.
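
For reference, here's roughly how I'm pulling the boundaries today (a minimal sketch against the Java bindings; the subspace name and API version are placeholders for my real setup):

```java
import java.util.ArrayList;
import java.util.List;

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.LocalityUtil;
import com.apple.foundationdb.async.CloseableAsyncIterator;
import com.apple.foundationdb.tuple.Tuple;

public class BoundarySplits {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(710); // placeholder API version
        try (Database db = fdb.open()) {
            // "my-subspace" is a stand-in for the real subspace prefix.
            byte[] begin = Tuple.from("my-subspace").range().begin;
            byte[] end = Tuple.from("my-subspace").range().end;

            List<byte[]> boundaries = new ArrayList<>();
            try (CloseableAsyncIterator<byte[]> it =
                     LocalityUtil.getBoundaryKeys(db, begin, end)) {
                while (it.hasNext()) {
                    boundaries.add(it.next());
                }
            }
            // Each adjacent pair of boundaries is one candidate split range --
            // except that a raw boundary can land in the middle of a chunk.
        }
    }
}
```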

Any advice would be appreciated!

Figured out a solution. I essentially adjust the boundaries returned by getBoundaryKeys to make sure they don't land in the middle of a chunk, which is probably the only way to model around this problem anyway.
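
In case it helps anyone else, here's a rough sketch of the adjustment (it assumes every boundary key decodes cleanly as a tuple of the shape above, with the first three elements identifying a record; snapToRecordStart is just a name I made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.apple.foundationdb.tuple.Tuple;

public final class ChunkAlignedBoundaries {
    // Snap one raw boundary key back to the start of the chunked record that
    // contains it by dropping everything after (subspace, string, versionstamp).
    static byte[] snapToRecordStart(byte[] boundaryKey) {
        Tuple t = Tuple.fromBytes(boundaryKey); // assumes a well-formed tuple
        return Tuple.fromList(t.getItems().subList(0, 3)).pack();
    }

    // Snap every boundary, then de-duplicate: two shard boundaries that fall
    // inside the same record collapse into a single split point.
    static List<byte[]> align(List<byte[]> rawBoundaries) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] b : rawBoundaries) {
            byte[] snapped = snapToRecordStart(b);
            if (out.isEmpty() || !Arrays.equals(out.get(out.size() - 1), snapped)) {
                out.add(snapped);
            }
        }
        return out;
    }
}
```

If a raw boundary doesn't decode as a well-formed tuple, falling back to scanning forward for the next real key should work too.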

Yeah, that sounds like the best solution to me, too. The Record Layer implements something similar when it uses getBoundaryKeys to split up a range. Effectively, it has a bunch of records sorted by the record's primary key. Each record can take up one or more adjacent keys, and we want to split on rough shard boundaries, but only at discrete primary keys. So we created this function: fdb-record-layer/FDBRecordStore.java at bc8f86077f7169d9b3ab81eb0e71959c5e27d778 · FoundationDB/fdb-record-layer · GitHub

It goes through each boundary key returned by the locality API and then scans for the first whole primary key that it can compute from that boundary.
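
Roughly, the idea looks like this (a sketch in the spirit of that function, not the Record Layer's actual code; firstWholePrimaryKeyAfter and the three-element primary key are assumptions based on the layout in the question):

```java
import java.util.concurrent.CompletableFuture;

import com.apple.foundationdb.Database;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.tuple.Tuple;

public final class BoundaryToPrimaryKey {
    // Resolve one raw boundary to the primary key of the first record at or
    // after it: read a single real key, decode it, and drop the chunk suffix.
    static CompletableFuture<byte[]> firstWholePrimaryKeyAfter(
            Database db, byte[] boundary, byte[] end) {
        return db.readAsync(tr ->
            tr.getRange(KeySelector.firstGreaterOrEqual(boundary),
                        KeySelector.firstGreaterOrEqual(end), 1)
              .asList()
              .thenApply(kvs -> {
                  if (kvs.isEmpty()) {
                      return null; // nothing left in this range
                  }
                  Tuple t = Tuple.fromBytes(kvs.get(0).getKey());
                  // Under the layout in the question, the first three elements
                  // are the record's primary key; the rest is chunk suffix.
                  return Tuple.fromList(t.getItems().subList(0, 3)).pack();
              }));
    }
}
```

Note that the first key found can belong to a record that started before the boundary, so two adjacent boundaries may resolve to the same primary key and should be de-duplicated.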