I have a question about how to split a job into a bunch of small jobs, basically implementing something like a Spark reader for FDB. To do that I started down the path of using LocalityInfo.getBoundaryKeys, but I ran into an issue with the way my keys are laid out. Since I use a chunk-splitting strategy, I can't really have the boundaries fall in the middle of a chunk, so I figured it's probably best to adjust the boundaries based on my key scheme.

As an example, here is my key layout: (subspace, string, versionstamp(userVersion)), with (subspace, string, versionstamp(userVersion), number of chunks), (subspace, string, versionstamp(userVersion), part0), and (subspace, string, versionstamp(userVersion), part1) representing the chunking.

Is there a strategy folks use to break on schema-dependent rules? I noticed a function called getBlobGranuleRanges, but judging by its arguments it doesn't look useful, and it's also fairly undocumented.
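To make that concrete, here's roughly what the writing side of that layout looks like (a simplified sketch against the Java bindings; the class, subspace, and helper names are made up, and I'm taking a small liberty by storing the chunk count as the value of the bare versionstamp key):

```java
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import com.apple.foundationdb.tuple.Versionstamp;

public class ChunkedWriter {
    // Hypothetical subspace; the real prefix comes from the application.
    private static final Subspace RECORDS = new Subspace(Tuple.from("records"));

    /**
     * Writes one logical value as a chunk group:
     *   (subspace, name, versionstamp(userVersion))         -> number of chunks
     *   (subspace, name, versionstamp(userVersion), 0..n-1) -> chunk bytes
     */
    static void writeChunked(Transaction tr, String name, int userVersion, byte[][] chunks) {
        Versionstamp vs = Versionstamp.incomplete(userVersion);

        // Header key; its value carries the chunk count so readers know how many parts to fetch.
        byte[] headerKey = RECORDS.packWithVersionstamp(Tuple.from(name, vs));
        tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY, headerKey, Tuple.from(chunks.length).pack());

        // One key per chunk part, distinguished by the trailing part index.
        for (int i = 0; i < chunks.length; i++) {
            byte[] partKey = RECORDS.packWithVersionstamp(Tuple.from(name, vs, i));
            tr.mutate(MutationType.SET_VERSIONSTAMPED_KEY, partKey, chunks[i]);
        }
    }
}
```

The problem is that a boundary key from the locality API can land between the header key and one of the part keys, or between two part keys of the same group.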
Figured out a solution. I essentially adjust the boundaries returned by getBoundaryKeys to make sure they don't fall in the middle of a chunk, which is probably the only way to work around this anyway.
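For anyone who finds this later, the adjustment looks roughly like this (a sketch against the Java bindings, not my exact code; it snaps each boundary from LocalityUtil.getBoundaryKeys to the start of the chunk group that owns the first real key at or after the boundary):

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.LocalityUtil;
import com.apple.foundationdb.async.CloseableAsyncIterator;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkAwareSplits {
    /**
     * Returns split points that never land in the middle of a chunk group,
     * where a chunk group is everything sharing the (name, versionstamp)
     * tuple prefix inside the given subspace.
     */
    static List<byte[]> chunkAlignedBoundaries(Database db, Subspace records) {
        byte[] begin = records.range().begin;
        byte[] end = records.range().end;
        List<byte[]> splits = new ArrayList<>();

        try (CloseableAsyncIterator<byte[]> boundaries = LocalityUtil.getBoundaryKeys(db, begin, end)) {
            while (boundaries.hasNext()) {
                byte[] boundary = boundaries.next();
                // Find the first real key at or after the shard boundary, then snap the
                // split point to the start of that key's chunk group.
                byte[] aligned = db.readAsync(tr ->
                        tr.getRange(KeySelector.firstGreaterOrEqual(boundary),
                                    KeySelector.firstGreaterOrEqual(end), 1)
                          .asList()
                          .thenApply(kvs -> {
                              if (kvs.isEmpty()) {
                                  return null; // no data at or after this boundary
                              }
                              Tuple t = records.unpack(kvs.get(0).getKey());
                              Tuple group = Tuple.from(t.get(0), t.get(1)); // (name, versionstamp)
                              return records.range(group).begin;
                          })).join();
                if (aligned != null
                        && (splits.isEmpty() || !Arrays.equals(splits.get(splits.size() - 1), aligned))) {
                    splits.add(aligned);
                }
            }
        }
        return splits;
    }
}
```

Since boundaries are only advisory hints about shard layout, moving a split point to the nearest group start just makes one range slightly bigger and its neighbor slightly smaller, which is fine for work splitting.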
Yeah, that sounds like the best solution to me, too. The Record Layer does something similar when it uses getBoundaryKeys to split up a range. Effectively, it has a bunch of records sorted by the record's primary key. Each record can take up one or more adjacent keys, and we want to split on rough shard boundaries, but only at discrete primary keys. So we created this function: fdb-record-layer/FDBRecordStore.java at bc8f86077f7169d9b3ab81eb0e71959c5e27d778 · FoundationDB/fdb-record-layer · GitHub
It goes through each boundary key returned by the locality API and then scans for the first whole primary key it can compute from that boundary key.
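I haven't copied the linked code here, but the shape of it is roughly the following (a simplified sketch against the plain Java bindings, not the Record Layer's actual implementation; recordsSubspace and primaryKeyLength are stand-ins for however the store knows its key layout):

```java
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.ReadTransaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;

import java.util.concurrent.CompletableFuture;

public class BoundaryAlignment {
    /**
     * Given a raw boundary key from the locality API, find the first complete
     * primary key at or after it. Record keys are assumed to live in
     * recordsSubspace and look like (primaryKey..., suffix...), where the
     * primary key is the first primaryKeyLength tuple elements.
     */
    static CompletableFuture<Tuple> firstPrimaryKeyAtOrAfter(ReadTransaction tr,
                                                             Subspace recordsSubspace,
                                                             byte[] boundaryKey,
                                                             int primaryKeyLength) {
        return tr.getKey(KeySelector.firstGreaterOrEqual(boundaryKey))
                 .thenApply(key -> {
                     if (key == null || !recordsSubspace.contains(key)) {
                         return null; // boundary fell past the end of the record space
                     }
                     Tuple decoded = recordsSubspace.unpack(key);
                     // Keep only the leading primary-key elements; drop the
                     // per-record suffix (split index, chunk number, etc.).
                     return Tuple.fromList(decoded.getItems().subList(0, primaryKeyLength));
                 });
    }
}
```

The resulting primary keys then become the split points, so each split range always contains whole records.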