Sorry for not getting to this earlier; it had somehow escaped my attention. I think I know what’s going on here, but it’s somewhat subtle, and relates to how the Record Layer stores records at kind of a low-level.
tl;dr the Record Layer does not support having two records where the primary key of one is a prefix of the other’s. I thought there was an Issue about this on GitHub, but I couldn’t find it. At the very least, there was at one point somewhere on the agenda something about adding a check to the meta-data validator to ensure that primary keys were always prefix free, but that might not have been moved to the current issue-tracking system.
So, essentially, when the Record Layer serializes a record to disk, it splits the serialized record across multiple key/value pairs, with each value having at most 100 kB in it, as FDB allows for at most 100 kB per value. To do this, it writes keys that look like (ignoring some prefix bytes that are common for all records in the store):
record.primaryKey() + (split_index,) -> some_serialized_bytes
So, if you have a record with a primary key of
(100, 1) (like the second
order record in your example) and it is 250 kB in size, you will store the first 100 kB in a key ending in
(100, 1, 1) (packed as a tuple), the next 100 kB will be in a key ending in
(100, 1, 2), and the final 50 kB will be in a key ending in
(100, 1, 3).
However, there are two special split indexes: a split index of 0 indicates that the record is “unspilt” (i.e., does not exceed 100 kB in size so fits in a single key) and a split index of -1 is used to store the record’s “version” (if the feature is enabled) (see: Overview: Indexing by Version). Note that there are good technical reasons for why the version is not stored within the record itself (that I could go into if desired), but that isn’t super relevant to this discussion.
So, taking your example and using tuples instead of bytes (as
\x14 is zero, tuple encoded), we have:
(100, null, 0) -> order record 1 (as bytes)
(100, 0) -> different order record (as bytes)
(100, 1, 0) -> order record 2 (as bytes)
When a forward scan is performed, the first key we read is
(100, null, 0). From this key, we can deduce that (1) the primary key of the record is
(100, null) (just by chopping off the trailing split index of 0) and (2) that the record is not split (because the split index is 0). This means there is no more data to read for that record, so the record is deserialized (huzzah!) and returned to the user.
The next key is
(100, 0), and from that we conclude that the primary key is
(100,) and that we can deserialize and return the second record, and then this again happens for the final record.
Now, when we scan in reverse, the following happens: the first key read is
(100, 1, 0). This means that the primary key must be
(100, 1), BUT the scan doesn’t know (yet) that this is the only key for that record even though it has a split index of 0, and that’s because there might be a version associated with that record stored at key
(100, 1, -1). So it scans another key, which is
(100, 0). Because this key doesn’t begin with
(100, 1) ( the primary key for the record), it must be the key for another record. At this point the scanner knows it has all of the data for record with primary key
(100, 1), so it can deserialize it and return it to the user.
Now, the reverse scanner at this point knows it has scanned
(100, 0), and from that, it can conclude that this key must be associated with a record with primary key
(100,), and (again) that the record is not split. However, yet again, it doesn’t know that this is the last key for the record, as there might be a version for the record stored at key
(100, -1). So, it reads another key. This time, it sees that the key, which is
(100, null, 0) does begin with
(100,), so it think that that key contains more information about the record with primary key
(100,), so it tries to inspect the split index. But this time the split index is
null (ah!) and an NPE results.
But note that though we got a null pointer exception in this case, the results could be worse. For example, suppose you had a split record with primary key
(100,), so you write to keys
(100, 1) and
(100, 2). But then if you have an unsplit record with primary key
(100, 1) that you then write to key
(100, 1, 0). Well,
(100, 1) <
(100, 1, 0) <
(100, 2), so the record will be written in between the data for the other record. Yikes!
As for solutions, you could:
- Ensure all primary keys of all types have the same number of columns (e.g., the same number of fields in the key expression). The problem only arises when one record’s primary key is the strict prefix of another (and can therefore have its data intermingled with another type’s data). However, you can still have two records with the exact same primary key (and different types), but they will cause problems in a different way (namely, one can overwrite the value of the other).
- Prefix all primary keys of all records with the
recordType() key expression. (See: FAQ: Are record types tables? and Key.Expressions.recordType()) This will put a value at the beginning of the primary key that is unique for each record type, and this ensures that no two records have primary keys where one is a strict prefix of a another as (1) if the records are of different types, then the first column is different and (2) if the records are the same type, then they must have the same number of columns (a property of key expressions). This has other benefits, too, as it allows the record layer to make some optimizations because it knows records of the same type are grouped together.
- Come up with your own plan to avoid prefix collisions.
As for whether the Record Layer should handle this on its own, I think there are a couple different approaches:
- Throwing an error when validating the meta-data if either (1) the primary keys do not have the same number of columns and (2) are not prefixed by the type key. (Essentially mandating either (1) or (2) above.) If you ignore
null values (which could be safe if the field is guaranteed non-null), you can do additional validation based on the types of the values in the key expression.
- Adding more checks at the time of writing a record. The problem here though is that while it is straightforward to check to make sure that the given record is not a prefix of any other record (by scanning for keys with the prefix–a check it has to do anyway to know if there is a record already present at that key that it is then going to replace), scanning for other records that might be a prefix of it is less so. (It would, at the least, make writing every record more expensive.)
- Changing the on disk format (gasp!) in a way that doesn’t have this limitation, though this would require some thought as to how it is rolled out. In particular (and I understand I’ve already rambled on a bunch and this is getting fairly deep into the gritty details here), the FDB tuple layer supports nested tuples, and I think that if instead of encoding primary keys with their split points as
primaryKey() + (split_index,) (i.e., concatenating two tuples where the last entry is the split index and rest are the primary key) it was encoded as
(primaryKey(), split_index) (i.e., a 2-tuple with the first entry being the (nested) primary key tuple and the second entry being the split index), then I think…everything just works? The entries now really would be grouped by primary key and if one primary key was a prefix of another, the way we encode nested primary keys would ensure every single entry of the shorter key was before every single entry of the second, and you could always tell (when doing a scan) when the scan “switches” from one primary key to another.
I do kind of like the third approach on a philosophical level, but it is also incompatible with the current way data are serialized to disk, so we would need to do something to avoid, like, accidentally making all data unreadable on all existing record stores.