Number types in Tuples may not be null | Reverse scan

Hi,

I am running into an error while doing a reverse scan, namely "java.lang.NullPointerException: Number types in Tuples may not be null".
Repro:

message order { int64 order_id = 1; int32 price = 2; }
message differentOrder { int64 order_id = 1; }

The primary key for order is concat(order_id, price) and for differentOrder it is order_id. Let’s assume the data I’m migrating is:
order: order_id = 100, price = 0; order_id = 100, price = 1.
differentOrder: order_id = 100.
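For reference, the primary keys are declared roughly like this (a simplified sketch of my meta-data setup; OrdersProto stands in for the generated class of the messages above):

RecordMetaDataBuilder builder = RecordMetaData.newBuilder()
        .setRecords(OrdersProto.getDescriptor());  // generated from the .proto above
builder.getRecordType("order")
        .setPrimaryKey(Key.Expressions.concat(
                Key.Expressions.field("order_id"),
                Key.Expressions.field("price")));
builder.getRecordType("differentOrder")
        .setPrimaryKey(Key.Expressions.field("order_id"));
RecordMetaData metaData = builder.getRecordMetaData();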
Since the Record Layer handles 0 as null, the way the data gets stored in FDB (in lexicographic order) is:

<100 encoded>\x00\x14    // translates to order record type | order_id = 100, price = 0
<100 encoded>\x14        // translates to differentOrder record type | order_id = 100
<100 encoded>\x01\x14    // translates to order record type | order_id = 100, price = 1

When I do a reverse scan (or even a reverse sort) on this data, it throws the error the moment it comes across the differentOrder entry. Am I missing something here, since it works fine for a forward scan? If not, is this behavior expected?

Thanks.

Sorry for not getting to this earlier; it had somehow escaped my attention. I think I know what’s going on here, but it’s somewhat subtle and relates to how the Record Layer stores records at a fairly low level.

tl;dr the Record Layer does not support having two records where the primary key of one is a prefix of the other’s. I thought there was an Issue about this on GitHub, but I couldn’t find it. At the very least, there was at one point somewhere on the agenda something about adding a check to the meta-data validator to ensure that primary keys were always prefix free, but that might not have been moved to the current issue-tracking system.


So, essentially, when the Record Layer serializes a record to disk, it splits the serialized record across multiple key/value pairs, with each value having at most 100 kB in it, as FDB allows for at most 100 kB per value. To do this, it writes keys that look like (ignoring some prefix bytes that are common for all records in the store):

record.primaryKey() + (split_index,) -> some_serialized_bytes

So, if you have a record with a primary key of (100, 1) (like the second order record in your example) and it is 250 kB in size, you will store the first 100 kB in a key ending in (100, 1, 1) (packed as a tuple), the next 100 kB will be in a key ending in (100, 1, 2), and the final 50 kB will be in a key ending in (100, 1, 3).

However, there are two special split indexes: a split index of 0 indicates that the record is “unsplit” (i.e., it does not exceed 100 kB in size, so it fits in a single key), and a split index of -1 is used to store the record’s “version” (if that feature is enabled) (see: Overview: Indexing by Version). Note that there are good technical reasons why the version is not stored within the record itself (which I could go into if desired), but that isn’t super relevant to this discussion.
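To make the key layout concrete, here is a rough sketch using the FDB tuple layer directly (illustrative only; the Record Layer’s split helper does the real bookkeeping):

Tuple primaryKey = Tuple.from(100L, 1L);     // com.apple.foundationdb.tuple.Tuple

// A 250 kB (split) record: chunks go at split indexes 1, 2, 3, ...
byte[] chunk1 = primaryKey.add(1).pack();    // (100, 1, 1) -> first 100 kB
byte[] chunk2 = primaryKey.add(2).pack();    // (100, 1, 2) -> next 100 kB
byte[] chunk3 = primaryKey.add(3).pack();    // (100, 1, 3) -> final 50 kB

// An unsplit record: the whole serialized record sits at split index 0.
byte[] unsplit = primaryKey.add(0).pack();   // (100, 1, 0)

// The record's version (if the feature is enabled) sits at split index -1.
byte[] version = primaryKey.add(-1).pack();  // (100, 1, -1)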

So, taking your example and writing the keys as tuples instead of bytes (\x14 is the tuple encoding of zero), we have:

(100, null, 0) -> order record 1 (as bytes)
(100, 0) -> different order record (as bytes)
(100, 1, 0) -> order record 2 (as bytes)
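You can see that ordering directly from the packed bytes (a quick sketch using com.apple.foundationdb.tuple.Tuple and ByteArrayUtil):

byte[] k1 = Tuple.from(100L, null, 0L).pack();   // order record 1
byte[] k2 = Tuple.from(100L, 0L).pack();         // different order record
byte[] k3 = Tuple.from(100L, 1L, 0L).pack();     // order record 2

// null encodes as \x00 and the integer 0 as \x14, so k1 < k2 < k3 byte-wise:
System.out.println(ByteArrayUtil.printable(k1)); // \x15d\x00\x14
System.out.println(ByteArrayUtil.printable(k2)); // \x15d\x14
System.out.println(ByteArrayUtil.printable(k3)); // \x15d\x15\x01\x14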

When a forward scan is performed, the first key we read is (100, null, 0). From this key, we can deduce that (1) the primary key of the record is (100, null) (just by chopping off the trailing split index of 0) and (2) that the record is not split (because the split index is 0). This means there is no more data to read for that record, so the record is deserialized (huzzah!) and returned to the user.

The next key is (100, 0), and from that we conclude that the primary key is (100,) and that we can deserialize and return the second record; the same then happens for the final record.

Now, when we scan in reverse, the following happens: the first key read is (100, 1, 0). This means that the primary key must be (100, 1), BUT the scan doesn’t know (yet) that this is the only key for that record, even though it has a split index of 0, because there might be a version associated with that record stored at key (100, 1, -1). So it scans another key, which is (100, 0). Because this key doesn’t begin with (100, 1) (the primary key for the record), it must be the key for another record. At this point the scanner knows it has all of the data for the record with primary key (100, 1), so it can deserialize it and return it to the user.

Now, the reverse scanner at this point knows it has scanned (100, 0), and from that it can conclude that this key must be associated with a record with primary key (100,) and (again) that the record is not split. However, yet again, it doesn’t know that this is the last key for the record, as there might be a version for the record stored at key (100, -1). So, it reads another key. This time, it sees that the key, which is (100, null, 0), does begin with (100,), so it thinks that key contains more information about the record with primary key (100,) and tries to inspect the split index. But this time the split index is null (ah!), and an NPE results.
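In code terms, the failing step looks roughly like this (a simplified sketch, not the actual scanner implementation):

// The scanner is working on the record with primary key (100,) and sees a key
// that begins with that prefix, so it reads the next element as the split index:
Tuple currentPrimaryKey = Tuple.from(100L);
Tuple nextKey = Tuple.from(100L, null, 0L);  // the first order record's key
long splitIndex = nextKey.getLong(currentPrimaryKey.size());
// -> java.lang.NullPointerException: Number types in Tuples may not be null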

But note that though we got a NullPointerException in this case, the results could be worse. For example, suppose you had a split record with primary key (100,), so its data is written to keys (100, 1) and (100, 2). Now suppose you also have an unsplit record with primary key (100, 1), which gets written to key (100, 1, 0). Well, (100, 1) < (100, 1, 0) < (100, 2), so that record’s data lands right in the middle of the other record’s data. Yikes!

As for solutions, you could:

  1. Ensure all primary keys of all types have the same number of columns (e.g., the same number of fields in the key expression). The problem only arises when one record’s primary key is a strict prefix of another’s (and its data can therefore be intermingled with another type’s data). Note that you can still have two records (of different types) with the exact same primary key, but those will cause problems in a different way (namely, one can overwrite the other).
  2. Prefix all primary keys of all records with the recordType() key expression; see the sketch after this list. (See also: FAQ: Are record types tables? and Key.Expressions.recordType().) This will put a value at the beginning of the primary key that is unique for each record type, which ensures that no two records have primary keys where one is a strict prefix of another: (1) if the records are of different types, then the first column is different, and (2) if the records are of the same type, then they must have the same number of columns (a property of key expressions). This has other benefits, too, as it allows the Record Layer to make some optimizations because it knows records of the same type are grouped together.
  3. Come up with your own plan to avoid prefix collisions.
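For option 2, the change is confined to the primary key definitions, e.g. (a sketch; builder here stands for whatever RecordMetaDataBuilder defines the types):

builder.getRecordType("order")
        .setPrimaryKey(Key.Expressions.concat(
                Key.Expressions.recordType(),
                Key.Expressions.field("order_id"),
                Key.Expressions.field("price")));
builder.getRecordType("differentOrder")
        .setPrimaryKey(Key.Expressions.concat(
                Key.Expressions.recordType(),
                Key.Expressions.field("order_id")));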

As for whether the Record Layer should handle this on its own, I think there are a couple different approaches:

  1. Throwing an error when validating the meta-data if the primary keys (1) do not all have the same number of columns and (2) are not prefixed by the record type key (essentially mandating either option 1 or option 2 above). If you ignore null values (which could be safe if the field is guaranteed non-null), you could do additional validation based on the types of the values in the key expression.
  2. Adding more checks at the time of writing a record. The problem here, though, is that while it is straightforward to check that the given record’s primary key is not a prefix of any other record’s (by scanning for keys with that prefix, a check it has to do anyway to know whether there is already a record at that key that it is about to replace), scanning for other records whose primary keys might be a prefix of it is less so. (It would, at the least, make writing every record more expensive.)
  3. Changing the on-disk format (gasp!) in a way that doesn’t have this limitation, though this would require some thought as to how it is rolled out. In particular (and I understand I’ve already rambled on a bunch and this is getting fairly deep into the gritty details), the FDB tuple layer supports nested tuples, and I think that if, instead of encoding primary keys with their split points as primaryKey() + (split_index,) (i.e., concatenating two tuples where the last entry is the split index and the rest are the primary key), they were encoded as (primaryKey(), split_index) (i.e., a 2-tuple whose first entry is the (nested) primary key tuple and whose second entry is the split index), then…everything just works? The entries really would be grouped by primary key, and if one primary key were a prefix of another, the way nested tuples are encoded would ensure that every entry of the shorter key sorts before every entry of the longer one, so you could always tell (when doing a scan) when the scan “switches” from one primary key to another.

I do kind of like the third approach on a philosophical level, but it is also incompatible with the current way data are serialized to disk, so we would need to do something to avoid, like, accidentally making all data unreadable on all existing record stores.
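To illustrate the difference using the tuple layer directly (purely hypothetical; this is not how the Record Layer lays out data today):

// Current (flat) encoding: primary key elements and split index are siblings.
byte[] flat = Tuple.from(100L, 1L).add(0).pack();             // (100, 1, 0)

// Hypothetical nested encoding: the primary key is its own nested tuple.
byte[] nested = Tuple.from(Tuple.from(100L, 1L), 0L).pack();  // ((100, 1), 0)

// With nesting, every key for primary key (100,) sorts strictly before every key
// for primary key (100, 1), regardless of split index, so a scan can always tell
// where one record ends and the next begins.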

Hi Alec,

Thanks for helping out. I have a follow-up question about one of the approaches you suggested, though it is somewhat unrelated to the original topic.

Prefixing all the primary keys of all the records with the recordType() key expression solves the problem described above (and others), but I have a question about the behavior of the following query:

Let’s assume that I want to get the entries of order where order_id < 50,000, and I want to do a reverse sort on concat(recordType, order_id, price). According to the planner behavior of “.setSort()”, if the sort matches an index that can satisfy a filter, then the index is used. I am unable to replicate this behavior for the following query.

Query:

RecordQuery.newBuilder()
    .setFilter(Query.and(
        Query.field("order_id").lessThan(50000L),
        Query.field("price").equalsValue(4)))
    .setSort(Key.Expressions.concat(
        Key.Expressions.recordType(),
        Key.Expressions.field("order_id"),
        Key.Expressions.field("price")), /* reverse */ true)
    .build()

My understanding of why the planner is not able to use the index (or primary key) that satisfies the filter is that the QueryComponent created does not contain the recordType “field”, and hence the sort does not match the filtered index. Is my understanding correct?
I tried setting a filter of Query.keyExpression(Key.Expressions.recordType()), but it throws an exception, “query key expression must be queryable” (which makes sense).

I essentially want the data set of the reverse sort to be filtered/reduced (I have additional filters on top of this subset of data). Is there a way to do this that I am missing, one that does not involve creating a separate index on “order_id”?

Thanks.

Though I haven’t looked into it too deeply, this kind of looks like https://github.com/FoundationDB/fdb-record-layer/issues/744 to me (i.e., a known bug/deficiency).

As to why the query planner can’t use the index in #744 (and probably your example), I think it has less to do with whether the query component has the record type and more to do with the fact that the logic in the planner (which tries to match the sort first and then the filter against the indexes) just isn’t quite sophisticated enough. There’s already some special handling of the record type information, though it could be added to more places.

The error message about queryability is essentially just saying that the expression needs to be a key expression that implements QueryableKeyExpression; that is used, I think, by our FunctionKeyExpressions or something like that.

What is the index you’ve defined? And is it defined on just a single type? If it’s a single type index, you can remove the record type field from your sort entirely. If it’s a multi-type index that’s “grouped” by record type, then that’s kind of the only query you can issue. You could also, if you want, issue a scan of the index yourself (with one of the .scanIndex methods on FDBRecordStore), and bypass the planner.
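If you do go the route of scanning the index yourself, it would look roughly like this (a sketch; recordStore is your FDBRecordStore, the index name is made up, and you would want a narrower TupleRange than ALL if you can construct one):

RecordCursor<IndexEntry> cursor = recordStore.scanIndex(
        recordStore.getRecordMetaData().getIndex("order_by_id_price"),  // your index
        IndexScanType.BY_VALUE,
        TupleRange.ALL,                // or a narrower range over (recordType, order_id)
        null,                          // continuation
        ScanProperties.REVERSE_SCAN);  // scan the index in reverse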

It seems like you’re aware of the tradeoffs, but just for any onlookers, it does look like this will do a full scan of the entire index (if it is “get me all records of all types sorted by type then order ID then price, then filter out the entries with the wrong prices and order IDs”). That’s fine if this is like an offline analytics query or something. It might cause problems if this query needs to be fast (in which case a separate, multi-type index on order_id might be required).
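For reference, such a multi-type index might be declared along these lines (a sketch; metaDataBuilder and the index name are made up):

metaDataBuilder.addMultiTypeIndex(
        Arrays.asList(
                metaDataBuilder.getRecordType("order"),
                metaDataBuilder.getRecordType("differentOrder")),
        new Index("multi_type$order_id", Key.Expressions.field("order_id")));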