Secondary Index on the presence of an optional message field

First of all, sorry for the basic question, but this was not clear to me from the documentation. If I have a top-level message/record with optional message fields (e.g. in a oneof)

message TopLevel {
   <other fields>
   message A {}
   message B {}
   message C {}
   oneof subtype {
     A a = n;
     B b = n+1;
     C c = n+2;
   } 
}

and I wanted to support a Query like (e.g. in Java)

RecordQuery.newBuilder().
        setRecordType("TopLevel".
        setFilter(
            Query.and(
                <some condition on other fields>,
                Query.not(Query.field("a").isNull())
            )).build();

with a secondary index, would I merely create a value type index on the field like below?

metadataBuilder.addIndex("TopLevel", new Index("TopLevel$a", field("a"), IndexTypes.VALUE));

I.e. is the proto bytestring U null indexed in this case?

P.s. I noticed that there is an Index constructor that takes a valueExpression argument, but no documentation is provided.

tagging @alloc for visibility, thanks!

Oh, thanks. I missed this one the first time around.

By default, missing optional fields are included as explicit nulls in the index. This is to support the .isNull() predicate (that is, by including the nulls, it lets it find all of the records where that field is not set).

One other note: you may have more success with the query predicate Query.field("a").notNull() instead of Query.not(Query.field("a").isNull()). Those are logically equivalent, but the query planner may be able to handle the former a bit better than the latter, though I could be wrong (the normalization process may fix that for you, in which case they’re the same). But if the query seems slow/doesn’t use the index, making that change to the query may help.

The “valueExpression” there refers to data that gets indexed, but only in the value portion of the key-value pairs that make up the index. That’s where the name comes from, but from a semantics perspective, the best way to think of it is that the data in the key portion of the index expression (that is, not in the valueExpression) determines how the index data are sorted, and so only data in the key portion of the expression can be used to satisfy ordering expression. Additionally, predicates can also be satisfied by scanning relevant ranges of the index, but those optimizations can only be performed on elements in the key. You can still put predicates on fields in the value portion of the expression, but those will always be satisfied by scanning the index entry and then executing the predicate as residual filter on the index entry.

The main reason you’d put data into the valueExpression is to make the index cover additional columns and avoid looking up the base records. By default, the result of a query is always the full record, but if you use the .setRequiredResults option on the RecordQuery, the Record Layer can sometimes return partial results where only the fields found in the index are available, and the planner also guarantees that the fields in the required-results section are present. (It becomes an API misuse to reference another field, though this isn’t really enforced by the returned Message, so it can be dangerous if you don’t do something like martial the Message into a data type that makes it clear what is and isn’t available.)

I believe the use of separate key and value expressions in index constructors is currently deprecated, though you can still use a KeyWithValueExpression to achieve the same result. If you do something like:

Key.Expressions.keyWithValue(Key.Expressions.concatenateFields("foo", "bar", "baz"), 2)

Then this can be used to construct an index sorted by the first two fields ("foo" and "bar") with the remaining field ("baz") in the value. So could then use this query to satisfy something like:

RecordQuery.newBuilder()
    .setFilter(Query.field("foo").equalsValue("a"))
    .setSort(Key.Expressions.field("bar"))
    .setRequiredResults(List.of(Key.Expressions.field("baz"))
    .build();

This will then get executed as a single covering index scan where the first column is set to "a", and it will return records where only the "baz" field is guaranteed to be set.

1 Like