The `pk` field is the primary key of the record being indexed. So, if you had a record like:
{
    "id": 1066L,
    "text": "My car and my dog and mine, my Car is color blue"
}
where the `id` field was the primary key and the text index was on the `text` field, then your example laying out all of the keys and values stays the same except that `pk` is replaced with `1066L` in each kv-pair.
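Concretely, assuming a tokenizer that lowercases and strips punctuation, the entries for that record would look something like (one key per token and primary key, with the value holding that token's position list):

```
(prefix, "and",  1066) -> [2, 5]
(prefix, "blue", 1066) -> [11]
(prefix, "car",  1066) -> [1, 8]
(prefix, "my",   1066) -> [0, 3, 7]
...
```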
The `prefix` is some subspace prefix. In the record layer, each record store is assigned a subspace (in some sense, it is defined by its subspace and meta-data), and within a record store, each index gets its own sub-subspace to avoid two different indexes writing over each other. (I think the document layer does something different, but substitute “record store” with “collection”.) It doesn’t really matter what it is; it’s just important that it exists and is unique.
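For a rough picture of that nesting (the names here are made up; the real record layer derives these prefixes from its meta-data):

```python
import fdb
fdb.api_version(710)

# Illustrative only: each record store gets a subspace, and each index gets
# its own nested sub-subspace, so two indexes can never collide on keys.
record_store = fdb.Subspace(('my_record_store',))
text_index_subspace = record_store.subspace(('my_text_index',))  # the "prefix"
```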
The actual pseudocode for the indexing work would be something like:
import fdb
fdb.api_version(710)

@fdb.transactional
def index_record(tr, record):
    subspace = get_subspace_for_index()   # this index's own sub-subspace
    pk = record.get_primary_key()         # primary key, as a tuple
    text = extract_text(record)           # the indexed text field
    token_list = tokenize(text)
    # Map each distinct token to the list of positions at which it appears.
    token_to_positions_map = get_position_list_for_each_token(token_list)
    for token, position_list in token_to_positions_map.items():
        # Key layout: (index prefix, token, primary key) -> serialized position list
        key = subspace.subspace((token,)).pack(pk)
        value = serialize(position_list)
        tr[key] = value
A few caveats here:
- This doesn’t handle deletes/updates
- The actual data storage format for the text index in the record layer combines adjacent keys within a single token in order to save space, but the format laid out above is equivalent from an API point of view (with the complexities hidden behind the `BunchedMap` class).
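For completeness, handling a delete in the same pseudocode style (using the same hypothetical helpers as above, not the record layer's actual maintainer code) would mean re-deriving the old record's tokens and clearing its keys; an update is then an unindex of the old version plus an index of the new one:

```python
@fdb.transactional
def unindex_record(tr, old_record):
    # Clear every (token, primary key) entry written for the old version.
    subspace = get_subspace_for_index()
    pk = old_record.get_primary_key()
    for token in set(tokenize(extract_text(old_record))):
        del tr[subspace.subspace((token,)).pack(pk)]
```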
The offsets are for satisfying phrase search queries and proximity search queries. In particular, there are `containsPhrase` and `containsAll`-within-a-distance predicates that make use of those offset lists.
If you are searching for a single term (like “car”), then these offsets are of essentially no use (except maybe to do highlighting). In that case, the index is scanned for all keys beginning with `(prefix, "car")`, and the primary keys can be extracted from the index. Then the actual records can be retrieved by primary key (or, if the primary key is enough, that can be returned).
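A minimal sketch of that scan, in the same pseudocode style as above:

```python
@fdb.transactional
def primary_keys_containing(tr, token):
    # Range-read everything under (prefix, token); whatever remains of each
    # key after that prefix is the primary key of a record with that token.
    token_sub = get_subspace_for_index().subspace((token,))
    return [token_sub.unpack(k) for k, _ in tr[token_sub.range()]]
```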
If you are searching for a phrase (like “my car”), then the index can be queried by scanning everything beginning with `(prefix, "my")` and everything beginning with `(prefix, "car")` in parallel. The two scans are then combined using something like a merge-join in a traditional database, and any place where the scans intersect corresponds to a record where all of the searched-for tokens are in the document. Then one can inspect the position lists to ensure everything is in the right order. Here “my” has positions [0, 3, 7] and “car” has positions [1, 8]. For phrase search, you are looking for a position in “car” that is exactly 1 greater than a position of “my”. Here, “my” at position 0 and “car” at position 1 satisfies the predicate, and the primary key of that record is returned (1066, if using my sample record above). A similar procedure can be done for proximity search, but it instead checks whether the minimum span between the position lists is less than or equal to the span given as a parameter to the query.
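The position check at the end could look something like this toy sketch, where position_lists holds the offset list of each query token, in phrase order, for one record that survived the merge-join:

```python
def phrase_positions_match(position_lists):
    # A phrase hit needs consecutive positions p, p + 1, p + 2, ... across
    # the per-token offset lists.
    starts = set(position_lists[0])
    for offset, positions in enumerate(position_lists[1:], start=1):
        starts = {p for p in starts if p + offset in set(positions)}
    return bool(starts)

phrase_positions_match([[0, 3, 7], [1, 8]])  # True: "my" at 0 and "car" at 1 line up
```

For the proximity (“within a distance”) case, one would instead look for the smallest window containing a position from each list and compare it to the allowed span.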
Note that there is an index option, `TEXT_OMIT_POSITIONS_OPTION`, which can be used to stop the index from writing the position lists. This can be used as a space-saving optimization if one doesn’t need to perform those kinds of queries.
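Schematically, in terms of the pseudocode above (the real on-disk format differs, as noted), omitting positions amounts to writing an empty value:

```python
@fdb.transactional
def index_record_without_positions(tr, record):
    # Hypothetical variant of index_record: the key alone records that the
    # token appears in the record, and no position list is stored.
    subspace = get_subspace_for_index()
    pk = record.get_primary_key()
    for token in set(tokenize(extract_text(record))):
        tr[subspace.subspace((token,)).pack(pk)] = b''
```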
As described, no, the index can’t satisfy suffix or infix queries like that. It only supports prefix matching and full-token matching. (This is essentially an artifact of the kinds of range scans that the FDB key-value store supports.) However, there are a couple of different solutions that allow the user to extend the text index so that it can do those more powerful kinds of queries, most of which depend on the fact that the record layer text solution allows for pluggable tokenizers. You could (rough sketches of each idea follow the list):
- Write a tokenizer that calls “reverse” on each token before inserting into the index. This can let you satisfy “suffix” queries. In the case of “understand”, it has been inserted into the database as “dnatsrednu”, so a prefix search for “dnats” will find that document. Note that this uses the same amount of space as a regular index, but it only supports suffix search, not infix search.
- Write a tokenizer that inserts all suffixes of each token. So “understand” gets inserted as “understand”, “nderstand”, “derstand”, “erstand”, “rstand”, “stand”, “tand”, “and”, “nd”, and “d”. (Maybe you don’t insert the smallest tokens to save space.) Then any prefix scan over the index returns any document with the searched-for string anywhere in the interior of a token in the original document. So “stand” will match, as will “sta”, “ders”, “ndersta”, etc. Note that for a token of length n, there will be n keys inserted into the index, with on the order of n² characters across all inserted tokens, but each infix query can be satisfied with a single range scan.
- Write a tokenizer that, say, writes all tri-grams, i.e., substrings of length 3. So “understand” becomes “und”, “nde”, “der”, “ers”, “rst”, “sta”, “tan”, and “and”. Then an infix search for “stand” becomes a phrase search for “sta”, “tan”, and “and”. Note that each token of length n will be spread across n - 2 keys, but as each inserted token is only 3 characters long, the total number of characters inserted is linear in the size of the original token. The tradeoff is that to find an infix of size k, it must perform k - 2 scans (in parallel) and filter out misses, which probably means more data read than the option above.
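Rough sketches of the token-generation step for each idea (these are not the record layer’s built-in tokenizers, and a real tokenizer would also handle normalization such as lowercasing and punctuation):

```python
def reversed_tokens(tokens):
    # Idea 1: index each token reversed, so suffix queries become prefix scans.
    return [t[::-1] for t in tokens]            # "understand" -> "dnatsrednu"

def suffix_tokens(tokens, min_length=1):
    # Idea 2: index every suffix of each token; raise min_length to skip the
    # shortest suffixes and save space.
    return [t[i:] for t in tokens for i in range(len(t) - min_length + 1)]

def trigram_tokens(tokens):
    # Idea 3: index every substring of length 3; an infix query for "stand"
    # is then rewritten as a phrase search for "sta", "tan", "and".
    return [t[i:i + 3] for t in tokens for i in range(len(t) - 2)]
```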
Two more things to note: in addition to the index maintainer taking a tokenizer option, text predicates in the record layer also take an optional tokenizer, and you can specify different behavior for the tokenizer at query and at index time. This can be used to make sure the same (or compatible) normalization and tokenization options get applied to query strings as are done to the indexed data, and it also means that it will only match the predicate against indexes with compatible tokenization strategies. (For example, if one went with the “reversing” tokenizer, the user can still provide their strings in forward order and the tokenizer will reverse them for them.)