FoundationDB

Understanding text index on fdb record layer


(Wolf Dan) #1

Hey!

I’m trying to understand how the text index works on the record layer, I would like to implement it on my layer, but some things are not clear to me

Based on the paper the index is saved on the database like this

( p r e f i x , t o k e n 1 , pk 1 ) → o f f s e t s 1

So for example the given string "My Car and my dog and mine, my Car is color blue", the result will be something like this

   (prefix, "my", pk) = [0, 3, 7],
   (prefix, "car", pk) = [1, 8],
   (prefix, "and", pk) = [2],
   (prefix, "dog", pk) = [4],
   (prefix, "mine", pk) = [6],
   (prefix, "is", pk) = [9],
   (prefix, "color", pk) = [10],
   (prefix, "blue", pk) = [11]

So my questions are…

  • What’s prefix and pk exactly? From what I understand the pk is the record that is going to be index id, but not sure about the prefix
  • How the index query exactly works and what’s the offset for? I understand that you can get the word by querying with the prefix and the token itself or a partial part of the token, so (prefix, "car") will get [1, 8] and (prefix, "c") will get [1, 8, 10], but here what’s the use of the offsets?
  • In the same topic, what about querying a contains function? for example if I give the query stand and we have the token understand that match that contains, this index match only the full word or can match this case as well?

This can be explicit on the Java code, but is quite big and I’m not good at reading it as wel… Sorry for that ^^’

Thank you!


(Alec Grieser) #2

The pk field is the primary key of the record being indexed. So, if you had a record like:

{
   "id": 1066L,
   "text": "My car and my dog and mine, my Car is color blue"
}

where the id field was the primary key and the text index was on the text field, then your example laying out all of the keys and values stays the same except that pk is replaced with 1066L in each kv-pair.

The prefix is some subspace prefix. In the record layer, each record store is assigned a subspace (in some sense, it is defined by its subspace and meta-data), and within a record store, each index gets its own sub-subspace to avoid two different indexes writing over each other. (I think the document layer does something different, but substitute “record store” with “collection”.) It doesn’t really matter what it is; it’s just important that it exists and is unique.

The actual pseudocode for the indexing work would be something like:

@fdb.transactional
def index_record(tr, record):
    subspace = get_subspace_for_index()
    pk = record.get_primary_key()
    text = extract_text(record)
    token_list = tokenize(text)
    token_to_positions_map = get_position_list_for_each_token(token_list)
    for token, position_list in token_to_positions_map:
        key = subspace.subspace((token,)).pack(pk)
        value = serialize(position_list)
        tr[key] = value

A few caveats here:

  1. This doesn’t handle deletes/updates
  2. The actual data storage format for the text index in the record layer combines adjacent keys within a single token in order to save space, but the format laid out above is equivalent from an API point of view (with the complexities hidden behind the BunchedMap class).

The offsets are for satisfying phrase search queries and proximity search queries. In particular, there are containsPhrase and containsAll within a distance predicates that make use of those offset lists.

If you are searching for a single term (like “car”), then these offsets are of essentially no use (except maybe to do highlighting). In that case, the index is scanned for all keys beginning with (prefix, "car") and the primary keys can be extracted from the index. Then the actual records can be retrieved by primary key (or if the primary key is enough, that can be returned).

If you are searching for a phrase (like “my car”), then the index can be queried by scanning everything beginning with (prefix, "my") and everything beginning with (prefix, "car") in parallel. Then the two scans are first combined using something like merge-join in a traditional database, and any place where the scans intersect corresponds to a record where all of the searched for tokens are in the document. Then one can inspect the position lists to ensure everything is in the right order. Here “my” has positions [0, 3, 7] and car has positions [1, 8]. For phrase search, you are looking for a position in “car” that is exactly 1 greater than a position of “my”. Here, “my” at position 0 and “car” at position 1 satisfies the predicate, and the primary key of that record is returned (1066, if using my sample record above). A similar procedure can be done for proximity search, but it checks instead if the minimum span between the position lists is less than or equal to the span given as a parameter to the query.

Note that there is an index option, TEXT_OMIT_POSITIONS_OPTION, which can be used to stop the index from writing the positions list. This can be used as a space-savings optimization if one doesn’t need to perform those kinds of queries.

As described, no, the index can’t satisfy suffix or infix queries like that. It only supports prefix matching and full-token matching. (This is essentially an artifact of the kinds of range scans that the FDB key-value store supports.) However, there are a couple different solutions that allows the user to extend the text index so that it can do those more powerful kinds of queries, most of which depend on the fact that the record layer text solution allows for pluggable tokenizers. You could:

  1. Write a tokenizer that calls “reverse” on each token before inserting into the index. This can let you satisfy “suffix” queries. In the case of “understand”, it has been inserted into the database as “dnatsrednu”, so a prefix search for “dnats” will find that document. Note that this uses the same amount of space as a regular index, but it only supports suffix search, not infix search.
  2. Write a tokenizer that inserts all suffixes of each token. So “understand” gets inserted as “understand”, “nderstand”, “derstand”, “erstand”, “rstand”, “stand”, “tand”, “and”, “nd”, and “d”. (Maybe you don’t insert the smallest tokens to save space.) Then any prefix scan over the index returns any document with the searched for string in the interior of a token in the original document. So “stand” will match, as will “sta”, “ders”, “ndersta”, etc. Note that for a token of length n, there will be n keys inserted into the index with an order of n2 characters in all inserted tokens together, but each infix query can be satisfied with a single range scan.
  3. Write a tokenizer that, say, writes all tri-grams, i.e., substrings of length 3. So “understand” becomes “und”, “nde”, “der”, “ers”, “rst”, “sta”, “tan”, and “and”. Then an infix search for “stand” becomes a phrase search for “sta”, “tan”, and “and”. Note that each token of length n will be spread across n - 2 keys, but as each inserted key-value pair is only 3 characters in length, it will insert a number of characters (across all tokens) that is linear with the size of the original token. The tradeoff is that to find an infix of size k, it must perform k - 2 scans (in parallel) and filter out misses, which probably means more data read than the option above.

Two more things to note: in addition to the index maintainer taking a tokenizer option, text predicates in the record layer also take an optional tokenizer, and you can specify different behavior for the tokenizer at query and at index time. This can be used to make sure the same (or compatible) normalization and tokenization options get applied to query strings as are done to the indexed data, and it also means that it will only match the predicate against indexes with compatible tokenization strategies. (For example, if one went with the “reversing” tokenizer, the user can still provide their strings in forward order and the tokenizer will reverse them for them.)


(Wolf Dan) #3

Thank you so much!!!

That’s exactly what I was looking for, for my problem I think the second solution in the most suited for it, tho delete and update can be a bit expensive/problematic

I didn’t get my head around with all those papers etc… the explanations where perfect for me to understand all the concept!

thank you once more :smiley: