Proposal for a `fdb.range_near(key, limit)`

While working on approximate string matching (https://stackoverflow.com/q/58065020/140837) I stumbled upon the need to lookup the nearest keys around an input key. Hence the idea to have a fdb.range_near(key, limit).

A workaround could be the query by prefixes of key of decreasing length until reaching limit found keys or the empty prefix. That seems like a waste because results of db.range_prefix(key[:len(key) - n] are included in db.range_prefix(key[:len(key) - n - 1] (where n is strictly bigger than the length of the associated subspace prefix)

Maybe RankedSet will be a better solution?

Could you explain precisely what you would wish the semantics of range_near to be?

I wish to be able to fetch at most LIMIT key-value pairs around KEY. For instance,
given the following database:

+=========+=========+
|   key   |  value  |
+=========+=========+
| b'\x00' |  "foo"  |
+---------+---------+
| b'\x01' |  "abc"  |
+---------+---------+
| b'\x02' |  "qux"  |
+---------+---------+
| b'\x03' |  "bar"  |
+---------+---------+
| b'\x04' |  "baz"  |
+=========+=========+

I would like, the following python code:

db.range_near(b'\x02', limit=3)

To return the following key-value pairs:

+=========+=========+
|   key   |  value  |
+=========+=========+
| b'\x01' |  "abc"  |
+---------+---------+
| b'\x02' |  "qux"  |
+---------+---------+
| b'\x03' |  "bar"  |
+=========+=========+

Would something like key-selectors help? For your previous example, we could fetch n/2 keys above and n/2 below the pivot key and then do another call to make up for any shortfall.

Yes, indeed key selectors can help.

Thanks.

That will not necessarily be the nearest keys. To be sure, that the returned keys are the nearest, one will need to fetch LIMIT keys above and LIMIT keys below, and compute the nearest keys. That is not bad. In my case, LIMIT is not more than 10.

(By the way, I am still not sure about the “approximate string matching” algorithm I wrote about in the original post).

Thanks!

Yes, if there is a notion of ‘nearness’ then one would need to fetch n keys on each side. I was trying to fetch n keys ‘around’ the pivot key, as mentioned in the example. I will read the original post more carefully and see if there is anything possible for it.

Do not put too much effort in this except if there is a large interest
for that feature, because for my use case, selectors work very well.

Also, just in case someone wants to experiments with the “approximate
string matching” algorithm I am thinking about. Here is a example run:

fuzzbuzz$ cat China-modified | python fuzz.py query
* most similar according to bbk fuzzbuzz
** wikipedia-vital-articles-level-3/Geography/Countries/China 	 0
** wikipedia-vital-articles-level-3/People/Politicians and leaders/Mao_Zedong 	 -71
** wikipedia-vital-articles-level-3/Society and social sciences/Language/Spanish_language 	 -84
** wikipedia-vital-articles-level-3/Society and social sciences/Language/Chinese_language 	 -89
** wikipedia-vital-articles-level-3/Science/Earth science/Rain 	 -92
** wikipedia-vital-articles-level-3/Geography/Hydrologic features/Yangtze 	 -95
** wikipedia-vital-articles-level-3/Geography/Countries/Mexico 	 -96
** wikipedia-vital-articles-level-3/History/Ancient history/Achaemenid_Empire 	 -96
** wikipedia-vital-articles-level-3/Geography/Cities/Mexico_City 	 -97
** wikipedia-vital-articles-level-3/Society and social sciences/Politics and government/International_Red_Cross_and_Red_Crescent_Movement 	 -97
* bbk fuzzbuzz computed the hamming distance against: 73 documents (out of 995 documents in total)

The China-modified document is a modified version of the first found document named China. The algorithm computed the hamming distance against 73 documents and kept the top 10.

The algorithm looks like the following:

  • Compute the simhash (in the current test it use 256bits pyblake2 hash, maybe CityHash would be better)
  • Split the simhash bitstring into 4 chunks of equal length
  • For each chunk, compute the merkel-tree with OR binary operator
  • Serialize the merkel-tree using a top-down depth-first order
  • Add as a suffix the bitstring chunk

For instance, 0b1010 will result in a merkel tree, that looks like:

     1
   /   \
  1     1
 / \   / \
[1][0][1][0]

The top-down traversal will result in the following bitstring:

'0b1110110'

Then add the original chunk to the above to obtain:

'0b11101101010'

That last bitstring captures more similarity that the original
chunk. That is, the smaller the edit Hamming distance between two
simhashes, the bigger the common prefix in the resulting bitstring.

The above output was genereated by a prototype that is not really
approximate string matching, but more like, semantic similarity
because it rely on bag-of-stems to compute the simhash:
https://github.com/amirouche/fuzzbuzz/blob/master/fuzz.py#L126 (the code is not nice).