Supporting Strings with case insensitive comparisons but case preserved in data

We need to persist strings in database and retrieve those strings in the same case as inserted.
But, iteration on such strings as keys (ex: in modelling indexes) should be in case-insensitive sorting order.

To explain this in detail, consider below examples:

In below pseudo codes, consider [ ] as tuple layer encoding and CIS as some datatype called “Case Insensitive String” somehow solving our use case.

(1) User should be able to fetch stored data in case insensitive fashion:

txn.set( [ CIS( “myKey” ) ], <Data> );
range = new Range.startsWith( [ CIS( “MYKeY” ) ] );
txn.getRange( range ) == List{ ( CIS( “myKey” ), <Data> ) }

// Note: The above range statement is same as:
range = new Range( [ CIS( “MYKeY” ) ], strinc( [ CIS( “MYKeY” ) ] ) );

(2) In key-prefix based range queries, user should be able to decode key data in the same case as it was inserted, but data should retrieved in case-insensitive sorting fashion:

txn.set( [ CIS( “myKey1” ) ], <Data1> );
txn.set( [ CIS( “MyKey2” ) ], <Data2> );
range = new Range( ‘\0x00’, ‘\0xff’ ); // get entire subspace
rangeRead = txn.getRange( range );

// Even though ‘m’ > ‘M’, results are returned in case insensitive sort order
rangeRead == List{ ( CIS ( “myKey1” ), <Data1> ), ( CIS ( “MyKey2” ), <Data2> ) };

// There is a way to get strings in original (preserved) case by some API, say originalString()
rangeRead.map(x → x.key().originalString() ) == List{ “myKey1”, “MyKey2” ) }

If we already have a solution for this, please point me to the same. If not, please read below proposed sol’n.

Sol 1: Keep 2 copies
Store 2 copies of each string, (1) in lowercase and (2) in originally inserted case. We can then use (1) for key encoding and (2) for returning data to client. Mapping of (1) → (2) needs to be modeled here.

This increases key size by 2x which is going unacceptable for us, considering our large indexed data.

Sol 2: A new Datatype

Fortunately, string’s characters for our use case are printable ASCII characters (32 to 126) only. Based on this assumption, here is a solution below.

Tuple Layer already has a datatype “String” which effectively uses ASCII table to map character to byte. For solving our use-case, we introduce a datatype CIS in tuple layer with the same approach as “String” but with different mapping table:

(Space) => 0
! => 1
...(skipped)....
A => 33
a => 34
B 
b 
..(skipped)...
Z
z 
[ 
..(skipped)....
~ => 94

Using this mapping, set() is straightforward. But for read(), we use all upper-case in start range and all lower-case in end range.

(1) fetch stored data in case insensitive fashion:

txn.set( [ CIS ( “myKey” ) ], <Data> );

start = [ CIS ( “MYKeY” ).upperCase() ];
end = strinc( [ CIS ( “MYKeY” ).lowerCase() ] );
txn.getRange( new Range( start, end ) ) == List{ ( CIS ( “myKey” ), <Data> ) }

(2) decode key in same case but retrieve data in case-insensitive sorting:

txn.set( [ CIS ( “myKey1” ) ], <Data1> );
txn.set( [ CIS ( “MyKey2” ) ], <Data2> );
rangeRead = txn.getRange( new Range( ‘\0x00’, ‘\0xff’ ) );

rangeRead == List{ ( CIS ( “myKey1” ), <Data1> ), ( CIS ( “MyKey2” ), <Data2> ) };

rangeRead.map(x → x.key().originalString() ) == List{ “myKey1”, “MyKey2” ) }

Below is a more complicated example of cases where one models composite indexes based on tuples:

// Assume that we have a dataset of people with composite index
// on columns: [ city, age, company, name ], inserted as:
txn.set( [ CIS ( “city” ), <age>, CIS ( “company” ), CIS ( “name” ) ], <empty> );

// To iterate over all people in “Sydney” aged 30 in “Google”, we do:
start = [ CIS ( “Sydney” ).upperCase(), 30, CIS ( “Google” ).upperCase() ];
end = strinc( [ CIS ( “Sydney” ).lowerCase(), 30, CIS ( “Google” ).lowerCase() ] );
cursor = txn.getRange( new Range( start, end ) );

Additionally:

  • For deletion, we need to do: clearRange( upperCase, lowerCase )
  • Before every set( key ), we need to do: clearRange( upperCase, lowerCase )

It would be great if fdb team can provide feedback on this approach.

  • Does this approach sound right or is there a simpler way to solve this?
  • If it sounds right, could you point me to loop-holes/bugs in this approach, if any?
  • Is this very specific to our use case or could this be generalized & contributed to fdb?

I’ve heard CouchDB people discussing a similar issue, where a different bytestring is needed for proper sorting than the string itself. (+cc @garrensmith or @rnewson) I believe their solution was to store two copies of the string.

Unless I’m missing something, between AAA and aaa is XYZ, so I don’t think doing this range read is going to work, as you’ll read about half your data set each time?

If all you really need is just to re-establish the proper case of the inserted ASCII-only-and-never-UTF8-string, then I’d probably just uppercase the key, and keep a bitvector in the value of which characters are supposed to be uppercase. When you need the actual string, you merge the key with the bit vector to produce the appropriately cased string.

If one were to introduce CIS for keys, you could technically suffix this bitvector onto the end of the key, and do a range read, but the two approaches would be essentially equivalent.

A CouchDB index (or “view” in our parlance) has a defined collation order that we are obliged to preserve as we move to a FoundationDB backend. Collation order is (obviously) defined rigidly for keys in FDB as the byte ordering, with no API to alter that.

We use ICU sort keys (http://userguide.icu-project.org/collation/concepts#TOC-Sortkeys-vs-Comparison) to convert any strings emitted by an index definition into a (longer) value that compares byte-for-byte in the same way that the ICU collator we currently use compares.

This isn’t strictly about case-folding, as I don’t recognise that as a CouchDB problem (if the user wants case-insensitive search today, they’d have to call toLowerCase() in their map function).

The other indexes we support at IBM Cloudant (and this will also be true for CouchDB 3.0) is Lucene indexing, which does (typically) perform case-folding. Cloudant’s solution there is https://github.com/cloudant-labs/fdblucene, which simply uses FDB as a file store, and so we didn’t need to alter our approach on case-insensitivity there either.

This is the long way to say that I don’t recall the discussion you are referring to, but hopefully the above was useful even if not relevant to the problem at hand.

XYZ does not come between AAA and aaa in the character to byte mapping I proposed. It is not based on ASCII mapping. As per the mapping given by me in sol’n:

  • AAA = (byte) 33, (byte) 33, (byte) 33
  • aaa = (byte) 34, (byte) 34, (byte) 34
  • XYZ = (byte) 79, (byte) 81, (byte) 83

So AAA < aaa <XYZ.

Ah, right, yes. Apparently that didn’t sink in, sorry.

You’re still going to have troubles with this for compound keys though, right? Using your more complicated example as a template to define three tuples:

a = [  CIS( “CITY” ), 20,  CIS( “company” ),  CIS( “name” ) ]
b = [  CIS( "City" ), 30,  CIS( "farm" ), CIS( "walrus" ) ]
c = [  CIS( "city" ), 20, CIS( "company" ), CIS( "name" ) ]

then when your composite index is being searched for ( CIS("city"), 20, CIS("company"), _ ), you’d want to only get a and c, but you’d also get b. This would mean that you could only have one CIS per index, if you wish for it to be precise, and all non-CIS keys would need to precede it in the index. That may or may not be an actual problem for you.

Ah! yes! So for such use cases, bitvector approach is the answer.

Thanks a lot.

I’ll also say that the Record Layer also recently added support for collation indexes using the CollationFunctionKeyExpression with pluggable collator support (with default implementations using the native tools in the Java runtime environment and ICU4J).

You can see the implementation here: https://github.com/FoundationDB/fdb-record-layer/pull/307/files

This takes the approach of converting the string to a byte array where the unsigned byte order of the string is consistent with your desired sort order (so, in your case, converting all characters to lower case and then presumably UTF-8 or ASCII encoding). By default, that is the only thing that goes into the index, so to get the original string back, you can look up the source record from the index. One can also add additional columns to the index (including in the value) and then the Record Layer could use that to retrieve the original string without needing to resolve the original record (at the cost of extra storage).


As to comments on the original approach presented, I think it also has the following problem. Consider the following three strings:

  • STRING
  • SUPER
  • string

If I’m understanding the proposal correctly, then I think CIS("STRING") < CIS("SUPER") < CIS("string") as CIS('S') < CIS('s'). So if you had those three strings all in your index, then sorting by their CIS’d values will produce the wrong order (with “STRING”, “string”, “SUPER” expected). This would also cause the proposed range deletes to not work properly in this case, as a deletion of the “string” key (case insensitively) would also delete “SUPER”.

Unless I’m mistaken.

I’m not sure if it will help much in your case. But here is CouchDB’s encoding source code

We use ICU sort strings to sort all strings correctly. We then keep the original keys in the value so that they can be returned back to the user.

Thanks for all the inputs. Looks like the ultimate solution is to have 2 copies:

  1. lowecase string
  • which goes in index (keys) for sorting
  1. original string
  • which can be optimally stored as delta from #1 using BitVector approach

This approach has two getRange() calls on separate ranges while iterating index.

This approach requires extra storage, but in BitVector approach, this might not be much.

Also, in indexes, storing BitVector/original string in Value rather than Key will ensure that we don’t disturb the overall sorting order of the index. If we are storing it in index itself (Key), we need to be cautious of sorting order changes.

Code references of Record Layer and CouchDb are very helpful here. Thanks :slight_smile: