We need to persist strings in database and retrieve those strings in the same case as inserted.
But, iteration on such strings as keys (ex: in modelling indexes) should be in case-insensitive sorting order.
To explain this in detail, consider below examples:
In below pseudo codes, consider [ ] as tuple layer encoding and CIS as some datatype called “Case Insensitive String” somehow solving our use case.
(1) User should be able to fetch stored data in case insensitive fashion:
txn.set( [ CIS( “myKey” ) ], <Data> );
range = new Range.startsWith( [ CIS( “MYKeY” ) ] );
txn.getRange( range ) == List{ ( CIS( “myKey” ), <Data> ) }// Note: The above range statement is same as:
range = new Range( [ CIS( “MYKeY” ) ], strinc( [ CIS( “MYKeY” ) ] ) );
(2) In key-prefix based range queries, user should be able to decode key data in the same case as it was inserted, but data should retrieved in case-insensitive sorting fashion:
txn.set( [ CIS( “myKey1” ) ], <Data1> );
txn.set( [ CIS( “MyKey2” ) ], <Data2> );
range = new Range( ‘\0x00’, ‘\0xff’ ); // get entire subspace
rangeRead = txn.getRange( range );// Even though ‘m’ > ‘M’, results are returned in case insensitive sort order
rangeRead == List{ ( CIS ( “myKey1” ), <Data1> ), ( CIS ( “MyKey2” ), <Data2> ) };// There is a way to get strings in original (preserved) case by some API, say originalString()
rangeRead.map(x → x.key().originalString() ) == List{ “myKey1”, “MyKey2” ) }
If we already have a solution for this, please point me to the same. If not, please read below proposed sol’n.
Sol 1: Keep 2 copies
Store 2 copies of each string, (1) in lowercase and (2) in originally inserted case. We can then use (1) for key encoding and (2) for returning data to client. Mapping of (1) → (2) needs to be modeled here.
This increases key size by 2x which is going unacceptable for us, considering our large indexed data.
Sol 2: A new Datatype
Fortunately, string’s characters for our use case are printable ASCII characters (32 to 126) only. Based on this assumption, here is a solution below.
Tuple Layer already has a datatype “String” which effectively uses ASCII table to map character to byte. For solving our use-case, we introduce a datatype CIS in tuple layer with the same approach as “String” but with different mapping table:
(Space) => 0
! => 1
...(skipped)....
A => 33
a => 34
B
b
..(skipped)...
Z
z
[
..(skipped)....
~ => 94
Using this mapping, set() is straightforward. But for read(), we use all upper-case in start range and all lower-case in end range.
(1) fetch stored data in case insensitive fashion:
txn.set( [ CIS ( “myKey” ) ], <Data> );
start = [ CIS ( “MYKeY” ).upperCase() ];
end = strinc( [ CIS ( “MYKeY” ).lowerCase() ] );
txn.getRange( new Range( start, end ) ) == List{ ( CIS ( “myKey” ), <Data> ) }
(2) decode key in same case but retrieve data in case-insensitive sorting:
txn.set( [ CIS ( “myKey1” ) ], <Data1> );
txn.set( [ CIS ( “MyKey2” ) ], <Data2> );
rangeRead = txn.getRange( new Range( ‘\0x00’, ‘\0xff’ ) );rangeRead == List{ ( CIS ( “myKey1” ), <Data1> ), ( CIS ( “MyKey2” ), <Data2> ) };
rangeRead.map(x → x.key().originalString() ) == List{ “myKey1”, “MyKey2” ) }
Below is a more complicated example of cases where one models composite indexes based on tuples:
// Assume that we have a dataset of people with composite index
// on columns: [ city, age, company, name ], inserted as:
txn.set( [ CIS ( “city” ), <age>, CIS ( “company” ), CIS ( “name” ) ], <empty> );// To iterate over all people in “Sydney” aged 30 in “Google”, we do:
start = [ CIS ( “Sydney” ).upperCase(), 30, CIS ( “Google” ).upperCase() ];
end = strinc( [ CIS ( “Sydney” ).lowerCase(), 30, CIS ( “Google” ).lowerCase() ] );
cursor = txn.getRange( new Range( start, end ) );
Additionally:
- For deletion, we need to do: clearRange( upperCase, lowerCase )
- Before every set( key ), we need to do: clearRange( upperCase, lowerCase )
It would be great if fdb team can provide feedback on this approach.
- Does this approach sound right or is there a simpler way to solve this?
- If it sounds right, could you point me to loop-holes/bugs in this approach, if any?
- Is this very specific to our use case or could this be generalized & contributed to fdb?