My data model requires an array with insert and delete capabilities. In the FDB documentation there is a description of how to model an array, but it doesn’t include insertion and deletion. I have not been able to find mention of any good algorithms for this. Here is what I’m going to do and would appreciate any feedback if anyone has a better way.
My API needs to do the following:
Read entire array
For this, just do a range read to access all array elements
Delete at index
Clear the element at an offset from the first array element
Append
Append items with the key [array key] + [versionstamp]. Since versionstamps are monotonically increasing, this sorts the array elements in the correct order.
Insert at index
This is the tricky one. When an element needs to be inserted between two other elements, generate a key whose versionstamp-like portion is halfway between that of the previous element and that of the following one. If inserting at index zero, use a value somewhat less than the former first element's. If inserting at the end, just do an append as explained above. (A sketch of all four operations follows below.)
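To make that concrete, here is roughly what I have in mind, sketched with the Python bindings. The prefix, the 12-byte position width, and the helper names are placeholders I picked, and I haven't battle-tested the versionstamp plumbing, so treat this as an illustration of the approach rather than working code:

```python
import fdb

fdb.api_version(710)
db = fdb.open()

ARRAY_PREFIX = b'myarray/'                           # hypothetical prefix for this array
POS_BYTES = 12                                       # 10-byte versionstamp + 2-byte user version
ARRAY_END = ARRAY_PREFIX + b'\xff' * (POS_BYTES + 1) # just past any element key

@fdb.transactional
def read_all(tr):
    """Read the entire array with a single range read, in key order."""
    return [v for _, v in tr.get_range(ARRAY_PREFIX, ARRAY_END)]

@fdb.transactional
def delete_at(tr, index):
    """Clear the element at the given offset from the first element."""
    items = list(tr.get_range(ARRAY_PREFIX, ARRAY_END, limit=index + 1))
    if len(items) > index:
        del tr[items[index].key]

@fdb.transactional
def append(tr, value):
    """Append with an atomic versionstamped key, so later commits sort later."""
    # 10 placeholder bytes (overwritten with the versionstamp at commit) plus a
    # 2-byte user version, followed by the 4-byte little-endian offset of the
    # placeholder that set_versionstamped_key expects at recent API versions.
    key = ARRAY_PREFIX + b'\x00' * 10 + b'\x00\x00'
    key += len(ARRAY_PREFIX).to_bytes(4, 'little')
    tr.set_versionstamped_key(key, value)

@fdb.transactional
def insert_at(tr, index, value):
    """Insert by writing a position halfway between the two neighbours."""
    items = list(tr.get_range(ARRAY_PREFIX, ARRAY_END, limit=index + 1))
    if len(items) <= index:                    # inserting at or past the end
        append(tr, value)
        return
    nxt = int.from_bytes(items[index].key[len(ARRAY_PREFIX):], 'big')
    prv = (int.from_bytes(items[index - 1].key[len(ARRAY_PREFIX):], 'big')
           if index > 0 else 0)                # index 0: anything below the first element
    mid = (prv + nxt) // 2
    if mid == prv:
        raise RuntimeError('no room left; rewrite and re-space the whole array')
    tr[ARRAY_PREFIX + mid.to_bytes(POS_BYTES, 'big')] = value
```

The RuntimeError branch is the failure case I'm worried about.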
My main concern is how quickly I could run out of space between two existing elements and no longer have room to insert. It appears that, including the user-version field at the end of the versionstamp, there are roughly 4 least-significant bytes of headroom between successive versionstamps. That sounds like plenty, since 4 bytes can hold over 4 billion values. However, because this algorithm chooses a key halfway between the previous and next keys, each insert at the same index consumes an entire bit of those 32 available bits. So there is potential for a collision after about 32 inserts at the same index. The only solution at that point seems to be to rewrite every key in the array, spacing them out appropriately.
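A toy calculation (plain Python, nothing FDB-specific) shows how quickly that headroom disappears when every insert lands at the same index:

```python
# Two neighbouring elements whose positions start a full 32-bit gap apart.
lo, hi = 0, 2 ** 32
inserts = 0
while hi - lo > 1:        # is there still a midpoint to take?
    hi = (lo + hi) // 2   # every insert at the same index halves the gap
    inserts += 1
print(inserts)            # 32 -- after that the only option is to rewrite
                          # every key in the array and re-space them
```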
Although this seems like a simple enough problem, I think you have stumbled on an iceberg of complexity. I wrestled with a similar issue when working on a text editor based on CRDTs (the Treedoc paper is good reading if you are interested in this space). Luckily, FoundationDB's consistency model makes CRDTs unnecessary here.
I would recommend against using the versionstamp, and certainly against making versionstamp-like keys. Instead, you can use a trivial tree algorithm where the first item you write is 0; if you need to insert an item before it, that item is 00, and an item after it would be 1. This should be pretty simple to do with key selectors.
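Roughly, the labelling could look like the sketch below. The terminator byte in encode() is one way (my choice, not something FDB gives you) to make the tree's in-order position agree with FDB's plain byte ordering, and returning a + "1" for the append case is just one consistent option:

```python
def label_between(a, b):
    """A label that falls strictly between a and b in the tree (in-order)
    ordering; None for a means 'before the first element', None for b means
    'after the last element'."""
    if a is None and b is None:
        return "0"                 # the very first element, as described above
    if a is None:
        return b + "0"             # left child of b: sorts just before b
    if b is None:
        return a + "1"             # right child of a: sorts just after a
    if b.startswith(a):
        return b + "0"             # b sits below a's right side, so go left of b
    return a + "1"                 # otherwise a's right child fits in the gap

def encode(label):
    """Turn a label into a byte key whose plain lexicographic order matches
    the tree order: map 0 -> 0x30, 1 -> 0x32, terminate with 0x31."""
    return bytes(0x30 if c == "0" else 0x32 for c in label) + b"\x31"

first = "0"
before = label_between(None, first)   # "00", lands before the first item
after = label_between(first, None)    # "01" here; "1" works equally well
assert encode(before) < encode(first) < encode(after)
```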
The problem you will quickly run into is that if you are simply appending elements to the end of this array, your key size will grow linearly, i.e. the 100th item's key will be 99 1s. To solve this you will need to occasionally balance your tree. There are a number of well-worn self-balancing tree algorithms if you feel like going down a rabbit hole, or you can simply rewrite the entire array with consecutive n-bit keys, where 2^n is greater than the number of items in the array. You can also mitigate this issue by adding "padding" when you choose identifiers, to leave room for new identifiers that wouldn't increase the key size.
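As a sketch of the "rewrite the entire array" option, re-spacing positions evenly across a fixed-width key space could look like this (the width and names are illustrative):

```python
def respaced_positions(count, width_bytes=4):
    """Evenly spaced positions in a width_bytes-wide key space, leaving
    padding on both sides of every element for future inserts.
    Assumes count is much smaller than 256 ** width_bytes."""
    step = 256 ** width_bytes // (count + 1)
    return [((i + 1) * step).to_bytes(width_bytes, 'big') for i in range(count)]

# e.g. respaced_positions(3, width_bytes=1) -> [b'\x40', b'\x80', b'\xc0'];
# the rebalance itself would read the values in order, clear the old range,
# and rewrite value i at prefix + respaced_positions(count)[i].
```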
Depending on how you define the "identity" of elements, you might run into some issues when the identifiers change, and the choice of how to solve that will likely be domain-specific.
I would suggest not viewing this as an array data modeling problem, but rather as needing a user-specified sort order plus the ability to insert/delete elements at arbitrary positions while maintaining that order. The problem then becomes roughly isomorphic to "How to model a Leaderboard".
Similarly to what George proposed, I think one can go a step further and rely on the infinite number of things that are lexicographically between any two values. For example, if you have two elements at "index" X and Y, the key XM sorts lexicographically between X and Y. With a bit of randomness in choosing the in-between point, and by taking advantage of the full 256 values in a byte, you could still achieve reasonably conflict-free inserts/deletes. You would then need to maintain an index of the number of elements under any given prefix to make seeking a particular index faster.
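A sketch of that kind of key generation (my own helper, not an FDB API; it assumes a key strictly between the two bounds exists, which is not the case when the upper key is the lower key plus trailing zero bytes, and it makes no effort to keep keys short):

```python
import os

def key_between(a, b):
    """A byte key strictly between a and b (lexicographically); b=None means
    'anything after a'. Raises if no such key exists (e.g. b == a + b'\\x00')."""
    out = bytearray()
    i = 0
    below_b = b is None            # have we already committed to sorting < b?
    while True:
        av = a[i] if i < len(a) else 0     # pad a with the minimum byte
        if below_b:
            bv = 256                       # the upper bound no longer binds
        elif i < len(b):
            bv = b[i]
        else:
            raise ValueError("no key fits strictly between a and b")
        if bv - av > 1:
            # Room at this byte: pick a random value in the gap so concurrent
            # writers are unlikely to collide on the same in-between key.
            out.append(av + 1 + os.urandom(1)[0] % (bv - av - 1))
            return bytes(out)
        out.append(av)
        if av < bv:
            below_b = True                 # we diverged below b at this byte
        i += 1

# e.g. key_between(b"X", b"Y") might return b"XM", b"X\x80", ...
```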
Do be aware that "insert X as the 100th element" is going to be an inherently conflict-prone operation, as you must also ensure that no other element was inserted or deleted between 0 and 99; otherwise you'd be inserting at a now-incorrect place. However you go about finding the 100th indexed location, the reads involved will likely leave enough conflict ranges in your transaction to enforce this, but minor changes in how you specify the exact contract of insert/delete will have a large impact on how concurrent this data structure can be.