Interesting question. Building indexes from existing data (especially if you add the additional constraint that you want to do it from a live dataset while concurrently permitting reads and writes) can be somewhat tricky due to some of the constraints you’ve already mentioned, but it also ends up being a necessary primitive for any sufficiently advanced layer that wants to support schema evolution. I can sketch out here how that might be done.
As you point out, sufficiently large data sets may take longer than five seconds to read, and additionally, a transaction can do at most 10 MB of writes (actually, it’s probably best to keep them under 1 MB if possible). There is the further problem that this transaction will necessarily have a read conflict range spanning the entire data set, so any concurrent write will cause the index-building transaction to fail. (In theory, one could get around this by writing a key in FDB to “lock” the data, i.e., you would enforce client-side that every write first checks whether the lock key is set and, if so, fails in your application rather than being submitted to be committed.) So, it becomes essentially a necessity for any layer that implements some kind of index maintenance scheme to also implement a multi-transaction index builder.
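For illustration, that client-side convention might look something like the following sketch using the Python bindings (the lock key, its name, and the helper itself are hypothetical):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical key that, when set, tells every client to reject writes.
lock_key = fdb.tuple.pack(('dataset_lock',))

@fdb.transactional
def guarded_write(tr, key, value):
    # Enforced purely by convention in the application: every writer
    # must perform this check before doing any writes.
    if tr[lock_key].present():
        raise RuntimeError('dataset is locked for index building')
    tr[key] = value
```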
So, how do you do that? The simplest way would be: (1) start a transaction, (2) read some number of keys starting from the beginning of the data range, (3) write the appropriate index keys for the keys read, (4) commit the result, (5) go back to (1), except that if (4) succeeded, you read from just after the last key read rather than from the beginning of the range. (A sketch of this loop follows the caveats below.) This will usually work, with a few caveats:
- There must be some mechanism to avoid using the unbuilt index to answer queries while simultaneously updating the index as data are inserted into and deleted from the main data set. This usually means indexes need a “write-only” mode, and that state needs to be persisted somewhere; when the build completes, it can mark the index as readable by mutating that state.
- This will only do the right thing if index updates are idempotent. For simple indexes, that assumption holds, so the only issue is that index updates might end up duplicating work. (As in your example, we might write Bob’s eye color twice.) You can imagine indexes where that won’t work, though. For example, consider an index that counts how many times each value appears (a histogram index), which you might implement efficiently using our atomic mutation API.
- If whatever is building the index dies midway through, the entire index build must be restarted.
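To make that concrete, here is a minimal sketch of the loop using the Python bindings. The subspace layout (people keyed by name mapping to eye color, as in your example), the index-state key, and the batch size are all hypothetical; a real layer would also want to bound each batch by bytes, not just key count.

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical layout for this sketch: data[(person,)] = eye_color,
# index[(eye_color, person)] = '', plus a state key for the index itself.
data = fdb.Subspace(('data',))
index = fdb.Subspace(('index', 'eye_color'))
index_state_key = fdb.tuple.pack(('index_state', 'eye_color'))

BATCH = 1000  # keys per transaction; keeps each commit well under the limits

@fdb.transactional
def build_batch(tr, begin):
    # Index one batch of data keys; return the last data key read, or None.
    last_key = None
    for kv in tr.get_range(begin, data.range().stop, limit=BATCH):
        person = data.unpack(kv.key)[0]
        eye_color = kv.value
        tr[index.pack((eye_color, person))] = b''
        last_key = kv.key
    return last_key

@fdb.transactional
def mark_readable(tr):
    # Per the first caveat, the index stays "write-only" until the build is done.
    tr[index_state_key] = b'readable'

def build_index(db):
    begin = data.range().start
    while True:
        last_key = build_batch(db, begin)
        if last_key is None:
            break  # ran off the end of the data range
        # Resume the next transaction just after the last key we processed.
        begin = fdb.KeySelector.first_greater_than(last_key)
    mark_readable(db)
```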
To address the second and third caveats, you can be slightly more sophisticated by atomically writing, alongside your index updates, a key that records how far through the index build you have gotten. That way, if the index-building job dies partway through, it can be resumed from where it left off rather than from the beginning. And once that information is available, you can also use it to handle non-idempotent updates to a write-only index: update the index at write time if and only if the key being added is within the built range. If it’s outside the built range, the index builder will pick it up. If it’s within the range, then the initial write will index it. Either way, it is indexed exactly once.
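Continuing the sketch above (reusing `data`, `BATCH`, and `db` from it), the builder can commit a progress key atomically with each batch, and the write path can consult it. Here the index is the non-idempotent, histogram-style count index mentioned earlier, maintained with the atomic add mutation; the key names are again hypothetical.

```python
import struct

# Hypothetical keys for this sketch, alongside the layout above.
progress_key = fdb.tuple.pack(('index_build_progress', 'eye_color_count'))
count_index = fdb.Subspace(('index', 'eye_color_count'))

ONE = struct.pack('<q', 1)  # little-endian 64-bit integer for atomic add

@fdb.transactional
def build_count_batch(tr, begin):
    last_key = None
    for kv in tr.get_range(begin, data.range().stop, limit=BATCH):
        tr.add(count_index.pack((kv.value,)), ONE)
        last_key = kv.key
    if last_key is not None:
        # Committed atomically with the mutations above, so a crash can
        # never leave the progress key out of step with the work done.
        tr[progress_key] = last_key
    return last_key

@fdb.transactional
def add_person(tr, person, eye_color):
    # Assumes a brand-new record; an update would also need to decrement
    # the count for the old value, which is omitted here.
    key = data.pack((person,))
    tr[key] = eye_color
    built_through = tr[progress_key]
    # Maintain the write-only index only if this key falls inside the range
    # the builder has already covered; otherwise the builder will count it.
    if built_through.present() and key <= built_through:
        tr.add(count_index.pack((eye_color,)), ONE)
```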
So, the final question is how to implement parallelism. To do that, you can extend the technique above to write multiple keys, instead of just one, recording which ranges of the main dataset have been built. Then multiple index builders can work on different ranges at once and use those keys to avoid stepping on each other. The details are a little complicated, but that’s the general idea. If you know roughly the distribution of your data, you can pick boundaries to shard on. We also offer an API that can be used to find the boundary keys of the underlying shards. Note that the boundaries are not required for correctness, just performance, so once you’ve picked boundaries, you can keep using the same ones for the entirety of the build (or even change them at will, depending on how you track the progress of each range), and things will just “work”.
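With the Python bindings, that shard-boundary API is exposed as `fdb.locality.get_boundary_keys`; a sketch of using it to carve the data range into sub-ranges for parallel builders (again reusing `data` and `db` from above) might look like:

```python
def pick_build_ranges(db):
    r = data.range()
    # get_boundary_keys returns the keys at which the underlying shards split
    # within [r.start, r.stop); convenient, but not required, boundaries.
    bounds = [r.start] + list(fdb.locality.get_boundary_keys(db, r.start, r.stop)) + [r.stop]
    # Adjacent pairs become the sub-ranges handed to parallel workers, each of
    # which runs the batched loop above over its own range with its own
    # progress key.
    return [(b, e) for b, e in zip(bounds, bounds[1:]) if b < e]
```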
As to your question about the client batching writes efficiently, note that all FDB writes are completely client-local until the transaction is committed, at which point all writes are batched together and submitted. Until then, all writes exist only in memory, so locality shouldn’t matter too much. (This is another reason, somewhat related to the 10 MB transaction size limit, that rebuilding an entire index in one transaction might not be possible.) Once the transaction is submitted, your writes will have to be split apart again and might end up on a variety of different storage servers, but that usually isn’t too bad.