Best way to add an index on already-existing data?

Hi all!

What an exciting project! I’m looking forward to seeing what people come up with. In particular, there are some really interesting things you could do with distributed graph databases that haven’t been possible with eventually consistent backends. (@wwilson - any chance of reviving https://github.com/wwilson/blueprints-foundationdb-graph
as linked at https://groups.google.com/forum/#!topic/gremlin-users/L6X0Wc2Sp2k ?)

One thing that I haven’t seen in the documentation is how one would go about creating an index like in https://apple.github.io/foundationdb/layer-concept.html#the-foundationdb-way if data already exists.

For instance, in that example, if Bob were added after the code to maintain an eye-color index was deployed, setting people/bob/eye_color = blue would also set eye_color/blue/bob = true in the same transaction, and all would be good. But what about Alice and Anne, who were already in the database? We could run a transaction doing something like “for each person, set eye_color/$color/$person = true.”
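Concretely, I’m picturing something like this naive one-shot backfill (Python bindings; the subspace layout here is just my own illustration of the doc’s example, not anything official):

```python
import fdb
import fdb.tuple

fdb.api_version(510)  # or whichever API version your cluster speaks
db = fdb.open()

people = fdb.Subspace(('people',))
eye_color_index = fdb.Subspace(('eye_color',))

@fdb.transactional
def backfill_eye_color_index(tr):
    # Naive approach: scan every person and write an index entry,
    # all inside one transaction.
    for kv in tr[people.range()]:
        person, field = people.unpack(kv.key)
        if field == 'eye_color':
            tr[eye_color_index.pack((kv.value.decode(), person))] = b''
```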

But if there were a large number of people, or people had large values, it might not be feasible to do this in a single five-second transaction. (Particularly if you were building something like a full-text index!) Would you parallelize on boundaries? What if boundaries shifted mid-run? And if not boundaries, how would you determine how to break the range of people into correctly-sized chunks to parallelize?

Or would you forget about parallelization, just build the index serially, streaming in keys, and committing a transaction when elapsed time gets close to 5 seconds, then starting again from that key? You might end up re-indexing Bob in this case, but if indexing is idempotent that shouldn’t be a problem.

Also, do we just trust the client library to batch our writes efficiently? We’d be writing pretty randomly across the eye_color range as we reindex. Presumably the client node doesn’t need to worry about this until we commit the transaction, so it would keep things in memory until that point?

Many more questions to come, but thought this would be a good starting point for discussion!

Cheers,

Brenton

Interesting question. Building indexes from existing data (especially if you add the constraint that you want to do it on a live dataset while concurrently permitting reads and writes) can be somewhat tricky for the reasons you’ve already mentioned, but it also ends up being a necessary primitive that many sufficiently advanced layers will need in order to support schema evolution. I can go through a sketch here of how that might be done.

As you point out, sufficiently large data sets may take longer than five seconds to read, and additionally, a transaction can do at most 10 MB of writes (actually, it’s probably best to keep them under 1 MB, if possible). There is a further problem: this transaction will necessarily have a read conflict range that spans the entire data set, so any concurrent write will cause the index-building transaction to fail. (In theory, one could get around this by writing a key in FDB that “locks” the data, i.e., enforcing client-side that every write first checks whether the lock key is set and, if so, fails in your application rather than being submitted for commit.) So, it becomes essentially a necessity for any layer that implements some kind of index maintenance scheme to also implement a multi-transaction index builder.
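(As an aside, that client-side lock convention could be as simple as the sketch below; the lock key name is made up, and this is only a convention the application agrees to follow, not something FDB enforces.)

```python
import fdb
import fdb.tuple

fdb.api_version(510)
db = fdb.open()

lock_key = fdb.tuple.pack(('index_build_lock',))

@fdb.transactional
def guarded_write(tr, key, value):
    # Every writer agrees to check the lock key and refuse to write
    # while an index build holds it.
    if tr[lock_key].present():
        raise RuntimeError('writes are paused while an index is being built')
    tr[key] = value
```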

So, how do you do that? The simplest way would be to (1) start a transaction, (2) read some number of keys starting from the beginning of the data range, (3) write the appropriate index keys for the keys read, (4) commit the result, and (5) go back to (1), but if (4) succeeded, resume from the last key you processed rather than the beginning of the range. This will usually work (there’s a rough sketch of the loop after the list below), with a few caveats:

  1. There must be some mechanism to avoid using the unbuilt index to answer queries while still updating it as data are inserted into and deleted from the main data set. This usually means indexes need a “write-only” mode whose state is persisted somewhere; when the build completes, the builder marks the index as readable by mutating that state.
  2. This will only do the right thing if index updates are idempotent. For simple indexes, that assumption holds, so the only issue is that index updates might end up duplicating work. (As in your example, we might write Bob’s eye color twice.) You can imagine indexes where that won’t work, though. For example, consider an index that counts how many times we see a value, i.e., a histogram index. (You might do this efficiently using our atomic mutation API.)
  3. If whatever is building the index dies midway through, the entire index build must be restarted.
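Here’s a minimal sketch of that loop, using the eye-color example from your post (Python bindings, made-up subspace layout, error handling omitted). Note that this version still has caveats (1)–(3): it assumes idempotent index writes and restarts from scratch if it dies.

```python
import fdb
import fdb.tuple

fdb.api_version(510)
db = fdb.open()

people = fdb.Subspace(('people',))
eye_color_index = fdb.Subspace(('eye_color',))

BATCH = 100  # keys per transaction; tune so each commit stays small and fast

@fdb.transactional
def build_chunk(tr, begin):
    # Read a limited batch of primary keys and write their index entries.
    last_key = None
    for kv in tr.get_range(begin, people.range().stop, limit=BATCH):
        person, field = people.unpack(kv.key)
        if field == 'eye_color':
            tr[eye_color_index.pack((kv.value.decode(), person))] = b''
        last_key = kv.key
    return last_key  # None means we ran off the end of the range

def build_index(db):
    cursor = people.range().start
    while True:
        last_key = build_chunk(db, cursor)
        if last_key is None:
            break
        # Resume just past the last data key the previous transaction processed.
        cursor = fdb.KeySelector.first_greater_than(last_key)
```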

To solve (2) and (3), you can be slightly more sophisticated: atomically write a key along with your index updates that records how far through the index build you have gotten. That way, if the index-building job dies partway through, it can resume from where it left off rather than from the beginning. Once that information is available, you can also use it to solve (2) by updating a non-idempotent, write-only index if and only if the key being added is within the built range. If it’s outside the built range, the index builder will pick it up. If it’s within the range, the initial write will index it. Either way, it is indexed exactly once.
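Roughly, with the histogram-style index from caveat (2) standing in as the non-idempotent case (again, key names are just illustrative):

```python
import struct

import fdb
import fdb.tuple

fdb.api_version(510)
db = fdb.open()

people = fdb.Subspace(('people',))
eye_color_counts = fdb.Subspace(('eye_color_counts',))
progress_key = fdb.tuple.pack(('index_progress', 'eye_color_counts'))

def index_one(tr, color):
    # Non-idempotent index update: counts how many people have each color.
    tr.add(eye_color_counts.pack((color,)), struct.pack('<q', 1))

@fdb.transactional
def get_resume_point(tr):
    # On (re)start, the progress key tells us where to pick the build back up.
    built_through = tr[progress_key]
    if built_through.present():
        return fdb.KeySelector.first_greater_than(built_through)
    return people.range().start

@fdb.transactional
def build_chunk(tr, begin):
    last_key = None
    for kv in tr.get_range(begin, people.range().stop, limit=100):
        _, field = people.unpack(kv.key)
        if field == 'eye_color':
            index_one(tr, kv.value.decode())
        last_key = kv.key
    if last_key is not None:
        # Written in the same transaction as the index updates, so the
        # recorded progress can never run ahead of or behind the build.
        tr[progress_key] = last_key
    return last_key

@fdb.transactional
def write_person_eye_color(tr, person, color):
    key = people.pack((person, 'eye_color'))
    tr[key] = color.encode()
    built_through = tr[progress_key]
    # Maintain the write-only index here only if this key is already inside
    # the built range; otherwise the index builder will pick it up later.
    if built_through.present() and key <= built_through:
        index_one(tr, color)
```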

So, the final question is how to implement parallelism. To do that, you can extend the technique above to write multiple keys instead of just one, recording which ranges of the main data set have been built. Then multiple index builders can work on different ranges at once and use those keys to avoid stepping on each other. The details are a little complicated, but that’s the general idea. If you know roughly the distribution of your data, you can pick boundaries to shard on. We also offer an API that can be used to find the boundary keys of the underlying shards. Note that the boundaries are not required for correctness, just performance, so once you’ve picked boundaries, you can keep using the same ones for the entirety of the build (or even change them at will, depending on how you keep track of the progress of each range), and things will just “work”.
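For instance, with the Python bindings you could seed the split points from the cluster’s current shard boundaries, something like the sketch below (again, purely a performance hint):

```python
import fdb
import fdb.locality

fdb.api_version(510)
db = fdb.open()

people = fdb.Subspace(('people',))

def plan_build_ranges(db):
    begin, end = people.range().start, people.range().stop
    # Ask the cluster where the underlying shard boundaries currently fall;
    # these make convenient (but not required) split points.
    boundaries = list(fdb.locality.get_boundary_keys(db, begin, end))
    points = sorted(set([begin] + boundaries + [end]))
    return list(zip(points[:-1], points[1:]))

# Each (begin, end) pair can then go to its own builder, which records its
# progress under its own per-range progress key, as in the single-builder case.
```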

As to your question about the client batching writes efficiently, note that all FDB writes are completely client-local until the transaction is committed, at which point all writes are batched together and submitted. Until then, all writes exist only in memory, so locality shouldn’t matter too much. (This is another reason–somewhat related to the 10 MB transaction size limit–that rebuilding an entire index in one transaction might not be possible.) Once the transaction is submitted, your writes will have to be split apart again and might end up on a variety of different storage servers, but that usually isn’t too bad.


Hi Brenton,

(1) The Blueprints-compatible FDB graph layer that you linked was more of a prototype/demo than a production-quality layer. IIRC, Mike McMahon was working on a FoundationDB storage engine for TitanDB at one point, and I believe got it to a higher level of maturity. I have no idea whether Titan is still a popular database or not, but either way I don’t think it would be too hard to write a proper graph layer for FDB. I’d be happy to help and/or advise on any such effort.

(2) Your question about indexes, and Alec’s thorough but slightly daunting answer, is one reason I’d like to resurrect the document layer (as discussed here). It had a pretty fast and sophisticated implementation of the procedure Alec describes, one that was mostly decoupled from the details of Mongo compatibility, and even from the document data model (IIRC, the indexing system only cared about hierarchical “fields” or “paths”, and had a pluggable type system).