I’m part of a team at Cloudant that’s been researching solutions for a secondary text index to accompany the primary CouchDB layer. We are using https://github.com/AydinSakar/lucene-layer as the basis for this investigation. That project uses two approaches to create a layer on top of FDB:
- Write Lucene index file segments as binaries in K/V chunks
- Implement a new set of Lucene codecs specifically to divide the segment data structures into K/V pairs and store them in FDB
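To make Approach 2 concrete, a codec along these lines maps each term’s postings to individual keys. Here is a hedged sketch of one possible key layout; the shape is our assumption for illustration, not necessarily what lucene-layer actually does, and real code would use the bindings’ Tuple layer for packing:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical Approach-2 key layout: one key per (field, term, docId), with
// the value holding e.g. the term frequency (positions/payloads would need
// further keys). A stand-in for something like
// Tuple.from("postings", field, term, docId).pack() in the real bindings.
class PostingsKey {
    public static byte[] pack(String field, String term, long docId) {
        String key = "postings/" + field + "/" + term + "/" + docId;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```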
We modified Lucene’s nightly benchmark to establish a baseline for Approach 2. Note that the tests were run locally on a development machine. The first attempt did not work because the benchmark keeps an IndexWriter open for the duration of indexing, and the lucene-layer codec uses one transaction per IndexWriter, i.e. one transaction for the entire indexing run. Inevitably, the benchmark hits the 5-second timeout limit per transaction. To get past this bottleneck, we modified the code to execute db.run(…) for every postings list of every term. As expected, performance was terrible, since we’re doing bulk commits serially; this was also confirmed by running the Java bindings performance tester.
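A middle ground between one giant transaction and one transaction per postings list is to group writes into byte-budgeted batches, each committed in its own transaction. A bare-bones sketch of what we mean (the class and method names are ours; with the real bindings each batch would be committed via its own db.run(tr -> { … tr.set(key, value); … })):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a stream of key/value writes into batches
// bounded by an approximate byte budget, so each batch fits comfortably
// inside FDB's 5-second / 10 MB transaction limits instead of holding one
// transaction open for the whole indexing run.
class WriteBatcher {
    // Each element of kvs is a {key, value} pair of byte arrays.
    public static List<List<byte[][]>> partition(List<byte[][]> kvs, int maxBatchBytes) {
        List<List<byte[][]>> batches = new ArrayList<>();
        List<byte[][]> current = new ArrayList<>();
        int currentBytes = 0;
        for (byte[][] kv : kvs) {
            int size = kv[0].length + kv[1].length;
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(kv);
            currentBytes += size;
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```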
We ran the same benchmark using only binary chunks (Approach 1), and the results were 10 times faster, but still relatively slow compared to a normal Lucene codec -> disk implementation.
The lucene-layer implementation has some issues that prevented us from modifying the chunk size, so we’re writing a bare-bones version of Approach 1 in an attempt to find an optimal transaction-to-chunk-size ratio. If our implementation is successful and we see performance near Lucene->disk times, we’ll consider re-implementing the codec layer to improve performance further.
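The chunking itself is straightforward; here is a bare-bones sketch of Approach 1 with a tunable chunk size (the key layout described in the comment is our assumption, not lucene-layer’s actual scheme):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Approach-1 chunker: split a segment file's contents into
// fixed-size byte chunks. Each chunk would then be stored under a key such
// as (indexSubspace, fileName, chunkIndex), e.g. via the Tuple layer, so the
// chunk size can be swept to find the best transaction-to-chunk-size ratio.
class SegmentChunker {
    public static List<byte[]> chunk(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, off, end));
        }
        return chunks;
    }
}
```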
My questions are:
During indexing, from the fdb-java bindings’ perspective, we’re essentially doing a massive bulk insert at scale. The “Best practices for bulk load” documentation mentions that “the client library is single-threaded.” My experimental change from db.run(…) to db.runAsync(…) yielded no performance gains, which supports that statement. Correct me if I’m wrong here, but I don’t think we can randomize the keys in this case, since we need to read the segment data sequentially. Can runAsync be used to perform indexing (bulk inserts) asynchronously?
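What we had in mind is keeping several commits in flight at once, so commit latencies overlap instead of accumulating serially; swapping run for runAsync but still awaiting each commit before starting the next would gain nothing, which may explain our result. A self-contained sketch of the pipelining pattern we’re asking about (commitBatch here is a placeholder for a real db.runAsync(tr -> …) call, and the window size is an arbitrary guess):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Pipelined commits: start up to `window` commits without awaiting each one,
// joining only the oldest future before issuing more. With the real bindings,
// commitBatch would wrap db.runAsync(tr -> { ... tr.set(...); ... }).
class PipelinedLoader {
    public static final AtomicInteger committed = new AtomicInteger();

    // Placeholder for an asynchronous FDB commit of one write batch.
    static CompletableFuture<Void> commitBatch(int batchId) {
        committed.incrementAndGet();
        return CompletableFuture.completedFuture(null);
    }

    public static void load(int totalBatches, int window) {
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (int i = 0; i < totalBatches; i++) {
            inFlight.add(commitBatch(i));
            if (inFlight.size() >= window) {
                // Wait for the oldest commit before issuing more work.
                inFlight.remove(0).join();
            }
        }
        // Drain any remaining in-flight commits.
        inFlight.forEach(CompletableFuture::join);
    }
}
```

The window bounds memory and outstanding work; whether it helps in practice presumably depends on how much of the cost is commit latency versus the single network thread.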
The goal of the investigation is to determine whether a Lucene-on-FDB solution is feasible. Does anyone see any theoretical or architectural limitations that we’re not accounting for?