Lucene-Layer on FoundationDB

(Tony Sun) #1

Hi Folks,

I’m part of a team in Cloudant that’s been researching solutions for a secondary text index that would accompany the primary CouchDB layer. We are using the lucene-layer project as a basis for this investigation. That project uses two approaches to create a layer on top of FDB:

  1. Write lucene index file segments as binaries in K/V chunks
  2. Implement a new set of lucene codecs specifically to divide the segment data structures into K/V pairs and store them in FDB
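Approach 1 can be sketched in plain Java; this is an illustrative sketch only, not the lucene-layer code — the 16KB chunk size, key format, and the `TreeMap` standing in for an FDB subspace are all assumptions. The real layer would pack the key with the Tuple layer and write each chunk with `tr.set()`.

```java
import java.util.*;

// Sketch of Approach 1: store a segment file's bytes as fixed-size chunks
// under (fileName, chunkIndex) keys. A TreeMap stands in for an FDB subspace.
public class SegmentChunker {
    static final int CHUNK_SIZE = 16 * 1024; // assumed chunk size

    static SortedMap<String, byte[]> chunk(String fileName, byte[] data) {
        SortedMap<String, byte[]> kvs = new TreeMap<>();
        for (int i = 0, idx = 0; i < data.length; i += CHUNK_SIZE, idx++) {
            byte[] value = Arrays.copyOfRange(data, i, Math.min(i + CHUNK_SIZE, data.length));
            // Zero-padded index keeps lexicographic key order == chunk order,
            // so a range read returns the file's bytes sequentially.
            kvs.put(String.format("%s/%010d", fileName, idx), value);
        }
        return kvs;
    }

    static byte[] reassemble(SortedMap<String, byte[]> kvs) {
        int total = kvs.values().stream().mapToInt(v -> v.length).sum();
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] v : kvs.values()) {
            System.arraycopy(v, 0, out, pos, v.length);
            pos += v.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] segment = new byte[40_000];
        new Random(42).nextBytes(segment);
        SortedMap<String, byte[]> kvs = chunk("_0.cfs", segment);
        System.out.println("chunks=" + kvs.size());                              // 3 chunks of <=16KB
        System.out.println("roundtrip=" + Arrays.equals(segment, reassemble(kvs)));
    }
}
```

A range read over the `fileName` prefix recovers the segment in order, which is why sequential (rather than randomized) keys are a natural fit here.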

We modified lucene’s nightly benchmark to establish a baseline for Approach 2. Note that the tests were run locally on a development machine. The first attempt did not work because the benchmark keeps an IndexWriter open for the duration of indexing. The lucene-layer codec uses one transaction per IndexWriter, and thus one transaction for the duration of indexing. Inevitably, the benchmark hits the 5-second timeout limit per transaction. To get past this bottleneck, we modified the code to commit a transaction for every postings list of every term. As expected, performance was terrible, since we were doing bulk commits serially. This was also verified by running the java bindings performance tester.

We ran the same benchmark using only binary chunks (Approach 1), and the results were 10 times faster, but still relatively slow compared to a normal lucene codec -> disk implementation.

The lucene-layer implementation has some issues that did not allow us to modify the chunk size, so we’re rewriting a bare-bones version of Approach 1 in an attempt to find an optimal chunk size per transaction. If our implementation is successful and we see performance near lucene-to-disk times, then we’ll consider re-implementing the codec layer to further enhance performance.

My questions are:

  1. During indexing, and from an fdb-java bindings perspective, we’re basically doing a massive bulk insert at scale. Reading through Best practices for bulk load, it mentions that “the client library is single-threaded.” My experimental change from db.run(…) to db.runAsync(…) yielded no performance gains, which supports that idea. Correct me if I’m wrong here, but I don’t think we can randomize the keys in this instance, as we need to read the segment data sequentially. Can runAsync be used to perform indexing (bulk inserts) asynchronously?

  2. The goal of the investigation is to find out whether a lucene-on-fdb solution is feasible. Does anyone see any theoretical or architectural limitations that we’re not accounting for?

(Alex Miller) #2

I would fully expect that calling db.runAsync() and letting the transaction execute asynchronously in the background should result in better write throughput. A local test for me shows that running 1000 sequential transactions, each setting an increasing integer key to a 1KB value, takes 9.95s, implying my local fsync() takes ~10ms. Converting that to run asynchronously on a thread pool causes it to take 1.296s. So I’m very surprised to hear that converting run to runAsync didn’t help, and I’d suggest taking a second look at the code to make sure you didn’t accidentally re-serialize the writes somehow. And to be fair, lucene definitely isn’t calling fsync() after each block write, as the sequential series of transactions is doing.
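The sequential-vs-pipelined effect described above can be sketched without a cluster by simulating the per-commit fsync() wait; this is an illustrative sketch, not the actual fdb-java calls — the 10ms sleep and the thread-pool shape are assumptions standing in for db.run()/db.runAsync():

```java
import java.util.concurrent.*;

// Why pipelining commits helps: each sequential commit pays the full
// (simulated) fsync latency, while commits issued concurrently overlap it.
public class PipelineDemo {
    // Stand-in for one transaction commit waiting on a ~10ms fsync().
    static void simulateCommit() {
        try { Thread.sleep(10); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    static long sequentialMs(int n) {
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) simulateCommit(); // one commit at a time
        return (System.nanoTime() - t0) / 1_000_000;
    }

    static long pipelinedMs(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        long t0 = System.nanoTime();
        CompletableFuture<?>[] fs = new CompletableFuture<?>[n];
        for (int i = 0; i < n; i++)
            fs[i] = CompletableFuture.runAsync(PipelineDemo::simulateCommit, pool);
        CompletableFuture.allOf(fs).join(); // wait for all in-flight commits
        long ms = (System.nanoTime() - t0) / 1_000_000;
        pool.shutdown();
        return ms;
    }

    public static void main(String[] args) {
        System.out.println("sequential ms: " + sequentialMs(20)); // ~n * latency
        System.out.println("pipelined  ms: " + pipelinedMs(20));  // ~1 * latency
    }
}
```

If a port to runAsync shows no speedup, the futures are probably being joined one at a time (re-serializing the writes) instead of being collected and awaited together as above.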

With “Note the tests are done locally on a development machine”, do you mean that you’re running a benchmark against the default created FoundationDB instance? As you might have noticed from the numbers above, the performance is terrible. With the FDB activity going on around Cloudant, is there anyone with a small FDB cluster that you could borrow a subspace on? Our performance characteristics get much better as soon as one can separate the transaction logs from the storage servers, and being closer to something that resembles a production cluster is generally nice when doing performance work.

The importance of the single network thread is that after a certain point, you can’t gain more parallelism in issuing requests to FDB by adding more threads within your JVM. All of your JVM threads will still be submitting requests to the one FDB network thread, and once you saturate the FDB network thread, that’s the maximum throughput you’re going to get. I would strongly suspect that you aren’t going to saturate the FDB network thread from a single threaded lucene though. It’s mostly just a thread doing a memcpy and send/recv as far as I’m aware.

I’d ballpark the “optimal chunk size per transaction” as around 50KB-100KB per commit. Our general advice is to stay under 1MB in total for commits, and FDB will batch transactions together when it can to accept more. I agree that in your use case you’ll be inserting data sequentially, and that should be okay, though bulk loading an empty cluster sequentially puts a lot of strain on data distribution as it tries to carve out sensible shards.

And tangentially, I’ve taken this opportunity to finally go file an issue to track optimizing shard splitting for sequential insert workloads #1322.

Overall, full text indexes backed by FDB seem feasible, but I unfortunately don’t know enough about the particulars of lucene to be able to comment about it specifically. The FoundationDB Record Layer provides a full-text index on FDB. The authors of that would be able to speak to this better than I, so I’ll leave it to @alloc and @nathanlws. The former has a previous post on the subject of the full text indexing that you might find interesting. The latter you’ll recognize as also the author of the lucene layer you’re using.

(Tony Sun) #3

Appreciate the detailed response. I’m currently testing the new implementation on a 6-machine cluster. I’ll also be testing the parameters you advised. I’ll update this thread with any important findings.

(Nathan Williams) #4

Unfortunately, there’s not much more that is promising to say about the Lucene Layer.

It was a week-long experiment in how an integration might look, but otherwise no one who really knew anything about Lucene was involved with it. I’m sure the implementation is rather slow (the mapping favors readability over any emphasis on efficiency), it likely has correctness issues (not all tests passed), and it hasn’t been updated in 5+ years.

@MMcM may be able to shed more light on the Record Layer text support, which is much more recent and actively being maintained.

(Alec Grieser) #5

As to Record Layer’s support for text indexes, most of our docs are, I think, in the TextIndexMaintainer, though it also looks like there are some Javadoc rendering bugs on that page (at least in the current version) that someone should probably fix (and expand upon).

It has basic search capability with a pluggable serializer. It’s more optimized for maintaining many little text indexes (e.g., many indexes over relatively small corpora instead of a relatively small number of indexes over a very large corpus), so it doesn’t do things like cache results as aggressively as other text search solutions, but it does offer transactional updates with documents inserted into the record store, which is nice. (Note also that updates are likely to conflict, so it’s not super great if you need to support adding a bunch of documents in parallel.)

I can answer specific questions about the capabilities of the text index if there are any.