Optimizing a single large transaction (10,000+ keys)

I have a use case to optimize: a single large transaction that adds/updates/deletes more than 10K keys. Values are about 1 KB, with an upper bound of 32 KB. Keys are typically 64 bytes. The biggest problem is the number of keys that are part of the transaction.

The bottleneck is mostly the business logic involved, but I am trying to figure out the best way to optimize it.

Business Logic - For each key, look up the previous instance from the DB (if it exists) and compare the update count between the old and new object. The typical flow is that the client does plenty of business processing first, so the old reads could be from a different read version. So we need to do a second-level check before writing to the database.

Transaction Mutex - For certain types of transactions we need to serialize the writes, so we use a global mutex which gets updated on every write. This is how we serialize across multiple clients.

The whole transaction is run by a single thread in a serial fashion.
Each key is stored twice in the database: once with a very small piece of metadata and once with the actual value. The metadata, which includes the update count, is what the checker pulls.
The keys are not ordered, so I can't do a range lookup to fetch them.
It's common to have 100 million keys in a production database.

Measured latency - The end-to-end latency observed is about 2.5 seconds for 10K keys. Of that, about 1.5 seconds is taken by the checker function, which looks up the previous value of each new key.

Questions -

  1. In the Java binding, is the transaction object thread-safe? My idea is to spin up multiple threads sharing the transaction object, each working on a subset of keys in parallel. If it's not thread-safe, I'll create multiple transactions, set the same read version for lookups, and run the writes in the main transaction.
  2. 99% of the keys will be inserts - which means the key shouldn't already exist in the database. But I still have to look up each key to be 100% certain no one has created it as part of a different transaction.
    Question - Is looking up a non-existent key more expensive than looking up a recently written key?
    Is there any technique I can use to distinguish between create and update in an optimal way?
  3. I suspect that since we read a very large key set as part of the same transaction, the resolver has to validate a large set of keys for read conflicts. Is my understanding correct? Is there any way to optimize this?
  4. Will disabling the RYW cache improve write performance?

Thanks for your help.

Note that there is a 10 MB limit on the size of FDB transactions. While the exact calculation is somewhat complicated (it includes the conflict ranges from reads and writes, more-or-less), at a minimum it must include all keys and values written in the transaction (and all key ranges cleared - just the boundary keys, no values, for clears). If you are updating 10k keys with an average value size of 1 kB, that alone is enough to hit the 10 MB limit. We also generally recommend keeping transactions under 1 MB if possible, as performance can degrade beyond that point. So you may need to work around that limitation to make this work. (How exactly you do that depends on your use case.)

See: https://apple.github.io/foundationdb/known-limitations.html#large-transactions
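One common workaround is to chop the 10k-key workload into several transactions, each kept under a size budget. This is only a sketch (the `BatchSplitter` name and the key-plus-value sizing rule are made up for illustration; real accounting would also include conflict-range overhead):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BatchSplitter {
    // Split writes into batches whose approximate serialized size stays
    // under a budget. Size is approximated as key length + value length;
    // a real implementation would also account for conflict ranges.
    static List<List<Map.Entry<byte[], byte[]>>> split(
            List<Map.Entry<byte[], byte[]>> writes, int budgetBytes) {
        List<List<Map.Entry<byte[], byte[]>>> batches = new ArrayList<>();
        List<Map.Entry<byte[], byte[]>> current = new ArrayList<>();
        int currentSize = 0;
        for (Map.Entry<byte[], byte[]> w : writes) {
            int size = w.getKey().length + w.getValue().length;
            if (!current.isEmpty() && currentSize + size > budgetBytes) {
                batches.add(current);       // close the full batch
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(w);
            currentSize += size;
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }
}
```

With 64-byte keys and ~1 KB values, a 1 MB budget yields roughly a dozen batches for 10k keys; each batch would then be committed as its own transaction (losing atomicity across batches, which may or may not be acceptable for your use case).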

Yes, the transaction object is thread-safe (modulo bugs). There are a few caveats to note, though. The first is that the FDB C client (which the Java bindings wrap) is itself single-threaded, and most client operations just push a task onto a queue which the C client consumes later. This means that multiple threads sharing one transaction should have about the same performance (ish) as a single thread issuing the same operations. The second is that all I/O to and from the database servers is asynchronous, so you can take advantage of parallelism even with the single-threaded architecture.
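To make that concrete, the usual way to exploit that asynchrony from one thread is to issue all the gets before joining any of them. In the real binding, `tr.get` returns a `CompletableFuture<byte[]>`; the sketch below substitutes a caller-supplied lookup function so it runs without a cluster:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class ParallelReads {
    // Issue all reads up front, then join: the futures overlap in flight,
    // so total latency is closer to one round trip than to N of them.
    // With the real Java binding, `lookup` would be tr::get, which also
    // returns CompletableFuture<byte[]>.
    static List<byte[]> readAll(Function<byte[], CompletableFuture<byte[]>> lookup,
                                List<byte[]> keys) {
        List<CompletableFuture<byte[]>> futures = new ArrayList<>();
        for (byte[] k : keys) {
            futures.add(lookup.apply(k));   // fire, don't wait
        }
        List<byte[]> results = new ArrayList<>();
        for (CompletableFuture<byte[]> f : futures) {
            results.add(f.join());          // wait only after everything is in flight
        }
        return results;
    }
}
```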

I’m not 100% sure, but possibly. Note that recently written keys are likely to be in memory on the database servers, whereas that is less likely for non-existent keys. (FDB keeps all recent mutations in memory for at least 5 seconds, and after that the pages are still likely to be in the page cache. That said, the pages that would contain a non-existent key might also be in cache, so there isn’t a guarantee that such a read will be entirely out of cache.) I would expect an out-of-cache read of an existing key and an out-of-cache read of a non-existent key to take about the same amount of time.

It’s possible I’m wrong about this, and someone with more knowledge about the storage engines would have more informed opinions.

Purely relying on FDB, I don’t think so (like, there aren’t efficient “create if absent” primitives). There is possibly a way to do this at, like, the “layer” level (e.g., storing in one key information about what other keys are written), though those will all have trade-offs (e.g., write amplification from maintaining that index).

Yes, the resolver would have to check each key (ish) for conflicts (so your understanding is correct). One trick you can play is to add one single conflict range over all your data so that the resolver only needs to check it once. (For example, over the whole database with a conflict range from the empty byte array to \xff, or, if all of your data live in a single subspace, over subspace.range().) Then each read, rather than adding a new conflict range, will just notice that its read range is already contained in the extant read conflict range. This has the disadvantage that your transaction can fail with a conflict if anything in that range changes (including a key you didn’t actually read), but it has the advantage of minimizing the number of conflict ranges (and therefore resolver work).

You can play a similar trick with writes (adding a single write conflict range containing all data you might write).

How much this buys you and whether it’s worth it may require some tinkering. We actually do this in the Record Layer with some of our text indexes to minimize resolver load, but it may or may not help your use case.
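For the subspace case, the end key of such a range is the first key that no longer carries the prefix (FDB calls this "strinc"). A minimal sketch of that computation, with the actual binding call shown only in a comment since it needs a live transaction:

```java
import java.util.Arrays;

public class ConflictRangeKeys {
    // "strinc": the first key strictly greater than every key with this
    // prefix - drop any trailing 0xff bytes, then increment the last byte.
    // With begin = prefix and end = strinc(prefix), a single call like
    //     tr.addReadConflictRange(begin, end)   // FDB Java binding
    // covers the whole subspace in one conflict range.
    static byte[] strinc(byte[] prefix) {
        int i = prefix.length - 1;
        while (i >= 0 && (prefix[i] & 0xff) == 0xff) i--;
        if (i < 0) throw new IllegalArgumentException("key is all 0xff bytes");
        byte[] end = Arrays.copyOf(prefix, i + 1);
        end[i]++;
        return end;
    }
}
```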

I wouldn’t expect it to make much of a difference, but I could be wrong. I think we had performance tests at one point, and I think the answer was that it might help a little, but not too much. Note that it can actually cause performance problems, as the RYW cache also controls (1) the read cache (reading the same key twice in the same transaction, with RYW enabled, reads it from memory the second time; with RYW disabled, it goes to the DB each time) and (2) mutation coalescing (writing the same key twice in the same transaction, with RYW enabled, produces a single mutation in the final transaction; with RYW disabled, there is one mutation for each time you write the key). Depending on your use case, those may or may not be optimizations you’re relying on. And then there are also possible serializability problems if you read a key you wrote and don’t see your own write.
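The coalescing behavior in (2) can be pictured as a last-write-wins buffer. This toy model (not the binding's actual implementation) just counts mutations both ways:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RywWriteBuffer {
    // With RYW on, repeated sets to the same key collapse into one mutation
    // (last write wins); with RYW off, every set becomes its own mutation.
    private final Map<String, String> coalesced = new LinkedHashMap<>();
    private int rawMutationCount = 0;

    void set(String key, String value) {
        coalesced.put(key, value);  // overwrite = coalesce
        rawMutationCount++;         // what would be sent with RYW disabled
    }

    int mutationsWithRyw()    { return coalesced.size(); }
    int mutationsWithoutRyw() { return rawMutationCount; }
    String read(String key)   { return coalesced.get(key); }  // RYW read cache
}
```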

Thanks for the detailed answer. It is very useful.

Bigger transactions will be close to that limit; at the moment we are running with about a 15 MB transaction size. We haven't noticed much of a performance difference between 10 MB and 15 MB. Also, this is an extreme scenario; normal loads are about 1 MB transactions.

There is a bit of processing involved after every key is read, so processing in multiple threads actually helps us.
Having said that, I see performance double when I use 2 threads, and beyond that point parallelism doesn't help. This is most likely the processing time in our layer rather than anything to do with the fdb driver itself.

I think it would be super useful to have such a feature (at least for us). I don't know what others think of this. If there is interest, I can raise a feature request for it.

I think if I can get this working, it will save a lot of processing time in the resolver. I will give it a go; thanks for the suggestion.

One more random question: is there a way to disable read-conflict checks for a specific transaction? In some cases write conflicts are good enough …

Please see if this helps: https://apple.github.io/foundationdb/developer-guide.html#snapshot-reads

This is exactly what we need. Thanks.

Happy to help!

Oh, are you playing with some of the knobs to make that happen? I suppose that’s a thing you can do, though one should always be careful when fiddling with those, as changing their values can sometimes degrade performance (or sometimes correctness (!!)). Large transactions like that, for example, can cause SlowTasks on the proxy (and maybe log) processes.

I think one could add this through the atomic mutation framework that we already use for things like “increment a value by some amount”, so I don’t think this feature is incompatible with FDB’s architecture. So, if you were to request the feature, I think it could be done (if there is someone with the time to do it).

If you perform all of your reads at “snapshot” isolation level (e.g., call tr.snapshot() to get a read-only view of the transaction), then none of the reads will add read conflict ranges. This is only a good idea if you are sure you really only need the write conflicts, but it’s a thing you can do.

Just to be clear, FDB doesn’t abort on write-write conflicts. If you have two FDB transactions that, in time order, do:

T1:    Write(X)                Commit()
T2:                Write(X)                 Commit()

Then both T1 and T2 will successfully commit.

Yes, we are tweaking the knobs to cater for bigger transactions. Transactions bigger than 10 MB are outliers and shouldn't happen often. We are OK with things being slow for bigger transactions, but I would assume correctness will not be compromised?

If correctness is compromised, then the knob should not even be exposed, in my opinion.

Thanks for the feedback. I will add a feature request.

Not sure if I fully understand this example. You mean if T2.ReadVersion <= T1.ReadVersion && T1.commit() succeeds, then T2 will not result in a write conflict (or at least gets retried)?

We will be fine if the transaction gets retried. But minimally, a retry should trigger.

For a transaction to be conflicted it must satisfy two things:

(a) It must have a read-conflict range - either added implicitly when doing a read via get/get_range, or added explicitly.
(b) AND it must have a mutation (set or clear).

The transaction is said to be conflicted if, at commit() time, any of its read keys may have been modified since the getReadVersion of the transaction.

If either of the two properties is not satisfied, then the transaction cannot be conflicted (or invalidated) by other transactions.

In the example that Alex gave above, property (a) is not satisfied.

Similarly, a read-only transaction does not satisfy property (b) and thus cannot be conflicted (underneath, such a transaction does not even make a commit() call, as there is nothing to commit; thus no conflict checking is possible for these).
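The conditions above can be written down as a tiny predicate. This is only a model of the stated rules, not resolver code; the third argument stands for the commit-time check that something in the read set may have changed since the read version:

```java
public class ConflictCheck {
    // A transaction is aborted with a conflict only if it (a) has a read
    // conflict range, (b) has a mutation (set or clear), and (c) something
    // in its read set may have been modified between its read version and
    // commit time.
    static boolean conflicts(boolean hasReadConflictRange,
                             boolean hasMutation,
                             boolean readSetModifiedSinceReadVersion) {
        return hasReadConflictRange && hasMutation && readSetModifiedSinceReadVersion;
    }
}
```

Under this model, the blind-write T1/T2 example above cannot conflict (no read conflict range), and a read-only transaction cannot conflict (no mutation).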

I got it. so

T2 { Read(x); Write(x); Commit(); } // conflict, since x is read and then updated

will result in a conflict. That makes sense and we are fine with it.

Yes, that is correct. But note that at the snapshot isolation level (which you asked about earlier), a read R(x) will not implicitly add a read conflict to the transaction, and so the transaction will not conflict. You can either not use snapshot reads, or use them but add a read conflict explicitly, as demonstrated in the doc I linked earlier.