The main mention of the multi-version client seems to be in the docs here.
I’m trying to understand exactly what happens, and when. We’ve got various systems that are currently loaded with two different FDB client versions (7.1 and 7.3), and every now and again we’ll see a `cluster_version_changed` error logged.
We mostly use `.read` and `.run`, but occasionally, for various reasons, we manually create a transaction and choose when to complete it. (We’re using the Java bindings to call FDB from Clojure, which has a lot of laziness; in some places we were dropping out of the ‘scope’ of the `.read`/`.run` before we’d actually realized the returned data, i.e. before the transaction had actually been executed.)
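To make the laziness problem concrete, here’s a toy, FDB-free sketch of the pattern we’ve moved to (`runTx` is a hypothetical stand-in for `.run`, not the real bindings API): realize the results into a concrete collection *inside* the body, so the caller never holds a lazy view that outlives the retry loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class EagerRealization {
    // Stand-in for db.run: in this simulation the first attempt always fails
    // with a retryable error and the body is re-run once.
    static <T> T runTx(Function<Integer, T> body) {
        for (int attempt = 0; ; attempt++) {
            try {
                T result = body.apply(attempt);
                // Simulated commit: attempt 0 hits a retryable error.
                if (attempt == 0) throw new RuntimeException("retryable: cluster_version_changed");
                return result;
            } catch (RuntimeException e) {
                if (attempt >= 1) throw e; // give up after one retry
            }
        }
    }

    public static void main(String[] args) {
        // Copy results into a concrete list inside the body, so whatever the
        // caller receives reflects the attempt that actually committed.
        List<String> rows = runTx(attempt -> {
            List<String> out = new ArrayList<>();
            out.add("row-from-attempt-" + attempt);
            return out;
        });
        System.out.println(rows.get(0)); // row-from-attempt-1
    }
}
```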
So my first question: is it possible to see `cluster_version_changed` returned as the final state from `.read`/`.run`, or is it always retried until some other state is returned, meaning all our logged instances must come from places where we manually create and finish transactions?
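Our working mental model of `.run` is a retry loop roughly like the sketch below (pure simulation, not the bindings’ code; our understanding is that `cluster_version_changed` is a retryable error, FDB error 1039, and the real loop delegates retry decisions to `Transaction.onError`). Under that model a retryable error only escapes to the caller if something like a retry limit or timeout is exhausted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntPredicate;

public class RetryLoop {
    // Simulated error code (the real cluster_version_changed is FDB error 1039).
    static final int CLUSTER_VERSION_CHANGED = 1039;

    // Sketch of the loop we believe run()/read() implement: retryable errors
    // are swallowed and the body is re-run; only a non-retryable error, or an
    // exhausted retry limit, surfaces to the caller.
    static String runWithRetries(IntPredicate isRetryable, int retryLimit, AtomicInteger attempts) {
        while (true) {
            int n = attempts.incrementAndGet();
            int err = (n < 3) ? CLUSTER_VERSION_CHANGED : 0; // succeed on 3rd try
            if (err == 0) return "committed on attempt " + n;
            if (!isRetryable.test(err) || n >= retryLimit) {
                throw new RuntimeException("error " + err + " surfaced to caller");
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        String out = runWithRetries(e -> e == CLUSTER_VERSION_CHANGED, 10, attempts);
        System.out.println(out); // committed on attempt 3
    }
}
```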
Secondly, how and when does the client decide which network thread is ‘correct’? We’re calling `.open` to get a connection right as our app starts up, but we might not create a transaction and call `.get` or `.set` until the first external request hits the JVM. My initial reading (and our initial testing) suggested that every transaction is sent to all version threads in parallel, on the expectation that one will succeed and the others will throw; so if we were setting some external in-memory state within the transaction body, before the first get/set, we would see it updated twice instead of only once.
But maybe that only happens on the first transaction after `.open`, and subsequent transactions use the one correct thread until `cluster_version_changed` is thrown again? Or is there some period after which all threads are retried? Or does the client pick one thread at random, with `.read`/`.run` simply trying another on `cluster_version_changed`, so a request is never run against multiple versions in parallel? If they do run in parallel, what determines which result you get back? I’d expect the failure case to return first nearly every time, since it has the least to do (fail on the first get/set), so is that result only returned if the ‘correct’ version thread times out or similar? Or will the transaction nearly always run twice: once on the correct thread initially, and once as a retry that knows which version to use after the incorrect thread returns `cluster_version_changed`?
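Whatever the client’s actual dispatch strategy turns out to be, the hazard we saw in testing can be simulated without FDB at all. This is a toy model, not the client’s real logic: the point is just that any side effect placed before the first get/set can fire once per loaded version.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SideEffectHazard {
    // Toy simulation: run the transaction body once per loaded client
    // version; wrong-version attempts fail at their first get/set, but
    // anything the body did *before* that point has already happened.
    static String runOncePerVersion(List<String> loadedVersions, String clusterVersion,
                                    AtomicInteger externalSideEffect) {
        String result = null;
        for (String v : loadedVersions) {
            externalSideEffect.incrementAndGet(); // side effect before any get/set
            if (v.equals(clusterVersion)) {
                result = "committed via " + v;
            }
            // else: that attempt would throw cluster_version_changed at first get/set
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger sideEffect = new AtomicInteger();
        String result = runOncePerVersion(List.of("7.1", "7.3"), "7.3", sideEffect);
        System.out.println(result);           // committed via 7.3
        System.out.println(sideEffect.get()); // 2, not 1
    }
}
```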
We’re trying to implement idempotency using something akin to the `atomic_idempotency` function from here, and to understand how that interacts with the multi-version client (given that the built-in feature explicitly doesn’t support it and isn’t production-ready yet), and whether there are any pitfalls there where we could get caught out.
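For reference, the generic shape of the idempotency-key pattern we’re considering looks roughly like this (`applyOnce` is a hypothetical name, not the `atomic_idempotency` helper, and an in-memory map stands in for FDB; in real FDB the token check, the mutation, and the token write would all share one transaction, so they commit atomically):

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotencySketch {
    // Stand-in for the database; keys and values are plain strings here.
    static final Map<String, String> db = new HashMap<>();

    // Within the same "transaction": check whether this operation's token is
    // already recorded, and only mutate (and record the token) if it is not.
    // A retry of a transaction that actually committed becomes a no-op.
    static boolean applyOnce(String idempotencyKey, String dataKey, String value) {
        String tokenKey = "idempotency/" + idempotencyKey;
        if (db.containsKey(tokenKey)) {
            return false; // already applied by an earlier attempt
        }
        db.put(dataKey, value);   // the real mutation
        db.put(tokenKey, "done"); // recorded atomically with it in real FDB
        return true;
    }

    public static void main(String[] args) {
        boolean first = applyOnce("op-1", "counter", "1");
        boolean retry = applyOnce("op-1", "counter", "1"); // simulated retry
        System.out.println(first + " " + retry); // true false
    }
}
```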
In a lot of our code this is a non-issue: the transaction removes an item from a queue and processes it, adding more data to the DB as a result. So if there is any sort of conflict or error and retry, and the transaction did actually succeed somewhere, the item will no longer be on the queue and something else will be pulled for processing. This only matters in specific edge cases where we’re adding idempotency on top of transactions that don’t inherently have it built in.
At the moment our assumptions are that:
- It is not possible for a network thread of the wrong version to get ‘past’ an FDB get/set call in a transaction; it will always throw a `cluster_version_changed` error at that point at the latest.
- When using the multi-version client, it’s more likely that a transaction will run against the cluster multiple times (first in parallel with ‘bad’ versions, then on the known-good version).