FoundationDB

VersionStamp vs CommittedVersion


(gaurav) #1

[EDIT: trying to go over the discussion at Implementing VersionStamps in bindings to see if it provides answers]

Hi,

What is the relation between values of VersionStamp and a CommittedVersion? More specifically:

A VersionStamp is 12 bytes long: the first 10 bytes contain the "transaction" version and the last 2 bytes contain the "user" version. What do the first 10 bytes actually represent? I am a bit confused here because the getCommittedVersion() API returns a Long which also, seemingly, represents the transaction version number; in that case, what do the extra 2 bytes (out of the 10-byte “transaction” version in VersionStamp) represent?

–
thanks,
gaurav


(gaurav) #2

I read through the linked discussion but still have a few questions:

  • Is 8-byte Transaction Version unique across transactions?
  • What is the significance of the Transaction Batch Order (the last 2 bytes of the 10 bytes that are stored on the server), and how is it computed? I suspect that it provides some ordering for multiple writes happening in the same transaction; is that correct?
  • Is the Transaction Batch Order related to user version that Java client API exposes? How?
  • It seems like the user version is only a client-side concept. How is it used to determine the final 10 bytes stored on the server?

My use case is that I am trying to write an API to generate a unique long id for a given key: long getOrCreateId(String guidKey)

I am trying to use the VersionStamp’s first 8 bytes to get the Transaction Version, which I am assuming to be a globally unique id.

CompletableFuture<byte[]> ver = db.run(tx -> {
    final byte[] guidBytes = Tuple.from("key").pack();
    {
        // string->id
        final byte[] v = Tuple.from(Versionstamp.incomplete()).packWithVersionstamp();
        tx.mutate(MutationType.SET_VERSIONSTAMPED_VALUE, guidBytes, v);
    }
    {
        // id->string
        final byte[] k = new Tuple().add(Versionstamp.incomplete()).packWithVersionstamp();
        tx.mutate(MutationType.SET_VERSIONSTAMPED_KEY, k, guidBytes);
    }
    return tx.getVersionstamp();
});

ver.thenApply(versionStamp -> {
    // Read the first 8 bytes (the transaction version) as a big-endian long.
    return ByteBuffer.wrap(versionStamp, 0, Long.BYTES).order(ByteOrder.BIG_ENDIAN).getLong();
});

At some later point, if I only have the “Transaction Version” (8 bytes) available, can I create a key using it to lookup the row in “id->string” index? In other words, I do not have a need for the last “2 bytes” of server stored VersionStamp; can I force it to be 0 (for above write transaction)?

–
thanks,
gaurav


(Christophe Chevalier) #3

The user version is a purely client-side convention, and the cluster only knows about the first 10 bytes. The next two bytes are managed by the client (with a local counter per transaction), but you could use your own convention in your application to be able to insert more than one versionstamped key in the same transaction.

Most bindings offer a 96-bit VersionStamp as a convenience if you don’t want to reinvent the wheel, but under the hood, the only thing seen by the cluster is the first 80 bits. In practice, the binding will send 10 x 0xFF bytes for the actual 80 bits, followed by the 2 bytes for the user version (but it could have been any number of bytes).
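To make the layout described above concrete, here is a minimal sketch in plain Java of a 12-byte incomplete stamp: ten 0xFF placeholder bytes followed by a 2-byte big-endian user version. The class name and the per-transaction counter are hypothetical and only illustrate the convention; this is not the code of any actual binding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Illustrative only: mimics how a binding might lay out a 96-bit incomplete stamp. */
final class IncompleteStampSketch {
    static final int SERVER_BYTES = 10; // placeholder, to be filled in by the cluster at commit
    static final int USER_BYTES = 2;    // client-side user version

    private int nextUserVersion = 0;    // hypothetical local counter, one per transaction

    /** Returns 10 x 0xFF followed by a 2-byte big-endian user version. */
    byte[] nextIncomplete() {
        if (nextUserVersion > 0xFFFF) {
            throw new IllegalStateException("user version exhausted for this transaction");
        }
        byte[] out = new byte[SERVER_BYTES + USER_BYTES];
        Arrays.fill(out, 0, SERVER_BYTES, (byte) 0xFF);
        ByteBuffer.wrap(out, SERVER_BYTES, USER_BYTES).putShort((short) nextUserVersion++);
        return out;
    }

    public static void main(String[] args) {
        IncompleteStampSketch tx = new IncompleteStampSketch();
        System.out.println(Arrays.toString(tx.nextIncomplete()));
        System.out.println(Arrays.toString(tx.nextIncomplete()));
    }
}
```

Each call hands out a distinct user version, so several versionstamped keys written in one transaction would still differ in their last two bytes.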

I don’t think that the Java binding will allow you to only use 80-bit stamps, but this is an API decision at the binding layer. The .NET binding allows you to use both types, for example.

I don’t think the assumption that the first 8 bytes are globally unique is true, especially under load if the cluster decides to merge transactions. That’s probably what the batch order is for. So you should use the whole 10 bytes as the global id to be 100% correct.

If you do some local testing, and always see the batch order equal to 0, it is probably because you are not able to generate enough concurrent load to trigger merging of transactions?
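A small sketch of why truncating to 8 bytes can be lossy, assuming the 10 bytes are an 8-byte commit version followed by a 2-byte batch order, both big-endian. The class, the helper names, and the sample numbers are made up for illustration; nothing here talks to a real cluster.

```java
import java.nio.ByteBuffer;

/** Illustrative only: shows that an 8-byte truncation can collide where the 10-byte stamp does not. */
final class StampIdSketch {
    /** Builds a 10-byte stamp: 8-byte commit version + 2-byte batch order, big-endian. */
    static byte[] stamp(long commitVersion, int batchOrder) {
        return ByteBuffer.allocate(10).putLong(commitVersion).putShort((short) batchOrder).array();
    }

    /** Lossy id: keeps only the 8-byte commit version. */
    static long truncatedId(byte[] stamp) {
        return ByteBuffer.wrap(stamp, 0, Long.BYTES).getLong();
    }

    /** Unsigned lexicographic comparison, the order in which raw byte keys sort. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] first = stamp(1000L, 0);  // two transactions committed in the same batch
        byte[] second = stamp(1000L, 1);
        System.out.println(truncatedId(first) == truncatedId(second)); // same 8-byte prefix
        System.out.println(compareUnsigned(first, second) < 0);        // still ordered as 10 bytes
    }
}
```

If the id has to fit in a long, the batch order is exactly the information that gets dropped, so the full 10 bytes are the safer key.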

By the way: there is no guarantee that version stamps will always go up: if you are restoring from a backup after completely reinstalling a new cluster, it may be possible that the read version starts again from 0… The conditions to make this happen may be improbable, but not impossible!


(gaurav) #4

Thank you Christophe.

I ran a few more experiments after it and I think things are a bit more clear to me now. I will put the summary here for others to review:

  • FDB Server (core-backend) understands the version_stamp as a 10 byte value. All of these 10 bytes are determined by the server. Together these 10 bytes are guaranteed to be unique and monotonic (no guarantees on uniqueness on first 8 bytes of these 10).
  • The Java (and Python) client bindings provide a 12-byte VersionStamp class that is made up of a 10-byte placeholder for the “real” (server-controlled) version_stamp, plus 2 more user-supplied bytes (arbitrary values): the user version.
  • The user version value makes it to the server (i.e. it is persisted) and is appended directly AFTER the 10-byte version_stamp. I have not yet done the last part of the experiment, but I am almost certain that if I parse the retrieved complete byte[] (key or value) using the Tuple layer, it will reconstruct and separate the 10-byte version_stamp and the user version from it.
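The summary above can be sketched as a plain-Java parser that splits a completed 12-byte stamp into its three parts. It assumes big-endian fields in the order 8-byte transaction version, 2-byte batch order, 2-byte user version; the class and field names are hypothetical and this is not the Tuple layer's implementation.

```java
import java.nio.ByteBuffer;

/** Illustrative only: splits a 12-byte versionstamp into its assumed components. */
final class StampParts {
    final long transactionVersion; // bytes 0-7, set by the cluster
    final int batchOrder;          // bytes 8-9, set by the cluster
    final int userVersion;         // bytes 10-11, set by the client

    private StampParts(long transactionVersion, int batchOrder, int userVersion) {
        this.transactionVersion = transactionVersion;
        this.batchOrder = batchOrder;
        this.userVersion = userVersion;
    }

    static StampParts parse(byte[] twelveBytes) {
        if (twelveBytes.length != 12) {
            throw new IllegalArgumentException("expected 12 bytes, got " + twelveBytes.length);
        }
        ByteBuffer bb = ByteBuffer.wrap(twelveBytes); // big-endian by default
        long version = bb.getLong();
        int batch = bb.getShort() & 0xFFFF;           // treat as unsigned 16-bit
        int user = bb.getShort() & 0xFFFF;
        return new StampParts(version, batch, user);
    }

    public static void main(String[] args) {
        byte[] raw = ByteBuffer.allocate(12).putLong(42L).putShort((short) 3).putShort((short) 7).array();
        StampParts p = parse(raw);
        System.out.println(p.transactionVersion + " / " + p.batchOrder + " / " + p.userVersion);
    }
}
```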

A few things that I am still trying to determine:

  • What is the significance of the getCommittedVersion() API? It seems like the first 8 bytes (of 10) returned by it are purely an implementation detail, so what is the reason for exposing it?
  • Same question for getReadVersion() API.
  • Is version_stamp value (10 bytes) guaranteed to be unique for the lifetime of database? What about scenarios when the data is being restored from backup - is there a chance that the version_stamp can get reused from earlier (as hinted by Christophe)? If this is so, then I am curious what are some good use-cases of version_stamp - if uniqueness cannot be guaranteed in all situations? I thought that good use cases were (a) efficiently generating unique id for things and using this id in other rows (b) as a sequencing prefix for indexes providing log semantics.

(Alec Grieser) #5

For the purposes of versionstamps, I suppose that’s true. But the version means something more, in particular, all operations occur at a version. (Essentially, these are the versions we use for MVCC.) After a commit has completed, precisely any transaction with a read version at least as large as the commit version will see the results of that commit.

We also offer a setReadVersion() function that takes this 8-byte value. So, in theory, you could call getCommittedVersion() on a transaction and then create a new transaction, call setReadVersion() with the committed version you just got, and the new transaction would be able to see the commit you just did. Because of how the cluster gives out read versions, you would also be assured to get a read version at least as big as that version anyway, but maybe you’ve saved a round trip. (You could also feed setReadVersion() the result of previous calls to getReadVersion(), essentially caching your read versions and accepting stale reads instead of a round trip.) We don’t normally recommend this (it’s certainly an advanced technique and should only really be done by someone who knows what they are doing™), but it is there.
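The visibility rule described here can be illustrated with a toy single-key multi-version store in plain Java. This is a simplified model, not FoundationDB code: the class and method names are made up, and versions simply increase by one per commit.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy single-key MVCC store; a model of the visibility rule only. */
final class ToyMvcc {
    private final TreeMap<Long, String> history = new TreeMap<>(); // commit version -> value
    private long lastCommitted = 0;

    /** Commits a value and returns its commit version (loosely analogous to getCommittedVersion()). */
    long commit(String value) {
        lastCommitted += 1;
        history.put(lastCommitted, value);
        return lastCommitted;
    }

    /** Loosely analogous to getReadVersion(): at least as large as every completed commit. */
    long freshReadVersion() {
        return lastCommitted;
    }

    /** Reads as of readVersion: the newest value whose commit version is <= readVersion. */
    String readAt(long readVersion) {
        Map.Entry<Long, String> e = history.floorEntry(readVersion);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        ToyMvcc db = new ToyMvcc();
        db.commit("a");
        long cached = db.freshReadVersion();   // a cached read version
        long committed = db.commit("b");
        System.out.println(db.readAt(committed)); // sees the commit just made
        System.out.println(db.readAt(cached));    // stale read at the cached version
    }
}
```

Reading at the commit version sees the new value, while reading at an older cached version returns the previous one, which is the stale-read trade-off mentioned above.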

Yes, for the lifetime of a database, no two transactions will get the same 10-byte versionstamp. If you are careful about how you give out user_version, you can maintain the invariant that no two keys have the same 12-byte versionstamp associated with them.

Yes. The versionstamp essentially goes up from 0 and increases by a (roughly) constant rate. If you restored data into a fresh database, then there is no guarantee that the database’s version is greater than the original one. If you restore into the same database (like, let’s say you back up a database, clear it out, and then restore), then the monotonicity and uniqueness of versionstamps will be maintained.

Those are still all good examples of cases where you might want to use versionstamps. Within a database, the uniqueness and monotonicity of versionstamps is always maintained. The only problem is if you take data from one cluster and put it into another one (perhaps through a restore, perhaps because you moved data around). Adding a prefix indicating the “generation” of the data (the number of times it has been moved around) is one way to get around the fact that you can’t depend on the cluster you restored into being correctly ordered. If we had a way of advancing a cluster’s version arbitrarily, that might be another way.
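A minimal sketch of the generation-prefix idea, assuming a 4-byte big-endian generation counter placed in front of the 10-byte versionstamp. The key layout, class name, and sample versions are hypothetical choices for illustration, not a FoundationDB convention.

```java
import java.nio.ByteBuffer;

/** Illustrative only: orders keys by (generation, versionstamp) so a restore into a fresh cluster still sorts later. */
final class GenerationKeySketch {
    /** Builds generation (4 bytes, big-endian) followed by the 10-byte stamp. */
    static byte[] key(int generation, byte[] stamp10) {
        if (generation < 0 || stamp10.length != 10) {
            throw new IllegalArgumentException("need a non-negative generation and a 10-byte stamp");
        }
        return ByteBuffer.allocate(4 + 10).putInt(generation).put(stamp10).array();
    }

    /** Builds a 10-byte stamp from made-up values, for the demo only. */
    static byte[] stamp(long commitVersion, int batchOrder) {
        return ByteBuffer.allocate(10).putLong(commitVersion).putShort((short) batchOrder).array();
    }

    /** Unsigned lexicographic comparison, the order in which raw byte keys sort. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] fromOldCluster = stamp(5_000_000L, 0); // written before the move
        byte[] fromNewCluster = stamp(100L, 0);       // fresh cluster restarted at a low version
        System.out.println(compareUnsigned(fromOldCluster, fromNewCluster) > 0);                 // raw stamps sort the wrong way
        System.out.println(compareUnsigned(key(0, fromOldCluster), key(1, fromNewCluster)) < 0); // prefixed keys sort correctly
    }
}
```

The generation has to be bumped by whoever moves the data, for example as part of the restore procedure, since the cluster itself knows nothing about it.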


(gaurav) #6

Thank you for the detailed clarifications; I really appreciate you taking the time to answer these :slight_smile:


(Amirouche) #7

Very interesting. It helps in the design of my layer, where I want to be able to pull some kind of time series, for instance from a production cluster to a preproduction cluster.