Are version_stamps guaranteed to be unique and monotonic for the life of an FDB cluster? Are there scenarios in which version_stamps may be reset to a value lower than one already generated?
In this discussion, Christophe mentioned that if a new cluster is restored from a backup, the version_stamps may be reset to 0 (or, more generally, to some start value that is <= a value already generated earlier in the fdb cluster from which the backup was taken).
By the way: there is no guarantee that version stamps will always go up: if you are restoring from a backup after completely reinstalling a new cluster, it may be possible that the read version starts again from 0… The conditions to make this happen may be improbable, but not impossible!
Is there a way to avoid this? Maybe by including in the backup some metadata that tells the FDB cluster to start version_stamp generation from a given value onwards? Or is there some other approach that clients can take themselves to get around this? Any pointers will be very helpful.
I was thinking of using version_stamps for multiple purposes:
as a reference key between two key-values (e.g. a data_row and its corresponding index_row).
as a prefix key in a log index so that I can “tail” the index from a given bookmark (version_stamp), and cycle… (a sketch follows this list).
etc.
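To make the second use case concrete, here is roughly what I have in mind with the Python bindings (the `log` subspace and the helper names are just placeholders, not an existing API):

```python
import fdb

fdb.api_version(620)
db = fdb.open()

log = fdb.Subspace(('log',))  # placeholder subspace for the log index

@fdb.transactional
def append(tr, payload):
    # The incomplete Versionstamp is replaced with the commit version at commit
    # time, so keys end up in commit order.
    key = log.pack_with_versionstamp((fdb.tuple.Versionstamp(),))
    tr.set_versionstamped_key(key, payload)

@fdb.transactional
def tail(tr, bookmark, limit=100):
    # Resume strictly after the last key already processed (the bookmark).
    begin = fdb.KeySelector.first_greater_than(bookmark) if bookmark else log.range().start
    return list(tr.get_range(begin, log.range().stop, limit=limit))
```

The caller would persist the key of the last entry it handled and pass it back as the bookmark on the next call.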
But if I cannot guarantee the uniqueness and monotonicity of these, then it becomes difficult to use them for the above use-cases.
I have never heard of that possibility until now, but it’s quite annoying if true.
My idea for a workaround is to prefix the versionstamp with a counter that represents the version of that installation of the database.
So if you choose a 1-byte prefix, you get 255 cluster re-installations, provided you increment that prefix on each re-installation as part of the bootstrap process, after restoring the data but before starting the application.
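If I understand the raw-key variant of this correctly, the atomic op needs the key to end with the little-endian byte offset of the 10-byte versionstamp placeholder (4 bytes at API version 520 and later), so building such a key by hand looks roughly like this (Python bindings; the key layout is only an illustration):

```python
import struct

GENERATION = b'\x01'  # bumped by the bootstrap process on each re-installation

def append_with_generation(tr, subspace_prefix, value):
    # Key layout: <generation byte><subspace prefix><10-byte versionstamp placeholder>.
    prefix = GENERATION + subspace_prefix
    placeholder = b'\x00' * 10  # overwritten with the commit versionstamp
    # set_versionstamped_key expects the 4-byte little-endian offset of the
    # placeholder to be appended to the key (API version 520 and later).
    key = prefix + placeholder + struct.pack('<I', len(prefix))
    tr.set_versionstamped_key(key, value)
```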
Yeah, that sounds like a reasonable enough workaround for this. You could also use a tuple-encoded integer instead of a single byte. It works out to be an extra byte for values between 1 and 255, but it also means that you aren’t limited to 255 restores.
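With the tuple layer the offset bookkeeping disappears; a minimal sketch, assuming the generation number is read from wherever the bootstrap process records it:

```python
import fdb

fdb.api_version(620)
db = fdb.open()

log = fdb.Subspace(('log',))  # placeholder subspace

@fdb.transactional
def append(tr, generation, payload):
    # Tuple-encoded generation followed by an incomplete versionstamp; keys
    # written under a later generation always sort after those of an earlier one.
    key = log.pack_with_versionstamp((generation, fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(key, payload)
```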
You could also imagine using the prefix if you want to move data between clusters for other reasons. Suppose, for example, that you had multiple queues (maintained using versionstamps) all on the same cluster. Then you might decide you want to start sharding across multiple clusters (because maybe you want to serve some queues from one locale and another set of queues from another locale). Then you can use this prefix to safely copy a queue from one cluster to another (with the prefix essentially being the number of times you’ve copied the queue between clusters).
Also, there is already work that bumps the current database version on a DR switchover so that the same versionstamp log can be used even if you switch from the primary to the secondary. In theory, similar work could be done on a database restore. But this would definitely require a fair amount of core development.
I will also say that restoring from a backup should be a fairly rare occurrence. If you are restoring data into the same cluster because an application did something like delete an extra range accidentally, then you want to restore with the same versionstamps as you did when you inserted the data the first time (or your references won’t match up). If you have to restore from backup because some catastrophe meant that all of your data were lost, then you’re in a somewhat stickier situation, but that should be very rare. At that point, it might be safer to reinsert all of the data in your application using a new version history anyway for data integrity reasons.
Thanks for the suggestions! I can easily incorporate these in my store design.
I think this information regarding VersionStamps and the potential workarounds may be generally useful in the main documentation for others trying to model their applications using VersionStamps.
You would still want the prefix if you plan on operating multiple clusters and wish to move objects that reference versionstamps between the clusters.
For example, you might have a change log of data for each user of your app, and users can choose to move their data between clusters hosted in different geographic regions.
Yeah, that all sounds right to me: the previously identified problem of versions going back in time on a restore should be resolved with that commit, and you still might want a “generation” prefix to record how many times you’ve moved data between clusters, because that isn’t necessarily done with a restore. But now you don’t need to somehow factor the number of restores into your generation number; it is sufficient to make it the number of moves.
One use case that we had: custom backup/restore, performed at the application level, without using the fdbbackup/fdbrestore tools.
For example, all documents are exported into some format on disk (JSON, protobuf, …) without any secondary indexes. When reimporting, indexes are recomputed on the fly.
It is frequent to reimport the data onto a test/dev cluster that is wiped beforehand (stop, delete all files, restart, “configure new single ssd”, …). In this case, the read version will start at zero, while the imported data may contain versionstamps that came from the source cluster.
One solution would be an admin command that tells the cluster to skip ahead to at least read version X (the custom backup tool would need to store the version of the cluster at the time of the backup somewhere). Maybe some special key in the \xff/.... keyspace? Or would it require a new API exposed by the C API?
But I’m surprised that the code does not check whether the new value is lower than the current one, and instead unconditionally updates to whatever the caller asked for. Looks like a potential disaster if not used carefully…
You can set the minimum required version to whatever you wish. Lowering it just means that recovery is allowed to pick a lower version. The check that you wish to see is in the recovery logic.
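For the application-level restore case, one way to get this effect (assuming a reasonably recent fdbcli, which has an `advanceversion` command) is to record the source cluster’s read version at export time and push the wiped cluster past it before re-importing; a rough sketch:

```python
import subprocess
import fdb

fdb.api_version(620)
db = fdb.open()

@fdb.transactional
def current_read_version(tr):
    # Recorded alongside the exported documents at backup time.
    return tr.get_read_version().wait()

def advance_cluster_version(min_version):
    # Run against the destination cluster before re-importing. Versionstamps
    # generated afterwards will then sort after the imported ones. Asking for a
    # version lower than the current one does not move the cluster backwards;
    # that check lives in the recovery logic, as noted above.
    subprocess.check_call(['fdbcli', '--exec', 'advanceversion %d' % min_version])
```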