I think you have a few ideas already on how this might be done. In addition to what @alexmiller wrote, I will add a few of my own thoughts.
One strategy that can be used is to “version” your schema, either with some kind of hash, some kind of incrementing “schema version” number, or some combination of the two. As you suggested, you can keep multiple copies of the schema version information around in the database. Because this version/hash information is a lot smaller than the schema itself, you can probably get away with doing this without needing to worry about hitting the transaction size limit (of let’s say 1 MB). So for most operations, your flow is:
- Pick a schema version key at random.
- Check if the version is the same as your cached schema.
- If not, read the schema from the place where the schema is stored and cache it for future operations.
- Perform operation.
Then to do a schema update, it’s:
- Write the new schema to the schema place.
- Update all of the version/hash keys.
- Commit your single transaction.
If you had, let’s say, 100 such keys, then the transaction is, what, maybe 50 bytes per version key plus the schema size, so let’s say 5 kB for the version keys and however large your schema is. This can easily be done in a single transaction.
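The update side, with the same dict stand-in and made-up key names as before (in real FoundationDB, all of these writes would go into one transaction so the switch is atomic):

```python
db = {}

def update_schema(new_schema: bytes, new_version: bytes, num_version_keys: int = 100):
    """Sketch of the schema-update transaction: write the new schema blob
    and bump every version key. In FoundationDB these writes would all be
    issued in a single transaction and committed together."""
    # Write the new schema to the schema place.
    db["schema/blob"] = new_schema
    # Update all of the version/hash keys.
    for i in range(num_version_keys):
        db[f"schema/version/{i}"] = new_version
    # (commit happens here in a real transaction)

# Back-of-the-envelope transaction size: ~50 bytes per version key.
version_key_bytes = 100 * 50  # ≈ 5 kB, plus the schema blob itself
```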
Additionally, I would assume that for a sufficiently complicated schema, deserializing it from the database and producing a version of it that you could use within your application is costly enough that you wouldn’t want to do it with every operation. So even if you had a single version/hash key that you consulted every time (which would be a hot-key, yes), then at least you wouldn’t have to grab the full schema each time.
Notice that nothing in the above required a change in the FoundationDB software. (In other words, it could be implemented by a layer developer today.) In some sense, most of the other suggestions I’ve come across on how you could better handle this all kind of boil down to, “How could you make it easier for FoundationDB to do (what’s described above) for someone?” This would include:
- The ability to mark some keyspaces as requiring additional replication. Then your schema version queries would be load-balanced across more servers (i.e., across the additional replicas), so at least you’ve got that going for you, which is nice.
- If I’m understanding @alexmiller’s suggestion correctly, the idea that each storage server could keep schema information in its local storage. In some sense, this is taking the idea that some keyspaces should get additional replication ad infinitum and just putting that information everywhere. As long as there isn’t too much schema data (e.g., probably fine for the schema version; maybe not fine for the schema itself, or maybe both are fine), this should work. (I believe @alexmiller is also coupling the storage of the meta-data with server-side schema validation. I think this is optional, as the client could also just verify this if they have access to it, though I can see the performance boost from saving a round trip.)
But even with the additional optimizations, the above does require an extra read at the beginning of every operation (i.e., an extra round trip), which would be nice to avoid. There are a few ways of fixing that, some of which require FoundationDB to change.
One that doesn’t is somewhat analogous to your suggestion that you could write a database version in the future. I think that would work, but I would suggest using the commit version of the schema-updating transaction rather than the read version (which can be done by using the set_versionstamped_value mutation). I think this is necessary because if you want to enforce that meta-data changes don’t take effect sooner than N database versions in the future, then you need to make sure that the version you write with the meta-data update is at least N database versions beyond your commit version. If you based it on your read version instead, other transactions might read the schema information after your read but prior to your commit and be left with fewer than N database versions between then and when the schema is updated. I suppose if you somehow knew the maximum number of database versions that could exist between a transaction’s read version and its commit version, you could take that into account when you wrote the version at which your meta-data should start being used, but even with that, it seems safer to me to use the commit version.
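To make the commit-version idea concrete, here is a toy model. A monotonically increasing counter stands in for FoundationDB’s commit versions, and a callable value mimics the set_versionstamped_value mutation (which stamps a value with the transaction’s commit version at commit time); the key names and the choice of N are assumptions:

```python
N = 1_000_000  # minimum database versions before a schema change takes effect (an assumption)

class FakeDB:
    """Toy model: a monotonically increasing counter stands in for
    FoundationDB's commit versions."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def commit(self, writes):
        self.version += 1
        commit_version = self.version
        for key, value in writes.items():
            # Callable values receive the commit version, mimicking the
            # set_versionstamped_value mutation.
            self.data[key] = value(commit_version) if callable(value) else value
        return commit_version

db = FakeDB()
# Write the new schema; its activation version is derived from the *commit*
# version (not the read version), so no transaction can observe the schema
# with fewer than N versions of warning before the switch.
db.commit({
    "schema/blob": b"new-schema",
    "schema/activation_version": lambda cv: cv + N,
})
```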
Then you can almost do something like:
- Get the read version of a transaction.
- Check to see if the read version is before or after the atomic switchover version.
- If before, use the old schema; if after, use the new schema.
The problem occurs if there is a transaction that starts (i.e., gets its read version) before the atomic switchover version but commits (i.e., gets its commit version) after it. I believe you could probably get around this by having two switchover versions: a first version after which it is no longer safe to perform reads against the old schema, and a second version (which must be at least as far after the first as the maximum gap between a read version and a commit version for a given transaction, whatever that is) after which no writes can take place at the old version. Whether this is possible will depend on the nature of the schema change, and some changes will only require a single version. For example, if you add a table, then I think one version is fine unless you have any cross-table constraints (like foreign key constraints). For something like dropping an index, you can stop allowing any queries to that index starting with the first version, accept that maybe a few extra inserts will write index entries between the read-cutoff-version and the write-cutoff-version, and then delete the index data after the write-cutoff-version.
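The two-cutoff decision is simple enough to write down. A rough sketch, where the function name and the three mode labels are made up for illustration:

```python
def schema_for(read_version: int, read_cutoff: int, write_cutoff: int) -> str:
    """Decide how a transaction should treat the schema, given the two
    switchover versions. (write_cutoff - read_cutoff) must be at least the
    maximum gap between a transaction's read and commit versions."""
    if read_version < read_cutoff:
        # Old schema is still fully readable and writable.
        return "old"
    elif read_version < write_cutoff:
        # Grey zone: e.g. for a dropped index, stop serving queries from it,
        # but tolerate straggler transactions still writing index entries.
        return "transition"
    else:
        # Past the write cutoff: no old-schema writes can still commit,
        # so it is safe to e.g. delete the index data.
        return "new"
```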
I don’t like that that solution requires knowing the maximum difference between a transaction read version and a transaction commit version (which really sounds like database internals to me that the user shouldn’t rely on). Correctly orchestrating the list of read-acceptable and write-acceptable things for each part of the schema seems like it could be difficult to me as well, though the F1 paper that @alexmiller mentioned kind of goes into this, so maybe it wouldn’t actually be that hard. You also need something along those lines anyway (probably) to handle things like adding an index, where the index has to be back-filled with old data and updated inline with new data, so maybe it’s not so bad. Hm.
One thing you could do to solve that problem would be to have the ability to add a write-conflict-key in the future. This does not exist right now, but if it did, then you could theoretically use it to invalidate any transactions that cross the atomic switchover line. The problem, however, is that it would have to be durable (in the general case; if it is mandated that this version is less than the number of versions that gets advanced during a recovery, whatever that may be, it might actually be fine to leave it as an in-memory-only thing), which means writing it somewhere that probably isn’t super scalable, though maybe it’s fine if it’s an “updated once every day at most” kind of value.
There are also a few changes you could imagine to FoundationDB that would make it possible to do this kind of thing without the extra round trip, but they would require some internals changes. From what I can recall being suggested by other people at other times, they would be:
- The addition of a piece of meta-data that is returned with a get-read-version call. This meta-data should be user-settable (or maybe user-incrementable), the idea being that each time the schema is updated, you also update this key. Then you can essentially use it as, like, a “cache invalidation” token. In some sense, this is the opposite of the other suggestions that spread the meta-data out to more places, insofar as it puts this information into a centralized location and instead depends on the fact that it’s going to places we already need to check during a transaction’s lifecycle anyway. This key would be relatively low-throughput and wouldn’t scale well to having more than one such key (or, well, maybe more than like ten), but maybe that’s fine for this purpose.
- This would only work for read-write transactions (or read-only transactions where the read-conflict ranges get verified at the end, which is not the case right now), but if you could set a read-conflict-key for a key “in the past”, then you could imagine adding that conflict range to your schema version key. Then your transaction will be failed at commit time if the schema has been changed. The issue here is that you have to be able to handle arbitrarily bad-looking data during the course of any transaction that gets a read version after the change has happened but doesn’t learn that it is on an old schema version until later. (And of course, you have no ability to validate everything is fine for read-only transactions.)
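From the client’s point of view, the first of those suggestions (the meta-data token returned with get-read-version) might look something like this. To be clear, this API is entirely hypothetical; nothing here exists in FoundationDB today, and all names are made up:

```python
class FakeCluster:
    """Hypothetical: get_read_version also returns a user-settable token
    that is bumped on every schema change."""
    def __init__(self):
        self.version = 0
        self.schema_token = 0

    def get_read_version(self):
        self.version += 1
        return self.version, self.schema_token

cluster = FakeCluster()
cached_token, cached_schema = None, None

def begin_transaction():
    """The token rides along with the read version we already have to
    fetch, so a stale schema cache is detected with no extra round trip."""
    global cached_token, cached_schema
    read_version, token = cluster.get_read_version()
    if token != cached_token:
        # Stand-in for re-reading the schema from the database.
        cached_schema = f"schema@{token}"
        cached_token = token
    return read_version, cached_schema
```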
There is another strategy that maybe doesn’t require any FoundationDB changes, but it is a little harder to pull off. Here the strategy is that you design your app to have two modes:
- Optimistic mode: Always assume the schema is up to date.
- Pessimistic mode: Never assume the schema is up to date.
Most of the time, you run in optimistic mode for performance reasons. Right before you change the meta-data, you push out a change (maybe a new version of your application; maybe you change some runtime property if you can do that) and switch your application to pessimistic mode, so now maybe you have a hot key or maybe all of your operations require an extra round trip or whatever. Then you push the meta-data change, and everything starts to use that (which is picked up right away because they are all in pessimistic mode). Then you switch your application back to optimistic mode.
This only works if you can be certain that all of your application instances switch over to the new mode and back. If it’s possible that you have some app instances that are partitioned away from the main system and might miss the update, then you can get into a really bad situation where you have multiple schema versions being used at once. For this reason, I generally wouldn’t recommend it, but it’s possible that it might work in some environments.