Implementing atomic DDL for SQL schema

Implementing a SQL database as a layer with a mutable schema logically requires storing the schema somewhere. In order to use that schema, it needs to be read during a transaction, or somehow updated in a local cache before the query starts.

Is there a strategy that has worked for this type of pattern in the past that allows schema modifications to be atomic?

The first problem is that simply reading the schema on every transaction would bottleneck overall throughput at the throughput of the storage servers holding the schema keys. Storing the schema in some sort of a log structure and using a watch would make it eventually consistent as far as I can tell, which may or may not be a problem, but would require more work at the layer level to present a consistent view to the end-user.

A third option is storing the schema in multiple places such that each copy is most likely on a different storage server, and updating all copies whenever the schema changes. This could bump up against limits on how much data can be written in a single transaction, but that would be quite a large schema! Each layer process would then pick, at random, one of the locations the schema has been written to and read it from there.

Am I missing any more options that would allow me to provide transactional schema modifications? This is all assuming the DDL in question does not require writing any new data.

If you haven’t skimmed through the paper on F1’s schema changes, I’d highly recommend it.

As commented in the paper, you’d probably wish to do explicit reads every so often rather than a watch, as theoretically, your layer process could be partitioned away from only the storage servers that have the schema. Watches give you no notification of this, which means that partitioned layer processes could continue to process user queries at significantly older schema versions, potentially causing you to have inconsistent data. If you require re-reading the schema every 1 minute, and refuse to answer queries if you haven’t read the schema within the past minute, then you’ll prevent this.
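If it helps, that re-read-or-refuse behaviour is simple to wrap up. A minimal sketch in Python (the lease length and the loader callable are placeholders, not anything from an existing layer):

```python
import time

SCHEMA_LEASE_SECONDS = 60  # hypothetical lease length


class SchemaCache:
    """Cache the schema, explicitly re-reading it at least once per lease period."""

    def __init__(self, load_schema):
        self.load_schema = load_schema  # callable that reads the schema from the database
        self.schema = None
        self.read_at = float("-inf")

    def get(self):
        if time.monotonic() - self.read_at > SCHEMA_LEASE_SECONDS:
            try:
                # An explicit read rather than a watch, so a partitioned layer
                # process can't keep serving queries on a stale schema forever.
                self.schema = self.load_schema()
                self.read_at = time.monotonic()
            except Exception:
                pass  # fall through to the lease check below
        if time.monotonic() - self.read_at > SCHEMA_LEASE_SECONDS:
            # The lease expired and we couldn't refresh: refuse to answer queries.
            raise RuntimeError("schema lease expired")
        return self.schema
```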

Presenting a consistent view of a schema change to a client would only require consistency within a session, which is much easier to do than across a whole database. Your clients can always send the most recent version of the schema that they know about. If the server doesn’t have that version yet, it waits until it does to handle the query (or goes and fetches the schema, or returns a retryable error). If the server has a newer schema, then it corrects the client and distributes the new schema with the version.
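Purely as an illustration of that session check (the names here are made up, not from any existing layer):

```python
class SchemaTooNew(Exception):
    """Retryable: the client knows a schema version the server hasn't loaded yet."""


def check_session_schema(server_schema_version, client_schema_version):
    # Each request carries the newest schema version the client has seen.
    if client_schema_version > server_schema_version:
        # Server is behind: refresh the schema first (or wait, or bounce the
        # request back to the client with a retryable error).
        raise SchemaTooNew(client_schema_version)
    # Server is at or ahead of the client: handle the query, and return the
    # server's version in the response so the client can catch up.
    return server_schema_version
```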


If you wanted the completely atomic cutover semantics, I feel like implementing support in FDB to be able to:

  1. Read from the storage-server local data (that’s secretly maintained in \xFF\xFF\xFF, but completely inaccessible from outside the storage server currently)
  2. Do commits that push to this space on every storage server

Would give you an easy way to do a local read on any storage server to get your schema. A schema would change atomically at a version, and any server you contact for a read should also be able to give you the schema metadata.


The atomic cutover semantics would be ideal. Modifying FoundationDB to add that feature definitely sounds useful, and I could see it being used for other types of small reference data that layers need frequently.

The F1 schema evolution paper and the concept of schema leases gave me an idea as well. If you limit the scope of the problem from “the schema can change at any time” to “the schema can change at most once every N database versions”, where N is a sufficiently long window to keep the read rate on the schema keys low, a new schema could be written to a database version in the future. It would perform the same operations as described in the F1 paper around schema leases. So long as N is less than the maximum duration of a transaction, you wouldn’t need extra write fencing, because that would be handled by FoundationDB refusing to commit transactions older than the maximum duration.

I would have to think about this a bit more, but I think you could rely on the database version to provide the atomic cutover if you narrowed the problem as I described.

That paper is very helpful regardless, so thanks!

Just for clarity, by “write to a database version in the future” I mean: take the read version of a transaction, increment it by N, and write the schema under a key derived from that calculated version.
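Concretely, something like this sketch (Python bindings; the key layout and the value of N are made up, and since FoundationDB versions advance at roughly a million per second, an N of 60 million is about a minute of notice):

```python
import struct

import fdb

fdb.api_version(630)

SCHEMA_BY_VERSION = b'schema/by_version/'  # hypothetical keyspace: schemas keyed by activation version
N = 60 * 1_000_000                         # ~60 seconds' worth of database versions


@fdb.transactional
def schedule_schema_change(tr, new_schema_bytes):
    # The new schema takes effect N versions after this transaction's read version.
    activation_version = tr.get_read_version().wait() + N
    tr[SCHEMA_BY_VERSION + struct.pack('>Q', activation_version)] = new_schema_bytes
    return activation_version
```

A reader would then use the newest entry in that range whose activation version is at or below its own read version.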

I think you have a few ideas already on how this might be done. In addition to what @alexmiller wrote, I will add a few of my own thoughts.


One strategy that can be used is to “version” your schema, either with some kind of hash, some kind of incrementing “schema version” number, or some combination of the two. As you suggested, you can keep multiple copies of the schema version information around in the database. Because this version/hash information is a lot smaller than the schema itself, you can probably get away with doing this without needing to worry about hitting the transaction size limit (of, let’s say, 1 MB). So for most operations, your flow is (a rough sketch in code follows the list):

  1. Pick a schema version key at random.
  2. Check if the version is the same as your cached schema.
  3. If not, read the schema from the place where the schema is stored and cache it for future operations.
  4. Perform operation.
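Here’s a rough sketch of that flow with the Python bindings (the key names, the number of version-key copies, and the process-local cache are all mine):

```python
import random

import fdb

fdb.api_version(630)

NUM_VERSION_KEYS = 100  # hypothetical number of copies of the version key
VERSION_KEYS = [b'schema/version/%03d' % i for i in range(NUM_VERSION_KEYS)]
SCHEMA_KEY = b'schema/current'  # where the full serialized schema lives

cached_version = None
cached_schema = None


@fdb.transactional
def run_with_schema(tr, op):
    global cached_version, cached_schema
    # 1. Pick a schema version key at random to spread the read load.
    version = tr[random.choice(VERSION_KEYS)].wait()
    # 2-3. If it doesn't match the cache, re-read (and re-deserialize) the schema.
    if version != cached_version:
        cached_schema = tr[SCHEMA_KEY].wait()  # deserialize into your in-memory form here
        cached_version = version
    # 4. Perform the operation against the (now current) schema.
    return op(tr, cached_schema)
```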

Then to do a schema update, it’s (again sketched below):

  1. Write the new schema to the schema place.
  2. Update all of the version/hash keys.
  3. Commit your single transaction.
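Continuing the sketch above, the update side is one transaction that writes the schema and bumps every copy of the version key, so readers see both changes together:

```python
import os


@fdb.transactional
def update_schema(tr, new_schema_bytes):
    new_version = os.urandom(16)  # a random token; an incrementing number or a hash works too
    tr[SCHEMA_KEY] = new_schema_bytes
    for key in VERSION_KEYS:
        tr[key] = new_version
```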

If you had, let’s say, 100 such keys then the transaction is, what, maybe 50 bytes per version key plus the schema size, so let’s say 5 kB for the version keys and however large your schema is. This can easily be done in a single transaction.

Additionally, I would assume that for a sufficiently complicated schema, deserializing it from the database and producing a version of it that you could use within your application is costly enough that you wouldn’t want to do it with every operation. So even if you had a single version/hash key that you consulted every time (which would be a hot-key, yes), then at least you wouldn’t have to grab the full schema each time.

Notice that nothing in the above required a change in the FoundationDB software. (In other words, it could be implemented by a layers developer today.) In some sense, most of the other suggestions I’ve come across on how you could better handle this all kind of boil down to, “How could you make it easier for FoundationDB to do (what’s described above) for someone?” This would include:

  • The ability to mark some keyspaces as requiring additional replication. Then your schema version queries would be load-balanced across more servers (i.e., across the additional replicas), so at least you’ve got that going for you, which is nice.
  • If I’m understanding @alexmiller’s suggestion correctly, the idea that each storage server could keep schema information in its local storage. In some sense, this is taking the idea that some keyspaces should get additional replication ad infinitum and just putting that information everywhere. As long as there isn’t too much schema data (e.g., probably fine for the schema version; maybe not fine for the schema itself, or maybe both are fine), this should work. (I believe @alexmiller is also coupling the storage of the meta-data with server-side schema validation. I think this is optional, as the client could also just verify this if they have access to it, though I can see the performance boost from saving a round trip.)

But even with the additional optimizations, the above does require an extra read at the beginning of every operation (i.e., an extra round trip), which would be nice to avoid. There are a few ways of fixing that, some of which require FoundationDB to change.

One that doesn’t is somewhat analogous to your suggestion that you could write a database version in the future. I think that would work, but I would suggest using the commit version of the schema-updating transaction rather than the read version (which can be captured by using the set_versionstamped_value mutation). I think this is necessary because if you want to enforce that meta-data mutations don’t happen more often than once every N database versions, then when you write out the meta-data update you need to make sure that the new version is at least N database versions later than your commit version (otherwise you might have other transactions that read the schema information after your read but prior to your commit and that now have fewer than N database versions between now and when the schema is updated). I suppose if you somehow knew the maximum number of database versions that could exist between a transaction’s read version and its commit version, you could take that into account when you wrote the version at which your meta-data should start being used, but even with that, it seems safer to me to use the commit version.
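A sketch of that write side (assuming API version 520 or later, where the last four bytes of the parameter are a little-endian offset telling FDB where to splice in the 10-byte versionstamp; the key names are mine):

```python
import struct

import fdb

fdb.api_version(630)

SCHEMA_KEY = b'schema/current'
SWITCHOVER_KEY = b'schema/switchover'  # holds the commit version of the last schema change


@fdb.transactional
def write_schema_with_switchover(tr, new_schema_bytes):
    tr[SCHEMA_KEY] = new_schema_bytes
    # Ten placeholder bytes for the versionstamp (commit version + batch order),
    # then a 4-byte little-endian offset of 0 marking where it gets spliced in.
    tr.set_versionstamped_value(SWITCHOVER_KEY, b'\x00' * 10 + struct.pack('<I', 0))
```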

Then you can almost do something like (sketched after the list):

  1. Get the read version of a transaction.
  2. Check to see if the read version is before or after the atomic switchover version.
  3. If before use old schema; if after, use new schema.
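A corresponding sketch of that read-side check, assuming the new schema becomes active N database versions after the commit version recorded above:

```python
import struct

import fdb

fdb.api_version(630)

SWITCHOVER_KEY = b'schema/switchover'
N = 60 * 1_000_000  # hypothetical notice window, in database versions


@fdb.transactional
def use_new_schema(tr):
    stamp = tr[SWITCHOVER_KEY].wait()
    if stamp is None:
        return False  # no schema change has ever been recorded
    # The first 8 bytes of a versionstamp are the big-endian commit version.
    switchover_version = struct.unpack('>Q', stamp[:8])[0] + N
    # 1-3. Compare this transaction's read version against the switchover version.
    return tr.get_read_version().wait() >= switchover_version
```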

The problem occurs if there is a transaction that starts (i.e., gets its read version) before the atomic switchover version but commits (i.e., gets its commit version) after it. I believe you could probably get around this by having two switchover versions: the first is the version after which it is no longer safe to perform reads, and the second (which must be later than the first by at least the maximum gap between a read version and a commit version for a given transaction, whatever that is) is the version after which no writes can take place under the old schema. Whether this is possible will depend on the nature of the schema change, and some changes will only require a single version. For example, if you add a table, then I think one version is fine unless you have any cross-table constraints (like foreign key constraints). For something like dropping an index, you can stop allowing any queries to that index starting with the first version, accept that a few extra inserts may write index entries between the read-cutoff version and the write-cutoff version, and then delete the index data after the write-cutoff version.
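To make the two-cutoff idea a bit more concrete for the dropped-index example (the cutoff versions being whatever you recorded when scheduling the change):

```python
def dropped_index_access(read_version, read_cutoff, write_cutoff):
    # After read_cutoff, no query may use the index; between the cutoffs, writers
    # may still (harmlessly) maintain it; after write_cutoff nothing touches it,
    # and its data can be deleted in the background.
    can_query_index = read_version < read_cutoff
    must_maintain_index = read_version < write_cutoff
    return can_query_index, must_maintain_index
```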

I don’t like that that solution requires knowing the maximum difference between a transaction read version and a transaction commit version (which really sounds like database internals to me that the user shouldn’t rely on). Correctly orchestrating the list of read-acceptable and write-acceptable things for each part of the schema seems like it could be difficult to me as well, though the F1 paper that @alexmiller mentioned kind of goes into this, so maybe it wouldn’t actually be that hard. You also need something along those lines anyway (probably) to handle things like adding an index, where the index has to be back-filled with old data and updated inline with new data, so maybe it’s not so bad. Hm.

One thing you could do to solve that problem would be to have the ability to add a write-conflict key in the future. This does not exist right now, but if it did, then you could theoretically use it to invalidate any transactions that cross the atomic switchover line. The problem, however, is that it would have to be durable (in the general case; if it is mandated that this version is less than the number of versions that the database version gets advanced by during a recovery, whatever that may be, it might actually be fine to leave it as an in-memory-only thing), which means writing it somewhere that probably isn’t super scalable, though maybe it’s fine if it’s an “updated once every day at most” kind of value.

There are also a few changes you could imagine to FoundationDB that would make it possible to do this kind of thing without the extra round trip, but they would require some internals changes. From what I can recall being suggested by other people at other times, they would be:

  • The addition of a piece of meta-data that is returned with a get-read-version call. This meta-data should be user-settable (or maybe user-incrementable), the idea being that each time the schema is updated, you also update this key. Then you can essentially use it as, like, a “cache invalidation” token. In some sense, this is the opposite of the suggestions that involve spreading the meta-data out to more places, insofar as it puts this information in a centralized location and instead relies on the fact that it comes back with something we already have to fetch during a transaction’s lifecycle. This key would be relatively low-throughput and wouldn’t scale well to having more than 1 (or, well, maybe more than like 10), but maybe that’s fine for this purpose.
  • This would only work for read-write transactions (or read-only transactions where the read-conflict ranges get verified at the end, which is not the case right now), but if you could set a read-conflict key for a key “in the past”, then you could imagine adding that conflict range to your schema version key. Then your transaction will be failed at commit time if the schema has been changed. The issue here is that you have to be able to handle arbitrarily bad-looking data during the course of any transaction that gets a read version after the change has happened but doesn’t learn that it is using an old schema version until later. (And of course, you have no ability to validate that everything is fine for read-only transactions.)

There is another strategy that maybe doesn’t require any FoundationDB changes, but it is a little harder to pull off. Here the strategy is that you could design your app to have two modes:

  • Optimistic mode: Always assume the schema is up to date.
  • Pessimistic mode: Never assume the schema is up to date.

Most of the time, you run in optimistic mode for performance reasons. Right before you change the meta-data, you push out a change (maybe a new version of your application; maybe you change some runtime property if you can do that) and switch your application to pessimistic mode, so now maybe you have a hot key or maybe all of your operations require an extra round trip or whatever. Then you push the meta-data change, and everything starts to use that (which is picked up right away because they are all in pessimistic mode). Then you switch your application back to optimistic mode.
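A self-contained sketch of the two modes (the module-level flag stands in for whatever runtime switch you’d actually use, and the key names are made up):

```python
import fdb

fdb.api_version(630)

PESSIMISTIC_MODE = False  # flipped to True before a schema change is pushed, then back

VERSION_KEY = b'schema/version'
SCHEMA_KEY = b'schema/current'

cached_version = None
cached_schema = None


@fdb.transactional
def run_operation(tr, op):
    global cached_version, cached_schema
    if PESSIMISTIC_MODE or cached_schema is None:
        # Pessimistic: never trust the cache; re-check the version key in every transaction.
        version = tr[VERSION_KEY].wait()
        if version != cached_version:
            cached_schema = tr[SCHEMA_KEY].wait()
            cached_version = version
    # Optimistic: trust the cache and skip the extra round trip entirely.
    return op(tr, cached_schema)
```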

This only works if you can be certain that all of your application instances switch over to the new mode and back. If it’s possible that you have some app instances that are partitioned away from the main system and might miss the update, then you can get into a really bad situation where you have multiple schema versions being used at once. For this reason, I generally wouldn’t recommend it, but it’s possible that it might work in some environments.

Oh, also, at some point, this switched from, “@alloc is answering a question about schema changes using techniques he’s known about and considered,” to, “@alloc is speculating wildly and generally musing about the problem”, so some of these ideas might be terrible or half-baked or under-explained. Sorry about that.

There would definitely be parts that require complex orchestration over longer than the duration of a transaction, as you mention and as the F1 paper also outlines. The reason to want the changes to be atomic is to manage the state of indexes from non-existent, to write-only, to read-write, for example. If you can build on that, it would be simpler to know when you’re “done” with a given async schema change.

Thanks for the suggestions. I’m not confident enough in my C++ abilities to design new core features at this point, but the other paths seem workable enough for now to test out.


Could you accomplish correct caching by adding a READ conflict range (link)? The general process being:

  1. Read current Schema from the database
  2. Use current schema in transactions but add a READ conflict range for the schema’s key
  3. If a schema write happens, your transaction will conflict and you can re-read the schema from the database (roughly sketched after this list).
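Roughly, in code (the key name and cached schema are made up; this uses a manual retry loop instead of @fdb.transactional so the schema can be refreshed when a conflict happens):

```python
import fdb

fdb.api_version(630)

SCHEMA_KEY = b'schema/current'  # hypothetical location of the schema
cached_schema = None            # assume this was read once at startup


def run_with_schema_conflict_check(db, op):
    global cached_schema
    tr = db.create_transaction()
    while True:
        try:
            # Pretend we read the schema key, so a concurrent schema change
            # makes this transaction conflict at commit time.
            tr.add_read_conflict_key(SCHEMA_KEY)
            result = op(tr, cached_schema)
            tr.commit().wait()
            return result
        except fdb.FDBError as e:
            conflicted = (e.code == 1020)  # 1020 = not_committed, i.e. a conflict
            tr.on_error(e).wait()          # resets the transaction, or re-raises if not retryable
            if conflicted:
                # We can't tell whether it was the schema or another read that
                # conflicted, so just re-read the schema before retrying.
                cached_schema = tr[SCHEMA_KEY].wait()
```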

I’m not sure if the transaction failure event gives you enough information to determine it was the schema read that conflicted (as opposed to one of your other operations), but my gut feeling is that this will improve performance enough to make it worthwhile. Also, I may be wrong about how much performance overhead adding a conflicting read range incurs.

Adding a read conflict range is fairly cheap. It ends up being the case that the resolver has to do an extra check in-memory. As noted, you don’t have enough information returned to you to let you know whether it was the schema change that caused the conflict or whether it was something else, but that’s usually fine. (You can adopt a rule like every time there is a conflict, check the schema again.)

The problem, though, is that there is a gap where the schema might change that you might miss. In particular, the following might happen:

  1. You read and cache the schema in transaction 1.
  2. Someone else updates the schema in a separate transaction.
  3. You begin transaction 2. It gets a read version after the above schema update has happened. Therefore, when you commit and check to see if the schema matches, the resolver sees that there have been no updates to that key since transaction 2 began, and your transaction (erroneously) succeeds.

This is partially the idea behind my suggestion that if you could add a read conflict key “in the past”, you could close this gap, because now the resolver would check that the schema hasn’t been mutated since the end of transaction 1. But this also won’t work for read-only transactions, and all of your client code would have to be designed to tolerate differences between the schema and the database because schema validation won’t happen until the transaction ends.