Versioning of special key space

There are three topics discussed in this thread:

  1. Should FDB disallow range-read to operate across multiple special key spaces?
  2. How to version the special key space?
  3. Should FDB support Linux-driver-like pluggable functionality, so that different users (company A and company S) can have their own customized operations on a special key (or a special sub-key)?

For item 1: the current implementation already supports range reads across multiple special key spaces, and it is easy to add a restriction that disallows them. Can we have this more general implementation first and then decide whether we should disallow cross-special-key-space range reads? (It would be better to open a new topic for this particular problem.)

If we assume FDB users are sophisticated, they can avoid doing cross-special-key-space range reads on their end.

For item 2, I feel there is less disagreement.
An old API version probably should not, by default, support functionality introduced in a newer API version, for both performance and compatibility reasons, as discussed above.

For item 3, it is probably good to have if it is ever needed.

To help us reach an agreement, can @zjuLcg summarize the pros and cons of each solution for each of the three items in a shared doc?

About “Should FDB disallow range-read to operate across multiple special key spaces?”.

In case we cannot reach an agreement on this, I think it is better to disallow range-read across different special key spaces for compatibility reasons: if we allow it in the first version of this feature and some users use the feature, these users will be impacted when we disallow the cross-special-keyspace range reads.

The implementation can still be based on PR 2662, with restrictions added on top.

Why not specify the space as \xff\xff{version} and allow range reads, but either explicitly ignore returned data that is above your current version or only do range scans that end with your version? Then features can be introduced with each version with no concern.
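A minimal sketch of the "ignore data above your current version" variant, to make the idea concrete. The key layout (decimal version digits directly after \xff\xff, followed by a path) and the names VersionedSpecialKeys, versionOf, and visibleTo are assumptions made for illustration; nothing here is an existing FDB API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VersionedSpecialKeys {
    // Special key space prefix "\xff\xff", modelled as a Java string for brevity.
    static final String PREFIX = "\u00ff\u00ff";

    /** Returns the version encoded right after the prefix, or -1 if there is none. */
    static int versionOf(String key) {
        if (!key.startsWith(PREFIX)) return -1;
        int start = PREFIX.length();
        int end = start;
        while (end < key.length() && key.charAt(end) >= '0' && key.charAt(end) <= '9') end++;
        if (end == start) return -1;
        return Integer.parseInt(key.substring(start, end));
    }

    /** Keeps only the keys whose version is at most the client's API version. */
    static List<String> visibleTo(int apiVersion, List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) {
            int v = versionOf(k);
            if (v >= 0 && v <= apiVersion) out.add(k);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical keys: one introduced at version 630, one at version 700.
        List<String> scanned = Arrays.asList(PREFIX + "630/status/json", PREFIX + "700/new_feature");
        System.out.println(visibleTo(630, scanned).size());
    }
}
```

A client at version 630 would drop the 700 key on its side, so the server never has to know which client version issued the scan.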

This is equivalent to the initial proposal, where key ranges would not change across API versions. But it also raises the question of whether something like getRange('\xff\xff630', '\xff\xff700') should be legal. The argument would be the same as above: it is unsafe and probably nobody should ever do that, but it would be consistent with how fdb works for the normal key space.

@markus.pilman You mentioned that in the SQL DB world, a DBA can use SQL to view and control the DB’s state and status. Do you have a pointer to this functionality in an existing DB?

I’d like to have a concrete idea of how the problems (both cross-module range read and versioning) were handled in the other DBs.

Sure. Here are some examples for popular DBs (I am not familiar with all of them, but I hope these are the correct documentation links):

Btw: this also seems to be the case for at least some NoSQL stuff, for example MongoDB and Cassandra.

Edit: Almost forgot the most important one:

One thing I’ve been wondering and this list makes me think of again is what kind of functionality are we intending to be exposed through this API? If it’s only for reporting various statistics of the system, then I wouldn’t really have any qualms with reading the whole thing. That seems to be mostly what is showing up in these other systems based on these links from a quick look.

For some reason I was (maybe mistakenly) under the impression that this key-space was intended to allow exposing arbitrary operations to be performed, such as the examples I gave recently about fetching a transaction’s versionstamp or reading a key. That’s part of the reason why I was a bit hesitant about allowing the operations to be performed in the same query, as they could very well all have different semantics and costs. If that’s not what this key space is intended to support, though, then it might explain our disconnect.

I mean we currently are planning to do some of this stuff - for example, the key conflict set is planned to be exposed through this. But for the sake of argument, let’s focus on the versionstamp for now.

I think part of the disagreement is also that we don’t agree how such a thing should be exposed. I think we could very well have a key \xff\xff/transaction/this/versionstamp that doesn’t exist if the transaction didn’t commit and will just appear after the commit. If I understand you correctly, you don’t agree with that (why is not yet clear to me).

Right, well we could define it that way, and that might very well be a reasonable choice for this function, but in a general purpose API there’s no reason one would have to define it that way. Instead, someone may want to write the function similar to the way get versionstamp works now, and then we’d be returning a future that blocks.

So it’s not so much that I disagree with implementing get versionstamp that way (although there may be reasons not to, such as when a layer gets access to a transaction while it’s running and wants to know the versionstamp, but is itself never notified of the commit). It’s that I assumed someone writing a function here could choose the semantics they wanted depending on what the needs of their operation were, and if they chose to return a future that blocked until some condition was met, that would pose an interesting problem that we’d have to be sure to handle.

I’m also still not super convinced of the usefulness of being able to run random operations like these in the same request that’s also trying to do statistics collection, etc, but if we do choose to run multiple operations at the same time via a range read, then the semantics of that should be clear and reasonable. Something like bounding the time of the request and omitting results that don’t complete handles the blocking future case (as long as you’re willing to wait for the timeout potentially every time). I’m not exactly sure what the right answer is for errors, but we can probably come up with something reasonable.
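To make the "bound the time of the request and omit results that don’t complete" idea concrete, here is a minimal sketch using plain CompletableFutures. The class BoundedRangeRead and the method readAll are hypothetical names, and dropping entries whose future fails is only one placeholder choice for the open question about errors.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedRangeRead {
    /**
     * Waits for every pending value under one shared deadline and returns only
     * the entries that completed successfully in time, in key order.
     */
    static Map<String, String> readAll(Map<String, CompletableFuture<String>> pending,
                                       long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        Map<String, String> results = new TreeMap<>();
        for (Map.Entry<String, CompletableFuture<String>> e : new TreeMap<>(pending).entrySet()) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                results.put(e.getKey(), e.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException ex) {
                // Assumption: entries that block past the deadline or fail are omitted.
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, CompletableFuture<String>> pending = new TreeMap<>();
        pending.put("status", CompletableFuture.completedFuture("ok"));
        pending.put("versionstamp", new CompletableFuture<>()); // blocks until commit, never completes here
        System.out.println(readAll(pending, 100).keySet());
    }
}
```

As noted above, the cost of this approach is that a scan covering a blocking entry waits for the full timeout every time.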

And actually, that makes me think of another operation here which could be useful and has been requested at times, which is a future that returns when the transaction has been committed. That’s one where I think the only way to implement it would be by returning a blocking future. As described above, though, if we allowed scanning across such an operation we could have it be empty in the range scan (or maybe an error, depending on how errors work).

This is imho what watches are for. Getting futures in this special keyspace is currently not possible, but I think it would be reasonable to allow for this.

Ah I see - and I think with the current framework this would be possible. Though we could define it as invalid, in which case it would be a bug if someone did that.

As in watching the special key? That’s an interesting idea, seems like it could work.

I was thinking about this a little more, and the fact that watches can’t return any value unfortunately might limit the usefulness of this approach. It means that for any operation where we did want to return a value that depends on the transaction it is called on (such as with get versionstamp), we would have to also stash a copy of the transaction with the watch future in order to call the operation again. However, in some contexts this is not actually something that can reasonably be done. A good example is with the Java retry loop:

CompletableFuture<Void> ready = db.run(tr -> {
    // Use transaction to do something
    CompletableFuture<Void> versionstampReady = tr.watch(...);
    return versionstampReady;
});

In the above code, if we did save off the transaction that was handed to us in this loop, it gets automatically disposed when run finishes, which makes it subsequently unusable.

Now in the above example, one might be able to avoid using this construction and instead manually write the retry loop to defer disposing until later, but this is less feasible if what’s being run inside the transaction is some other layer code that wants to know the versionstamp. Ideally what you’d want in this case is that the layer can do what it needs to do without any special coordination with the caller. To make this work, though, you’ll need agreement between both in order to dispose the transaction, and that’s going to make writing and using the layer pretty awkward.

I didn’t quite get this, and I’m confused about why layer code cannot get the versionstamp at its own layer. Do you happen to have an example to further elaborate on this? Thanks!

Yeah, the problem I’m describing is that our only signal just tells us that the versionstamp is ready, but to read it we have to reuse the same transaction after it has committed, even though we may not control the lifetime of it. So for example (in sort-of Java), if we have a transaction that calls into a layer to insert some data:

db.run(tr -> {
    layer.insertSomething(tr, ...).join();
    return null;
});

And that layer wants to use versionstamps in its keys and then cache the versionstamp:

void insertSomething(Transaction tr, ...) {
    insertKeyWithVersionstamp(tr, ...);
    CompletableFuture<Void> versionstampReady = tr.watch("\xff\xff/versionstamp");

    // This won't complete until after the commit, and I think it is unsafe unless we control the lifetime of tr:
    // versionstampReady.thenRun(() -> cacheVersionstamp(tr.get("\xff\xff/versionstamp").join()));
}

With our current getVersionstamp function, it actually returns the versionstamp in the future, and so can safely do this without needing the transaction:

void insertSomething(Transaction tr, ...) {
    insertKeyWithVersionstamp(tr, ...);
    // getVersionstamp takes no key; the future itself carries the versionstamp bytes.
    CompletableFuture<byte[]> versionstamp = tr.getVersionstamp();

    versionstamp.thenAccept((stamp) -> cacheVersionstamp(stamp));
}

There may be another way to formulate this operation that would more closely match the current getVersionstamp function, but if not then this illustrates something that may be hard to implement in the proposed scheme. And maybe this doesn’t fit into what the special key space is intended to provide, which would be ok. In that case, we should try to understand what we intend to support and not.

If we ignore for a second that this feature already exists (and so some code already depends on this), I don’t think this is generally a very important feature. A layer can easily build abstractions around it.

Now I am not suggesting we should remove the existing interface, and I am also not suggesting we should never add anything to the C API. IMHO there are even some cases where having both a C function and a special key would be useful.

But to be a bit more formal, I propose that the following should be in this special key space:

  1. Information about the status of the system - \xff\xff/status/json acts as precedent here (and it just makes sense). In the future I would also like to give clients access to some internal status (like read/write hot ranges, data distribution metrics etc). This is important for us as we do a lot of client side throttling.
  2. Status about the client and about the transaction. Imho it makes sense to let a client for example access the read and write conflict sets through this. We also added the conflict sets there.
  3. Configuration. IMHO for writing there should be a knob to guard against accidents. Configuration changes are done through transactions in fdb anyways, so a transaction object is imo the right place to add this. For reading one could argue this falls into the category above.

Everything else is probably better served through an API. There are some things (like network options) which could be interesting candidates but are a bit awkward to use like this.

In order to not block work on this, I would propose the following:

  1. For now we version this keyspace as described in my original post. I think this makes sense as it is the most conservative way of versioning and therefore it will be possible to change in the future.
  2. Scanning across modules is a bit problematic as we didn’t define what module means (and everyone had some intuitive understanding). So I would propose the following (conservative) compromise:

For any query over a range [a, b), it has to be true that:
(i) \xff\xff is either a prefix of both a and b, or it is a prefix of neither.
(ii) If \xff\xff is in the range, there is a string x for which [a, b) is completely contained in the range ["\xff\xff/x/", "\xff\xff/x0").

So (i) basically says that you can’t query across normal and special key space and (ii) defines that your query has to be within a directory.
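The two rules can be sketched as a small range check. This is only an illustration under stated assumptions: x is taken to be the first path component of a, an empty x is treated as invalid, and the names SpecialKeyRangeCheck, isAllowed, and key are hypothetical rather than part of any FDB API. Since '0' is the byte right after '/', the range ["\xff\xff/x/", "\xff\xff/x0") is exactly the set of keys with prefix "\xff\xff/x/", so prefix checks are enough.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SpecialKeyRangeCheck {
    private static final byte[] SPECIAL = {(byte) 0xff, (byte) 0xff};

    static boolean startsWith(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }

    /** Returns true if the range [a, b) satisfies rules (i) and (ii). */
    static boolean isAllowed(byte[] a, byte[] b) {
        boolean aSpecial = startsWith(a, SPECIAL);
        boolean bSpecial = startsWith(b, SPECIAL);
        if (aSpecial != bSpecial) return false; // rule (i), applied literally
        if (!aSpecial) return true;             // range is entirely outside \xff\xff

        // Rule (ii): a must look like "\xff\xff/x/..." for some non-empty x.
        if (a.length < 3 || a[2] != '/') return false;
        int end = 3;
        while (end < a.length && a[end] != '/') end++;
        if (end == 3 || end == a.length) return false; // empty x, or no closing '/'

        byte[] lower = Arrays.copyOf(a, end + 1); // "\xff\xff/x/"
        byte[] upper = lower.clone();
        upper[end] = '0';                         // "\xff\xff/x0"
        // b <= upper holds when b is inside the directory or is exactly its end key.
        return startsWith(b, lower) || Arrays.equals(b, upper);
    }

    /** Helper: maps each char to one byte, so "\u00ff" becomes 0xff. */
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(key("\u00ff\u00ff/status/"), key("\u00ff\u00ff/status0")));
        System.out.println(isAllowed(key("\u00ff\u00ff/status/a"), key("\u00ff\u00ff/transaction/b")));
    }
}
```

Picking the first path component as x is the most permissive choice, because any longer x would only describe a smaller directory inside it.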

Can we agree on this for now?

There was some interesting discussion about this at Implementing VersionStamps in bindings - #7 by KrzysFR, and from that I gathered that the ability to access the versionstamp from a layer like this was something people cared about. It’s also worth noting that even the current get_versionstamp approach was causing some ergonomics issues there. Certainly it’s the case that we could build abstractions that would support being able to access this data, but if we decide to do something like this in the future I think it’s important to consider what these abstractions will actually look like and whether they are going to be unpleasant for the people writing and using them.

These all sound good to me. Out of curiosity, how do the conflict sets work? Do you need to read this space after the commit to get an answer? I’m ok with that approach, btw, at least until someone tells us it doesn’t work for them.

Sounds good.

I think this sounds fine, though if we intend to restrict what’s in the space to the things in the list you described above and we aren’t blocking anything until post-commit, etc, then I feel less strongly that (ii) (or some alternative) is needed. It might be good to have if we expect our definition of what’s in here to change such that we might have to add this restriction later, though.

The current implementation is that all conflicting-keys information is saved in TransactionInfo in the Transaction object.
Every time the transaction’s commit fails with conflicts, we update the info there.
In particular, you do not need to read it after commits. You will get an empty result from getRange if no conflicts have happened.

So if I understand correctly, you do need to read it after the commit in order to get a non-empty answer? I’m mainly trying to distinguish this from the get_versionstamp model that allows (well, requires) that you call it before commit to get an answer after commit. I’m personally ok with this approach, just want to make sure I understand.

Yes. And thanks for the clarification.