Way to rename key

This is a discussion moved from GitHub: https://github.com/apple/foundationdb/issues/886

In summary, the question is: how to rename a large set of keys with small performance impact.

The most recent question is:

The case we are using FDB is like this:
We have a transaction (not the transaction in FDB), which may write large amount of k/v into FDB. Instead of writing them once, we write them in batch and commit at last. Before last commit, assume this transaction has a txn_id, the key may be named as xxx_txn_id. When committing, we get a commit time T, then we hope to change the key name from xxx_txn_id to xxx_T.
Do you have any suggestion on this case?

I don’t know there exist a way to do the transformation efficiently.

If you don’t have to put the time T as the suffix of the key, you can let the client to get xxx and T in the same transaction and concatenate them together. It can be solved as follows: Use a customized key for the non-FDB “transaction”. The key’s value is either the txn_id or time T. When the non-FDB transaction finishes, you change change the key’s value. (You need to use one bit in the value to distinguish the txn_id or time T.

I guess the case you described is to solve a higher-level problem. There may exist better solutions to that problem than the problem you describe. Would you mind sharing it?

1 Like

I’m not certain exactly what you’re trying to do, is the following a correct interpretation?

You want to write some data to FoundationDB with a suffix on each of your keys. These writes are done across multiple FDB transactions, and initially, you do all of your writes with the suffix _txn_id. During or after the last committed transaction, you decide on a new suffix _T that you want to apply to all of the previous writes from this batch of transactions.

If that is correct, then I think you have the following options:

  1. Wait until _T is known to do any writes to FDB, if your use-case allows.
  2. Use indirection to map your ID to the key suffix (or a prefix). For example, you could use the directory layer to create a directory /txn_id to write your keys into and then rename the directory to /T when you know its value. This works because the keys are stored in some generated prefix, while the directory names map to the prefix. Renaming the directory only updates the map while leaving the data intact. You needn’t use the directory layer for this, though, if you have some other indirection scheme you want to use.

I’m not sure if there are any other good general alternatives, though depending on the details of what you are trying to do, there may be other more specific options that could solve your problem.

1 Like

Thanks @mengxu and @ajbeamon.

We are implementing a mvcc with FDB. The case is similar to @ajbeamon 's interpretation.
We store the the data into FDB and the key format is “blockname_timestamp”. When reading, as FDB sort by key, we first use blockname to do the filter and get the k/v starting with blockname from FDB, then we use the timestamp to get the specific version we want (like the latest version).

The option 1 mentioned above is fine, we buffer the k/v and commit them with the commit time at last.
But when the buffer is full, we may still need to do the split write, then we need to use some temp timestamp like the txn_id. As @mengxu suggested, I think one way to avoid the rename is we can put the commit time into the value. But one concern is , for the same “blockname”, we need to further search for the specific version we want. I am thinking if the commit time is embed as the suffix the key (blockname_timestamp), then the key/value get from FDB is already the orders we expect (order by blockname first, for the same blockname, order by commit time), thus the search for the specific version is faster. Please correct me if my understanding is wrong.

For the option 2, if we still want to first prune by blockname, is it possible?

Based on the additional details you’ve provided I don’t think option 2 as @ajbeamon described it would work for you because you would not be able to search for blockname globally with one read. You would have to search each /T directory for blockname separately, as the keys and values for each externally defined transaction would be located in its own directory (initially with a temporary directory name and then being renamed to the final name).

1 Like

An indirection could still be possible, it just probably wouldn’t use the directory layer since it is prefix based. For example, you might be able to create an index that maps (blockname_)timestamp to a transaction ID. First, you’d read the index to get the transaction IDs you want to read, then you’d read from the keyspace where you stored the data by transaction ID. If you want to scan across multiple versions at once, this would work best if you can have a transaction ID that is ordered, but if not it probably doesn’t matter.

@ajbeamon

The transaction ID is ordered, the commit time is also ordered. The transaction id is known when it starts but the commit is only known till commit. We make the min value of transaction is larger than the max value of commit time so we can control the invisibility of uncommitted blocks.

The reading is give a time T, then find related blockname with the latest commit time which is less than T. We actually don’t use transaction id, just treat them as temp id for differentiate keys with different blockname.

For any given block_name, how many unique transaction ids do you expect will exist at any given time? Are they eventually cleaned up?

If there is some cleanup so the MVCC scheme is only mean to cover a rolling window of time, then I suggest this key-value schema, described here in python-ish tuple form:

("blocks", <block_name>, <transaction_id>) => <value>
("commits", <transaction_id>) => <commit_time>

To find the latest version of a block as of a time, you first range scan the “blocks” space to find all transaction IDs for the block name, and then look up each of those in parallel in the commits space to see which of those were committed and when then were committed, and then you will know which value you should.

Transactions that are not yet committed will not have a “commits” space entry, so those transaction ids will be ignored, so it is fine to use many different FDB transactions to write this data in chunks.

A big downside of course is that to look up a block name you must read every version of the block name record for every transaction in the MVCC window. If this makes this solution insufficient, let me know, I may have some other ideas.

Thanks @SteavedHams.

Yes, for different versions of given block_name, they will eventually been cleaned up, the number of versions at a given time depends on the what they are been accessed by other active transactions. If no active transactions, only the latest version will be kept, the old ones will be removed by garbage collector.

For your given solution, just one question, why not directly put the commit_time into the value? Left it null or INFINITY before commit and set it after commit.

For the downside, yes, I agree with you all versions of any glock name will be fetched when reading.
For reading, we usually do a range query. For example, If we have 100 blocks in FDB, they are from block001 to block100, the query is like find blocks with id > 50, and commit before time T_n. Ideally, if all things can be pushdown, of course, only the latest version (less than T_n) of blocks 50-100 are fetched from FDB. If it is like this, that will be great. To my understanding, even we can commit all blocks to FDB once with the commit time, if the key format is blockid_commitTime, we still need to fetch all versions for the related blocks, as the FDB should use prefix to do the query, right? All keys larger than string block50 will be read out of FDB.

FDB does not have predicate pushdown, so you will need to either issue a parallel range scan for each block number, or, if you do not know what block numbers actually exist in a large possible range, you could maintain another key space that is essentially a unique index on block names that exist so you can scan that space first with one range scan to discover the unique block names to use for the parallel range scans.

For your given solution, just one question, why not directly put the commit_time into the value? Left it null or INFINITY before commit and set it after commit.

I was assuming that you would want to write block name records from a transaction before the commit time of the transaction is known, as I thought was the case in your original question. This indirection would allow you to do that.

However, if you don’t mind the cost of going back to write each KV pair a second time, then placing the commit timestamp in the value is not what I would recommend.

I suggest the following, which also would help mitigate the issue of scanning all versions of a block with the MVCC window. The schema is:

("blocks", <block_name>, "transactions", <transaction_id> => <value>
("blocks", <block_name>, "timestamps", <commit_timestamp> => <value>
("commits", <transaction_id> => <commit_timestamp>
("transactions", <transaction_id>, <block_name>) => null

So initially, you have block updates and you have a transaction_id but no commit_timestamp.
For each block update, you set

("blocks", <block_name>, "transactions", <transaction_id>) = <value>
("transactions", <transaction_id>, <block_name>) = null

Each FDB transaction you commit can cover 1 or block updates. Once all of the block updates are committed successfully, then set

("commits", <transaction_id>) => <commit_timestamp>

And the write logic ends here. Then, one or more instances of a background cleanup job can then scan the “commits” keyspace to find committed transactions, and for each one of those scan the “transactions” keyspace to find the blocks updated in that specific transaction. For each of those block updates, do this:

  • read ("blocks", <block_name>, "transactions", <transaction_id>)
  • clear ("blocks", <block_name>, "transactions", <transaction_id>)
  • set ("blocks", <block_name>, "timestamps", <commit_timestamp> => <value> using the value read above and the commit timestamp obtained for the transaction earlier
  • clear ("transactions", <transaction_id>, <block_name>)

These changes must be done transactionally for a specific block update, but you can use many FDB transactions, each covering 1 or more updates.

Once all of the block updates have been “converted” from the “transactions” space to the “timestamps” space under the <block_name>'s key space, then clear the ("commits", <transaction_id>) key.

With all that done, to look up the value of a block at a given timestamp, do the following:

  • Using a reverse order getRange() with a limit of 1, directly read exactly the block version you want from ("blocks", <block_name>, "timestamps")
  • Using a range read, read ALL of the versions of the block from ("blocks", <block_name>, "transactions", and look up the commit timestamps as I explained earlier.
  • Return the appropriate newest version at or under your target

The idea here is while you’re still doing 2 writes per block update, the second write (and a few other mutations) is improving the efficiency of reads by moving block update records into a timestamp-ordered keyspace containing only committed updates.

By using a background cleanup job whose work queue is stored in the database, you don’t have to worry about a single process dying in the middle of retracing its block updates to put the commit timestamp somewhere in the database for each block update.

1 Like

@SteavedHams Very detailed reply, thanks a lot.

Do the 2 writes here mean the the first write on “transactions” space and the second write on “timestamps” space?

Is the background job here inside FDB? Even it is background, the application should be aware of the finish of background job, right? Otherwise, how does the application know the block has been already moved to “timestamps” space?

One more question here, the reverse order read and the range read, are these 2 reads trying to combine all versions located in both “timestamps” and “transactions” spaces?
As all have been done here (including background job), all versions for any block should be already moved to “timestamps” space, why still need to the range read from “transactions” space?

Yes, I assume (perhaps incorrectly) that the values in the block update “transactions” and “timestamps” spaces will be on the larger side, but technically this data model does have 3 writes and 1 clear total for each block update event. (And in some sense number of edits affects throughput moreso than value bytes that are just along for the ride.)

Not in FDB, no, this would just be another loop of code you have as part of your app that you would want to keep running all the time, with at least once instance (there are ways to run multiple instances and minimize their conflicts). Remember that the application will always see a consistent view of the entire keyspace within each transaction, so when the app does the lookup logic for a single block_name it will “see” all updates on that block within the MVCC window in the two keyspaces, “transactions” and “timestamps”, but it will only have to read one key from “timestamps” while it has to read all of the keys from “transactions”. As cleanup occurs, reads get faster.

Yes that’s exactly what it’s doing, and my assumption here is that any given moment you have a bunch of blocks being updated, a much larger number of blocks that have not been updated in a while, and you have reads going on that want to read blocks at a recent timestamp. This schema will allow reads to see very recent changes with some “read amplication” (reading some things you don’t really need) but for blocks that have not changed recently the cleanup will have coalesced all of their updates into the timestamp space where the correct versioned answer can be read with a getRange() and the scan of the transactions space for the block would yield nothing so there would be no secondary lookups for commit timestamps and no combining of the two types of answers.

1 Like

Thanks a lot @SteavedHams. Understand now, I will try it.

For the mapping from transaction id to commit time, I think we can maintain it in application, then we can save 1 write.

This could work, but it’s worth noting that the “commits” space is also serving as essentially the work queue for transactions that have not yet been post-processed, so the cleanup process becomes trickier without these records.

@SteavedHams One question here, because the reading usually have a bunch of blocks to read (E.g., find blocks with block id > 50 for blocks block001 to block 100), is it still possible to use getRange() with limit of 1 to get exactly the specific version of a bunch of blocks we want from ("blocks", <block_name>, "timestamps") ?

No, each read is going to give you the right version of 1 block. You just issue a bunch of parallel reads. If your block numbers are dense, so know that all or nearly all blocks between 50 and 100 are present, then it would be acceptable to just issue reads for all of them. If the block numbers are sparse, then you will need a sort of unique index of block names as I mentioned previously.

Oh, I see it.

From my understanding, for the key format ("blocks", <block_name>, "transactions", <transaction_id> => <value>, the key should use the block_name as prefix, am I right? If like this, no matter the block numbers are dense or sparse, we can always use the block range like lower bound: block_50, up bound: block_101 to only get the blocks we want. However, it contains all versions for these blocks (block 50 to 100).

If we have a unique index of block names to get the exact block we want and do the parallel scan, we can use get range limit 1 to get the corresponding version we want for each block.

Not quite, but that is a clever idea and it will work with a simple schema change.

All keys in FDB are ordered, so the schema as I have described it would cause a range scan from ("blocks", "block_50") to ("blocks", "block_100") to return all of both the "transactions" and "timestamps" subspaces for all blocks in that range.

However, you could change the tuple ordering to this instead:

("block_transactions", <block_name>, <transaction_id> => <value>
("block_timestamps", <block_name>, <commit_timestamp> => <value>

So now the answer is yes, you can do a single range scan of from

("block_transactions", "block_050") to ("block_transactions", "block_100")

to get all of the pre-coalesced block updates for any of the targeted blocks for all transactions all at once which is a very nice optimization.

Here’s an additional optimization: Instead of keeping a separate unique block name index that you must scan, you could place an entry in "block_transactions" for each block name using a dummy transaction id such as 0, and simply never delete this record (unless the block is removed from the database). These placeholder records would act as the unique block name index, so you would also discover all of your block names with that same single range read described above. However there is a downside here in latency: Since the index scan will be intermixed with potentially a much larger number of records, it will take more time before your application can launch all of the parallel "block_timestamps" reverse getRange() operations.

1 Like

Thanks @SteavedHams. Your idea helps me a lot.