Are all of the transactions hitting the same subspace? I’m asking because if there is effectively one giant change queue, you can run into issues where all of the writes hit one team of storage servers, then data distribution kicks in and routes all writes to another team, but then that team gets hot, and so on. Similarly, if all of the readers are hitting the same keyspace, you might have exactly the same hot-shard problem, just at read time. This will limit the theoretical QPS, but if you are writing to a bunch of different queues, then that might be fine.
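As a rough sketch of the multi-queue idea (the subspace layout, bucket count, and function names here are made up for illustration, using the Python bindings):

```python
import random
import uuid

import fdb

fdb.api_version(630)
db = fdb.open()

NUM_QUEUES = 16  # hypothetical bucket count; tune for your write rate

@fdb.transactional
def enqueue(tr, payload):
    # Picking a random bucket spreads consecutive writes across several
    # shards (and thus several storage-server teams) instead of always
    # appending to the tail of one giant queue.
    q = random.randrange(NUM_QUEUES)
    tr[fdb.tuple.pack(('queue', q, uuid.uuid4().bytes))] = payload
```

The trade-off is on the read side: consumers now have to tail all NUM_QUEUES subspaces instead of one.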
I’d also say that the most compelling reason (in my mind) to do this would be that you have some dataset where you want to read the latest changes and also run other queries against it (or you really need it durable for some reason). This approach gets you transactional updates between your change feed and your database, which can be nice, but if you never otherwise use the database, then maybe some other queue solution would be better.
One issue you can run into using timestamps as the mechanism for grabbing changes is that you can sometimes miss data. Even ignoring clock drift, you can hit a sequence like this:
- Some DB-reader has done a scan and has read everything through timestamp t0.
- At timestamp t1, one transaction begins.
- At timestamp t2, another transaction begins.
- The second transaction commits at timestamp t3, but it writes its data under timestamp t2 (because when it writes, it can’t know at what time it will commit).
- The DB-reader performs another scan, starting at t0. It ends its scan at t2 because that’s the largest timestamp it finds written.
- The transaction that began at t1 commits, writing its data under timestamp t1, below the data from the transaction that already committed with timestamp t2.
The reader now thinks it has everything through t2, so it will never go back and grab the data from the t1 transaction.
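To make that concrete, here’s roughly what the problematic pattern looks like in the Python bindings (the 'feed' subspace and function names are hypothetical):

```python
import time
import uuid

import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def write_event(tr, payload):
    # Keys are ordered by the wall-clock time observed *inside* the
    # transaction, which can be well before the commit actually lands.
    tr[fdb.tuple.pack(('feed', time.time(), uuid.uuid4().bytes))] = payload

@fdb.transactional
def scan_since(tr, last_ts):
    # A transaction that read its clock before last_ts can still be in
    # flight; when it commits, its key sorts *below* last_ts, and this
    # scan will never go back for it.
    begin = fdb.tuple.pack(('feed', last_ts))
    end = fdb.tuple.range(('feed',)).stop
    return list(tr.get_range(begin, end))
```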
One way around that is to use versionstamp mutations (i.e., SET_VERSIONSTAMPED_KEY and SET_VERSIONSTAMPED_VALUE) to ensure that mutations get written in commit order instead of timestamp order. This closes the gap because later-committing transactions always get written after earlier-committing ones.
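Here’s a minimal sketch of the write side in the Python bindings (the 'feed' subspace is again just for illustration). The key carries an incomplete versionstamp, and the cluster fills it in at commit time, so keys sort in commit order by construction:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def log_change(tr, payload):
    # The incomplete Versionstamp is a placeholder; at commit time the
    # cluster replaces it with the 10-byte commit version plus the
    # transaction's order within its commit batch.
    key = fdb.tuple.pack_with_versionstamp(
        ('feed', fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(key, payload)
```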
We don’t have a lot of documentation on how you would do this, but see this discussion about change feeds: “Changefeeds (watching and getting updates on ranges of keys)”. See also this discussion about how versionstamps don’t necessarily play nicely with restoring from backup, and how you might fix that: “VersionStamp uniqueness and monotonicity”.
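The read side then just resumes strictly after the last key it processed. Because versionstamped keys sort by commit version, no transaction can later slip a key in below that point (modulo the backup/restore caveat in the second thread). A sketch, with the same hypothetical 'feed' subspace:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def read_changes(tr, last_key, limit=100):
    # Resume strictly after the last processed key; nothing can commit
    # below this point later, so the scan never leaves a gap behind it.
    feed = fdb.tuple.range(('feed',))
    begin = (fdb.KeySelector.first_greater_than(last_key)
             if last_key is not None else feed.start)
    return [(k, v) for k, v in tr.get_range(begin, feed.stop, limit=limit)]
```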
As for having cold data lying around, I don’t think it will affect your response times too much. The underlying storage mechanism is a B-tree, which means the number of disk pages a read might touch is roughly the depth of the tree (plus any overflow pages), and that depth grows only slowly with the size of the dataset. The storage servers also keep an in-memory cache of recently read and written pages, so if you are only reading and writing 1% of your dataset, presumably most of that working set stays in cache. A larger dataset will also mean more shards (as a dataset grows, we increase both the number of shards and the size of each shard, so the shard count isn’t linear in the dataset size). Having more shards can lead to a few performance penalties, though I don’t think it’s too bad. I’ll admit that this isn’t my area of expertise, so if someone else knows the trade-offs of a database with a large number of shards, that would be nice to know.
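As a back-of-envelope check on the depth point (the fanout here is a made-up round number, not the storage engine’s actual branching factor):

```python
import math

# Pages touched by a point read ~ tree depth = ceil(log_fanout(keys)).
fanout = 100  # hypothetical branching factor
for n in (10**6, 10**8, 10**10):
    print(f"{n:>14,} keys -> depth ~{math.ceil(math.log(n, fanout))}")
# Growing from a million keys to ten billion only adds a couple levels.
```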
Also, I’d guess that the problems with a dataset like the one you’ve described come mainly from its size rather than from the coldness of some of its data. Because of the caching described above, the coldness of most of your data probably won’t actually matter too much.