We just finished implementing a Log Layer that compacts large blocks of logs after a while (to benefit from a better compression ratio), and this is definitely non-trivial to do! (It took 3 full days of work, from scratch.)
Our Log Layer supports efficient insertion of small batches of logs into an “incomplete” arena, and when a sufficient quantity of logs has accumulated, compacts a large batch of them into a single chunk that goes into a “complete” arena, taking advantage of a better compression ratio. Also, “complete” chunks are immutable and can be cached for as long as readers want (while incomplete chunks should not be cached).
So something like:
```
(..., 'counter') = (total_log_count, incomplete_log_count)
(..., 'complete', ID0000) = compressed[chunk[1000 logs]]
(..., 'complete', ID1000) = compressed[chunk[1000 logs]]
(..., 'complete', ....)   // <-- new compressed chunks are appended here
(..., 'incomplete', ID2000) = chunk[2 logs]
(..., 'incomplete', ID2002) = chunk[10 logs]
(..., 'incomplete', ID2012) = chunk[1 log]
(..., 'incomplete', ID2013) = chunk[5 logs]
(..., 'incomplete', ....)   // <-- new logs are appended here
```
Of course, ‘counter’, ‘complete’, ‘incomplete’, etc. are not stored as strings but as small integers, to reduce key size!
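For illustration only, here is a minimal sketch of this layout using the FoundationDB Python binding's Subspace and Tuple layers (this is not our actual code; `log_space` and the constant names are made up for the example):

```python
import fdb

fdb.api_version(630)  # should match the API version of the installed client

# Hypothetical subspace for one log stream (the real layer would get its prefix
# from the Directory Layer or some other allocation scheme).
log_space = fdb.Subspace(('logs', 'my_log'))

# 'counter', 'complete' and 'incomplete' are encoded as small integers, not strings.
COUNTER, COMPLETE, INCOMPLETE = 0, 1, 2

counter_key = log_space[COUNTER].key()   # (..., 0) = (total_log_count, incomplete_log_count)
complete    = log_space[COMPLETE]        # (..., 1, ID) = compressed[chunk[N logs]]
incomplete  = log_space[INCOMPLETE]      # (..., 2, ID) = chunk[N logs]

# Example: the key of the compressed chunk whose first log is #1000
chunk_1000_key = complete.pack((1000,))
```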
The counter key is used to decide whether or not adding X more logs to the incomplete arena would go over the self-imposed limit of 1000 logs, in which case all the partial chunks are compacted into a single (or more!) compressed chunk. It will also be used by subscribers to create watches and be notified when something happens on this log.
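As a rough sketch of that decision (same illustrative Python setup and made-up names as above; `compact_incomplete_arena` is a hypothetical helper standing in for the packing logic discussed below, and each log entry is assumed to be a small bytes/string payload):

```python
import fdb
import fdb.tuple

fdb.api_version(630)

log_space = fdb.Subspace(('logs', 'my_log'))
COUNTER, COMPLETE, INCOMPLETE = 0, 1, 2
LIMIT = 1000  # self-imposed limit on the number of logs sitting in the incomplete arena

@fdb.transactional
def append_logs(tr, new_logs):
    # Read (total_log_count, incomplete_log_count) from the counter key.
    raw = tr[log_space[COUNTER].key()]
    total, pending = fdb.tuple.unpack(raw) if raw.present() else (0, 0)

    if pending + len(new_logs) > LIMIT:
        # Adding these logs would go over the limit: compact all the partial chunks
        # (plus the new logs) into one or more compressed chunks in the complete arena.
        compact_incomplete_arena(tr, new_logs)   # hypothetical helper, see below
        pending = 0
    else:
        # Cheap path: append the new batch as one small chunk, keyed by its first log ID.
        tr[log_space[INCOMPLETE].pack((total,))] = fdb.tuple.pack(tuple(new_logs))
        pending += len(new_logs)

    # Update the counter; subscribers can tr.watch() this key to be notified of changes.
    tr[log_space[COUNTER].key()] = fdb.tuple.pack((total + len(new_logs), pending))
```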
Readers can exploit the fact that compressed chunks have the same ID as the first log they contain: if they want to read log #1234 to log #1304 (which are compressed), then GetRange(LastLessOrEqual(ID1234), FirstGreaterThan(ID1304)) will return only the single chunk ID1000, which contains all the rows they need.
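With the Python binding, that read could be sketched with key selectors like this (names and setup are illustrative only, as above):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

log_space = fdb.Subspace(('logs', 'my_log'))
complete = log_space[1]  # the 'complete' arena, keyed by the small integer 1

@fdb.transactional
def read_compressed_chunks(tr, first_id, last_id):
    # Chunks are keyed by the ID of their first log, so the chunk containing
    # first_id is at the last key <= (..., 'complete', first_id).
    begin = fdb.KeySelector.last_less_or_equal(complete.pack((first_id,)))
    end = fdb.KeySelector.first_greater_than(complete.pack((last_id,)))
    return [(complete.unpack(k)[0], v) for k, v in tr.get_range(begin, end)]

# Reading logs #1234 to #1304 returns only chunk ID1000, which covers both.
chunks = read_compressed_chunks(db, 1234, 1304)
```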
On paper, this looks simple to implement, but in practice a few constraints get in the way, especially for the writer:
- Maximum Value Size of 100,000 bytes, so a compressed[chunk[...]] cannot take more than ~100KB, which is very difficult to target when compressing
- the trick of splitting a chunk into multiple keys with a ,0, ,1, … suffix is not possible here, because it breaks the invariant of 1 chunk = 1 key that is exploited to do efficient range reads of logs (using key selectors would not be possible if we don’t know how many ‘parts’ a chunk is split into, and it would require additional reads, or reading a lot more data than needed)
- Maximum Transaction Size of 10,000,000 bytes, which means that bulk insertion of existing log files is also an issue (we need to reset the transaction every N batches, and the batches are unpredictable in size)
- the issue is: you do some work to pack a chunk, check the final size, discover that it would go over the per-transaction budget, and then have to throw everything away, roll the state back to what it was before, commit, reset, etc.
- We chose to make the writer responsible for adding new logs and compacting the data in a single transaction, so that we don’t need any other background thread (that could fail, slow down, etc.).
The reader is less complex:
- spin two concurrent GetRange(…), one on each subspace, starting with LastLessOrEqual(ID), and merge both results into a single list of results (see the sketch below).
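A sketch of that merge, with the same illustrative Python setup (in this simplified version the two range reads are iterated one after the other, whereas the real reader issues them concurrently):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

log_space = fdb.Subspace(('logs', 'my_log'))
COMPLETE, INCOMPLETE = 1, 2

@fdb.transactional
def read_logs_from(tr, first_id):
    results = []
    for arena in (COMPLETE, INCOMPLETE):
        space = log_space[arena]
        # The chunk that contains first_id starts at the last key <= (..., arena, first_id).
        begin = fdb.KeySelector.last_less_or_equal(space.pack((first_id,)))
        for k, v in tr.get_range(begin, space.range().stop):
            if space.contains(k):  # the selector may resolve to a key just before this subspace
                results.append((space.unpack(k)[0], v))
    # Complete (compressed) chunks come first, then incomplete chunks; IDs keep increasing.
    return results
```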
In the end, this is doable, and it works well, BUT it is somewhat complicated, with two nested loops:
- an outer loop that spins new transactions and does “up to X MB of writes” per transaction, to stay under the 10 MB budget
- an inner loop that has to decide how many logs can fit into a chunk that, after compression, will not exceed 100 KB
- inside the inner loop, some complex code that must decide whether to pack or not, take into account leftovers from previous incomplete appends, etc. (see the sketch below)
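To give an idea of the shape of those two loops, here is a heavily stripped-down sketch (illustrative Python again; `zlib` stands in for whatever compression is actually used, each log entry is assumed to be a small bytes payload, and retries, the incomplete arena, leftovers and the counter update are all left out):

```python
import zlib
import fdb

fdb.api_version(630)

log_space = fdb.Subspace(('logs', 'my_log'))
complete = log_space[1]

MAX_VALUE_SIZE = 100_000       # FoundationDB limit on a single value (~100 KB)
TX_WRITE_BUDGET = 9_000_000    # stay safely under the 10,000,000-byte transaction limit

def bulk_insert(db, logs, first_id):
    pending, next_id = list(logs), first_id
    while pending:
        # Outer loop: one transaction per iteration, doing "up to X MB" of writes.
        tr, written = db.create_transaction(), 0
        while pending and written < TX_WRITE_BUDGET:
            # Inner loop: grow the chunk until its *compressed* size would exceed ~100 KB.
            # (Naively recompressing at every step; the real code has to estimate and
            # adjust, which is exactly what makes the 100 KB target hard to hit.)
            chunk = []
            while pending and len(zlib.compress(b"".join(chunk + [pending[0]]))) <= MAX_VALUE_SIZE:
                chunk.append(pending.pop(0))
            if not chunk:
                raise ValueError("a single log entry does not fit in one value")
            payload = zlib.compress(b"".join(chunk))
            tr[complete.pack((next_id,))] = payload
            next_id += len(chunk)
            written += len(payload)
        tr.commit().wait()  # retry logic omitted from this sketch
```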
I was initially planning to use VersionStamps for the IDs, but since the writer has to decide whether to pack or not, it needs to know how many incomplete logs are already there; and since it needs to read at least one key anyway, it can just as well read the counter and create nicer-looking sequential integers.
I could have written a non-deterministic writer that only attempts to pack with some probability (1 in 10?), because I don’t care if I have a little more than 1000 unpacked logs. BUT there may be a subtle issue with this:
- if the writer takes the non-deterministic path of packing the logs, and there is some kind of failure (a bug in the code, the more complex query, timeouts, etc.), it will retry
- it will then most likely take the non-packing code path, which will not fail.
After a while, all writers would end their retry chain with a non-packing result, and I would end up with a huge ‘incomplete’ arena that is far from optimal (especially if the rest of the code assumes it will only hold ~1000 logs and uses some O(N) algorithm on it).
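To make that retry bias concrete: if the coin flip lives inside the retry loop, a failure on the packing path is almost always “resolved” by a retry that takes the cheap path (a contrived sketch; `append_and_pack` and `append_only` are hypothetical stand-ins for the two code paths):

```python
import random
import fdb

fdb.api_version(630)

def append_with_probabilistic_packing(db, new_logs):
    tr = db.create_transaction()
    while True:  # manual retry loop, same shape as the one the bindings generate
        try:
            if random.random() < 0.1:
                # 1 in 10: take the packing path. If this heavier path hits a bug,
                # a timeout or a conflict, the body or commit() raises...
                append_and_pack(tr, new_logs)   # hypothetical heavier code path
            else:
                append_only(tr, new_logs)       # hypothetical cheap code path
            tr.commit().wait()
            return
        except fdb.FDBError as e:
            # ...the transaction is reset and the coin is flipped again: with
            # probability 0.9 the retry takes the cheap path and succeeds, so
            # failures systematically end up as "no packing" and the incomplete
            # arena can grow far beyond ~1000 logs.
            tr.on_error(e).wait()
```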