Reviewing some code under a profiler, I keep seeing the same issue pop up more and more frequently: most of the CPU time (in app or layer code) is spent generating binary keys, and especially allocating buffers, copying bytes from one buffer to another, resizing buffers because they were too small, etc…
The typical key generated in my apps is usually a tuple composed of a few items, appended to a directory subspace's prefix. Typical code to generate a single key looks like `subspace.Keys.Encode("Hello", 123)`, which turns into `{PARTITION_PREFIX}.{SUBSPACE_PREFIX}.{02 'H' 'e' 'l' 'l' 'o' 00}.{15 7B}`.
With some layers, the subspace prefix itself can contain strings (at least during the initial prototyping phase, I like to use string names like "Foos", "Bars", "FoosByBarId", …). You can also have a lot of large UUID-like values that are repeated multiple times (when doing index lookups under categories).
In the end, the process always looks like:
- allocate a buffer (probably too small) or take one from some pool
- write the prefix of the subspace (a couple of bytes, up to a few dozen)
- for each item in the tuple:
  - append the encoded representation of the item
  - probably, at some point, have to resize the buffer (allocate a larger one, copy, throw away the previous one)
- return a span over the buffer that has been filled
- pass that to the fdb client API
- throw away the buffer, or, if we are lucky, know that it is safe to put it back in the pool
Repeat that for potentially thousands of keys (in a typical index query that must then look up each original document), and most of the time is spent moving bytes around in memory before finally passing them to the C API, again and again.
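To make the cost concrete, here is a minimal C sketch of that grow-and-copy pattern; `key_builder` and `kb_append` are hypothetical names, not part of any real API:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t* buf;
    size_t   len;
    size_t   cap;
} key_builder;

/* Append n bytes, growing the buffer when it is too small
   (allocate a larger one, copy, throw away the previous one). */
static void kb_append(key_builder* kb, uint8_t const* src, size_t n) {
    if (kb->len + n > kb->cap) {
        size_t new_cap = kb->cap ? kb->cap * 2 : 32;
        while (new_cap < kb->len + n) new_cap *= 2;
        uint8_t* bigger = malloc(new_cap);              /* error handling omitted */
        if (kb->len) memcpy(bigger, kb->buf, kb->len);  /* the copy paid on every resize */
        free(kb->buf);
        kb->buf = bigger;
        kb->cap = new_cap;
    }
    memcpy(kb->buf + kb->len, src, n);
    kb->len += n;
}
```

Every resize pays an allocation plus a full copy of everything appended so far, and the finished key is usually copied at least once more on its way into the client.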
One way to maybe help solve this issue would be the ability to pass a list of buffers, instead of a single buffer, to most APIs that take keys. The actual key would be the concatenation of all the buffers, or chunks, that are passed.
For example, in addition to the classic `fdb_transaction_set(FDBTransaction* transaction, uint8_t const* key_name, int key_name_length, uint8_t const* value, int value_length)`, we would be able to pass a list of (pointer, length) pairs as the key:
```c
typedef struct {
    uint8_t const* start;  // pointer to the first byte of the chunk
    int length;            // number of consecutive bytes to read
} KeyChunk;

void fdb_transaction_set_chunks(
    FDBTransaction* transaction,
    KeyChunk const* key_chunks,  // pointer to an array of KeyChunk structs
    int key_chunks_length,       // length of the chunk array
    uint8_t const* value,
    int value_length
);
```
Under the hood, the C API could then concatenate all the chunks back together in its own buffer before calling into the original `fdb_transaction_set` (or into the same internal implementation).
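A naive version of that shim could look like the hypothetical sketch below (error handling and buffer pooling omitted); the point is that the one remaining copy happens inside the binding, not in every layer above it:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: flatten the chunks into one contiguous buffer,
   then delegate to the existing fdb_transaction_set(). */
void fdb_transaction_set_chunks(FDBTransaction* transaction,
                                KeyChunk const* key_chunks, int key_chunks_length,
                                uint8_t const* value, int value_length) {
    int total = 0;
    for (int i = 0; i < key_chunks_length; i++)
        total += key_chunks[i].length;

    uint8_t* key = malloc(total);  /* or a reusable per-transaction scratch buffer */
    int offset = 0;
    for (int i = 0; i < key_chunks_length; i++) {
        memcpy(key + offset, key_chunks[i].start, key_chunks[i].length);
        offset += key_chunks[i].length;
    }

    fdb_transaction_set(transaction, key, total, value, value_length);
    free(key);  /* fdb_transaction_set copies the key into its own memory */
}
```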
In the typical case of a composite key with a subspace prefix, I could simply cache the subspace prefix in one buffer, generate only the tail of the key in another buffer, and pass an array of 2 or 3 key "chunks".
When working with tuples and small integers, I could cache the encoded versions of all numbers from, say, 1 to 1000, and pass a pointer into that cache instead of copying the bytes. The complete key would then be a list of chunks, some pointing to cached pre-encoded buffers, while others point to a single buffer containing the non-cached portion of the key.
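As a sketch of what such a cache could look like, following the tuple encoding shown earlier (`15 7B` for 123); `init_small_int_cache` and `encode_small_int` are hypothetical names:

```c
#include <stdint.h>

/* Pre-encoded tuple representation of the integers 0..999:
   0x14 = zero, 0x15 + 1 byte for 1..255, 0x16 + 2 bytes for 256..999. */
static uint8_t small_int_enc[1000][3];
static int     small_int_len[1000];

static void init_small_int_cache(void) {
    small_int_enc[0][0] = 0x14;          /* integer zero */
    small_int_len[0] = 1;
    for (int i = 1; i <= 255; i++) {
        small_int_enc[i][0] = 0x15;      /* positive integer, 1 byte */
        small_int_enc[i][1] = (uint8_t)i;
        small_int_len[i] = 2;
    }
    for (int i = 256; i < 1000; i++) {
        small_int_enc[i][0] = 0x16;      /* positive integer, 2 bytes */
        small_int_enc[i][1] = (uint8_t)(i >> 8);
        small_int_enc[i][2] = (uint8_t)(i & 0xFF);
        small_int_len[i] = 3;
    }
}

/* Returns a chunk pointing into the cache (n must be in 0..999):
   no allocation, no copy. */
static KeyChunk encode_small_int(int n) {
    KeyChunk c = { small_int_enc[n], small_int_len[n] };
    return c;
}
```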
Sample layer code (in pseudo-code):
```
Span<KeyChunk> keyParts = stackalloc KeyChunk[3];  // allocated on the stack, so very cheap
keyParts[0] = subspace.GetCachedPrefixKeyChunk();  // cached and reused frequently
keyParts[1] = SomeEncoder.EncodeString("Hello");   // will probably allocate a small buffer of a few bytes
keyParts[2] = SomeEncoder.EncodeInteger(123);      // will probably use a cached buffer, because 123 is a small integer
fdb_transaction_set_chunks(
    tr.Handle,
    &keyParts[0],    // pointer to the first chunk
    keyParts.Length,
    &value[0],
    value.Length
);
```
Another use case: usually when generating a key range from a key prefix "AVeryLongPrefixABC", you need to increment the last byte to produce [ "AVeryLongPrefixABC", "AVeryLongPrefixABD" ], so I cannot reuse the same buffer for the two bounds, because the last byte is different. With key chunks, the end key could reuse the buffer of the first key (minus the last byte) plus an extra chunk that points to the single 'D' byte (and again, I can keep a cached 256-byte buffer containing all byte values from 0 to 255 and point into that). Basically, I can generate a range without doing any memory allocation.
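A minimal sketch of that trick; `ALL_BYTES` and `make_prefix_range` are hypothetical, and the simple last-byte increment assumes the prefix does not end in 0xFF (a real implementation would have to carry, like `strinc` does):

```c
#include <stdint.h>

static uint8_t ALL_BYTES[256];  /* filled once at startup: ALL_BYTES[i] == i */

static void init_all_bytes(void) {
    for (int i = 0; i < 256; i++) ALL_BYTES[i] = (uint8_t)i;
}

/* begin = the prefix itself; end = the prefix minus its last byte,
   plus that byte incremented. No key bytes are allocated or copied. */
static void make_prefix_range(uint8_t const* prefix, int len,
                              KeyChunk begin[1], KeyChunk end[2]) {
    begin[0].start  = prefix;
    begin[0].length = len;

    end[0].start  = prefix;                           /* shared with begin */
    end[0].length = len - 1;
    end[1].start  = &ALL_BYTES[prefix[len - 1] + 1];  /* the incremented last byte */
    end[1].length = 1;
}
```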
Any opinions on this?