Adding new APIs to specify keys as a list of chunks, allowing "zero-copy" serialization

Reviewing some code under a profiler, I keep seeing the same issue pop up: most of the CPU time (in app or layer code) is spent generating binary keys, and especially allocating buffers, copying bytes from one buffer to another, resizing buffers because they were too small, etc…

The typical key generated in my apps is a tuple composed of a few items, appended to a directory subspace’s prefix. Typical code to generate a single key looks like subspace.Keys.Encode("Hello", 123), which turns into {PARTITION_PREFIX}.{SUBSPACE_PREFIX}.{02 'H' 'e' 'l' 'l' 'o' 00}.{15 7B}.

With some layers, the subspace prefix could even contain strings (at least during the initial prototyping phase, I like to use strings like “Foos”, “Bars”, “FoosByBarId”, …). You can also have a lot of large UUID-like values that are repeated multiple times (when doing index lookups under categories).

In the end, the process always looks like this:

  • allocate a buffer (probably too small) or take one from some pool
  • write the prefix of the subspace (a couple of bytes up to a few dozen)
  • for each item in the tuple
    • append the encoded representation of the item
    • probably, at some point, resize the buffer (allocate a larger one, copy, throw away the previous one)
  • return a span over the buffer that has been filled
  • pass that to the fdb client API
  • throw away the buffer, or, if we are lucky, know that it is safe to put it back in the pool.
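To make the cost concrete, here is a minimal sketch of that loop in C. All names here (KeyBuffer, key_buffer_append) are illustrative, not part of any FDB API; the point is only to show where the allocations and copies happen:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable key buffer illustrating the steps above. */
typedef struct {
    uint8_t* data;
    size_t len;
    size_t cap;
} KeyBuffer;

static void key_buffer_append(KeyBuffer* buf, uint8_t const* bytes, size_t n) {
    if (buf->len + n > buf->cap) {
        /* Resize: allocate a larger buffer, copy, throw away the old one. */
        size_t new_cap = buf->cap ? buf->cap * 2 : 16;
        while (new_cap < buf->len + n) new_cap *= 2;
        buf->data = realloc(buf->data, new_cap);
        buf->cap = new_cap;
    }
    /* One more memcpy per item appended. */
    memcpy(buf->data + buf->len, bytes, n);
    buf->len += n;
}
```

Every key built this way pays for at least one allocation plus one copy per item, before the C client copies the final buffer yet again.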

Repeat that for potentially thousands of keys (in a typical index query that must then do a lookup in the original document) and most of the time is spent moving bytes in memory, before finally passing them to the C API, again and again.

One way to help solve this issue would be to allow passing a list of buffers, instead of a single buffer, to most APIs that take keys. The actual key would be the concatenation of all the buffers, or chunks, that are passed.

For example, in addition to the classic fdb_transaction_set(FDBTransaction* transaction, uint8_t const* key_name, int key_name_length, uint8_t const* value, int value_length), we would be able to pass a list of (pointer, length) pairs as the key:

typedef struct {
    uint8_t const* start; // pointer to the first byte of the chunk
    int length;           // number of consecutive bytes to read
} KeyChunk;

void fdb_transaction_set_chunks(
    FDBTransaction* transaction,
    KeyChunk const* key_chunks, // pointer to an array of KeyChunk structs
    int key_chunks_length,      // length of the chunk array
    uint8_t const* value,
    int value_length
);

Under the hood, the C API could then concatenate all the chunks back together into its own buffer before calling the original fdb_transaction_set (or the same internal implementation).
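That internal step could look something like this (a sketch only; flatten_chunks is a hypothetical helper, not a proposed public API):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t const* start; // pointer to the first byte of the chunk
    int length;           // number of consecutive bytes to read
} KeyChunk;

/* Hypothetical internal helper: flatten the chunk list into a single
 * contiguous key, then hand that to the existing single-buffer path. */
static uint8_t* flatten_chunks(KeyChunk const* chunks, int count, int* out_len) {
    int total = 0;
    for (int i = 0; i < count; i++) total += chunks[i].length;
    uint8_t* key = malloc(total > 0 ? (size_t)total : 1);
    int offset = 0;
    for (int i = 0; i < count; i++) {
        memcpy(key + offset, chunks[i].start, (size_t)chunks[i].length);
        offset += chunks[i].length;
    }
    *out_len = total;
    return key; /* caller would pass (key, total) down the existing path */
}
```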

In the typical case of a composite key with a subspace prefix, I could simply cache the subspace prefix in one buffer, generate only the end of the key in another buffer, and pass an array of 2 or 3 key “chunks”.

When working with tuples and small integers, I can cache the encoded versions of all numbers from, say, 1 to 1000, and pass a pointer to that, instead of having to copy the bytes. The complete key would then be a list of chunks, some of them pointing to cached pre-encoded buffers, while others point to a single buffer that contains the non-cached encoded portion of the key.
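For instance, since a small positive integer encodes as two bytes in the tuple format (123 becomes 15 7B, as shown above), the whole cache for 1..255 fits in 510 bytes filled once at startup. A sketch, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t const* start; // pointer to the first byte of the chunk
    int length;           // number of consecutive bytes to read
} KeyChunk;

/* Hypothetical cache of pre-encoded small integers: the tuple encoding of
 * n in 1..255 is {0x15, n}, so the table is tiny and filled once. */
static uint8_t small_int_cache[255][2];

static void init_small_int_cache(void) {
    for (int n = 1; n <= 255; n++) {
        small_int_cache[n - 1][0] = 0x15;
        small_int_cache[n - 1][1] = (uint8_t)n;
    }
}

static KeyChunk encode_small_int(int n) {
    /* Point at the cached bytes instead of copying them into a key buffer. */
    KeyChunk c = { small_int_cache[n - 1], 2 };
    return c;
}
```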

Sample layer code (in pseudo-code):

KeyChunk[] keyParts = stackalloc KeyChunk[3]; // allocated on the stack, so very cheap
keyParts[0] = subspace.GetCachedPrefixKeyChunk(); // cached and reused frequently
keyParts[1] = SomeEncoder.EncodeString("Hello"); // will probably allocate a small buffer of only a few bytes
keyParts[2] = SomeEncoder.EncodeInteger(123); // will probably use a cached buffer for 123, because it is a small integer

fdb_transaction_set_chunks(
    tr.Handle,
    &keyParts[0],
    keyParts.Length,
    &value[0],
    value.Length
);

Another use case: when generating a key range from a key prefix "AVeryLongPrefixABC", where you need to increment the last byte to produce [ "AVeryLongPrefixABC", "AVeryLongPrefixABD" ], I cannot reuse the same buffer for the two bounds, because the last byte differs. With key chunks, the end key could reuse the buffer of the first key (minus the last byte), plus an extra chunk that points to the single 'D' byte (and again I can keep a cached 256-byte buffer containing all byte values from 0 to 255 and point into that). Basically, I can generate a range without doing any memory allocation.
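The range trick above could be sketched like this (illustrative names; it also assumes the last byte of the key is not 0xFF, which would need the usual carry handling):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t const* start; // pointer to the first byte of the chunk
    int length;           // number of consecutive bytes to read
} KeyChunk;

/* Cached table of every byte value, so the incremented last byte can be
 * referenced without allocating anything. */
static uint8_t all_bytes[256];

/* Build both range bounds over one shared key buffer.
 * Assumes key_len >= 1 and key[key_len - 1] < 0xFF. */
static void make_prefix_range(uint8_t const* key, int key_len,
                              KeyChunk begin[1], KeyChunk end[2]) {
    for (int i = 0; i < 256; i++) all_bytes[i] = (uint8_t)i;
    begin[0].start = key;
    begin[0].length = key_len;
    end[0].start = key;               /* same buffer, minus the last byte */
    end[0].length = key_len - 1;
    end[1].start = &all_bytes[key[key_len - 1] + 1]; /* last byte + 1 */
    end[1].length = 1;
}
```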

Any opinions on this?

I could see the utility here, though if the FDB client has to reconstitute the list of chunks into a single key, then I’m not sure the gains will be all that large. In particular, it seems like it would take CPU work that is currently done in the user’s threads and push it to the network thread, which might actually degrade performance (as the C client is single-threaded).

It’s possible that the bindings should do more of this kind of caching, and then they can do zero-copy operations (or only copy buffer pointers rather than buffers themselves) and only serialize everything into a single byte array right before handing the value to the FDB C client. (Kind of like Protocol Buffers ByteStrings, which, minus the caching, do much of this, especially the RopeByteString implementation).

I’m not sure anything (entirely) solves the problem of having a very long prefix and attempting to increment the last byte. If much of the prefix is something like a subspace prefix and can be memoized, I suppose you could imagine keeping around one array of [pointer to subspace prefix, pointer to "ABC"] and another of [pointer to subspace prefix, pointer to "ABD"]. The buffers for "ABC" and "ABD" still need to be allocated, but the subspace prefix buffer pointer can be shared. So it’s not a zero-allocation operation, but hopefully fewer allocations?

Well, you are right that “zero”-copy is not possible, because someone will have to merge all the chunks into a contiguous buffer, so I guess we can only settle for “at least one copy”.

Currently, the one that has to do it is the binding. But it will have to allocate a buffer, copy the fragments into it, and then call the FDB C client that will also allocate another buffer, and copy the bytes into that buffer. The intermediate buffer between the binding and the C client is a bit redundant and makes the whole thing “at least two copies”.

It’s true that the network thread would have a bit more to do, though I’m not sure the overhead is that great compared to getting rid of one allocation/copy: currently, it copies the whole key with a single memmove call, while with a list of chunks it would need several memmoves on smaller chunks (with the same total number of bytes). I think that in the normal case, there would be 2 or 3 chunks per key.

Regarding the range with a long prefix, the idea is that the ‘end’ key would become [pointer to subspace prefix, pointer to "AB" (minus the C), pointer to a single "D" byte in some cached buffer], so here I would reuse the same buffer as "ABC" without mutating it. The subspace prefix is already cached (and I already cache the end key for the whole subspace in the .NET binding); the issue is with indexes that have long suffixes (128-bit GUIDs, etc…) where the value is not known in advance.

Oh and in case this was not clear, I’m not asking that the FDB C client keep the list of chunks for the whole lifetime of the transaction: the goal is only to get rid of at least one buffer copy. The contract would be the same as before: the C client copies the bytes of the key somewhere, and once the method call to fdb_transaction_xxx(...) returns, the caller is free to reuse the buffers (as it is currently).