Prefix compression in client API

Hi,

I’m wondering whether there is currently any prefix compression in the format in which keys are returned from the storage server to the client. Assuming there is not, would there be interest in me adding such a feature, perhaps behind a knob? For my use case, we have some large-ish queries (100s of MB) where, on average, something like 80% of the key is repeated in every single key returned.

Thanks

There’s no prefix compression in the serialized messages sent from storages to clients. There is an issue tracking this: https://github.com/apple/foundationdb/issues/2189. I made an attempt at this once; here is my PR: https://github.com/apple/foundationdb/pull/3076. It didn’t seem to help: I think the overhead of computing common prefixes offset the gains from smaller network messages, at least in that PR. The common prefixes could probably be computed more efficiently, especially if you can push the work down to the storage engine.

I think what might be even better than prefix compression would be some kind of API where you can just remove common prefixes from the result entirely. For example, let’s say you want to treat some subspace with a common prefix as its own little database. The client is going to ignore the prefix anyway, so why send those bytes over the network at all?

I’m imagining maybe a transaction option to set the prefix to be ignored. Then subsequent reads are done relative to that prefix.

Something like

>>> tr.options.set_common_key_prefix('/longCommonKeyForMySubspace/')
>>> list(tr[:])
[('a',''), ('b',''), ('c','')]

If the actual data in the database were

/longCommonKeyForMySubspace/a -> ''
/longCommonKeyForMySubspace/b -> ''
/longCommonKeyForMySubspace/c -> ''
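
For comparison, here’s roughly what a client does today with the existing Python bindings (get_range_startswith is a real API; everything else here is just the workaround the proposed option would replace):

import fdb

fdb.api_version(630)
db = fdb.open()

PREFIX = b'/longCommonKeyForMySubspace/'

@fdb.transactional
def read_subspace(tr):
    # Today the full keys cross the network and the client strips the prefix
    # afterwards. The proposed option would avoid sending those bytes at all.
    return [(k[len(PREFIX):], v) for k, v in tr.get_range_startswith(PREFIX)]

print(read_subspace(db))  # [(b'a', b''), (b'b', b''), (b'c', b'')]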

Interesting. Ignoring the common prefix would work nearly as well for us, and it would indeed be a lot easier to implement and potentially push down all the way to the storage server. We’re probably about a month away from being able to rigorously test this. How would I go about starting? Should I create a GitHub issue first, or should I get some proof that this actually works before bringing it to the community?

This is something I’ve wanted to do for a while.

@andrew.noyes A full prefix check on every consecutive key pair of every set of keys costs too much CPU, but there are better ways.

If a set of keys (or things containing keys) is sorted, which is true in most cases in FDB, there is a cheap way to get much of the benefit for very little send-side CPU: compare the first key to the last key to get a common prefix to factor out. The message format becomes roughly

numItems | commonPrefixLen | commonPrefixBytes | [itemsWithKeySuffixes...]

If the set is not sorted, then commonPrefixLen is 0 by default, and the decoding side can shortcut that case, skipping the concatenation that would otherwise restore the keys when decompression is requested.

If the creator of the set knows that although the set isn’t sorted it does have a common prefix, then the common prefix could be initialized explicitly.
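
To make that concrete, here’s a minimal Python sketch of the encode/decode paths (the function names and framing are mine; the real message would go through FDB’s serializer):

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def encode(items, is_sorted=True, known_prefix_len=None):
    # items: list of (key, value) pairs.
    # Layout sketch: numItems | commonPrefixLen | commonPrefixBytes | suffix items
    keys = [k for k, _ in items]
    if known_prefix_len is not None:
        plen = known_prefix_len                      # creator supplied it explicitly
    elif is_sorted and len(keys) > 1:
        plen = common_prefix_len(keys[0], keys[-1])  # one comparison: first vs last
    else:
        plen = 0                                     # unsorted: no compression
    prefix = keys[0][:plen] if keys else b''
    return (len(items), plen, prefix, [(k[plen:], v) for k, v in items])

def decode(msg):
    num_items, plen, prefix, suffix_items = msg
    if plen == 0:
        return suffix_items                          # shortcut: nothing to concatenate
    return [(prefix + suffix, v) for suffix, v in suffix_items]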

For a sorted set, there is also a way to produce the full list of pairwise prefix lengths: start with the first-last common prefix and bisect the set recursively, passing each enclosing set’s common prefix as the starting point for prefix detection in its subsets. Redwood essentially does this to generate its prefix-compressed binary trees.
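
Here’s a sketch of how I read that recursion (illustrative Python, not Redwood’s actual code):

def common_from(a: bytes, b: bytes, start: int) -> int:
    # Extend a prefix already known to match for `start` bytes.
    n = min(len(a), len(b))
    i = start
    while i < n and a[i] == b[i]:
        i += 1
    return i

def pairwise_prefix_lens(keys):
    # lens[i] = common prefix length of keys[i-1] and keys[i], for sorted keys.
    lens = [0] * len(keys)
    def go(lo, hi, base):
        # Common prefix of the whole [lo, hi] range; the comparison skips the
        # `base` bytes the enclosing range already proved equal.
        cp = common_from(keys[lo], keys[hi], base)
        if hi - lo == 1:
            lens[hi] = cp
            return
        mid = (lo + hi) // 2
        go(lo, mid, cp)
        go(mid, hi, cp)
    if len(keys) > 1:
        go(0, len(keys) - 1, 0)
    return lens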

I’m not sure how much CPU benefit there would be, but we could save some memcopies inside the storage engine by not copying a prefix known in advance. That could be implemented by simply not copying the shared prefix between the range request begin and end keys. If adding an extra seek to the last key before copying all the keys were cheap enough, we could even do that.
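
As an illustration of “a prefix known in advance”: every key in [begin, end) must start with the common prefix of begin and end, so the engine can compute it from the request alone (a sketch; the function name is mine):

def request_common_prefix(begin: bytes, end: bytes) -> bytes:
    # Any key k with begin <= k < end shares the bytes begin and end share,
    # so this prefix is known before a single key is read and its bytes
    # need not be copied per result.
    n = min(len(begin), len(end))
    i = 0
    while i < n and begin[i] == end[i]:
        i += 1
    return begin[:i]

# request_common_prefix(b'/longCommonKeyForMySubspace/',
#                       b'/longCommonKeyForMySubspace0')  # strinc of the begin key
# -> b'/longCommonKeyForMySubspace'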

If we use @SteavedHams’ compressed format, that is easily compatible with storage servers that do no prefix compression. We could pretty easily add a client-side API to only concatenate bytes after a known prefix to implement @andrew.noyes’ idea. For example, the data might be:

/longCommonKeyForMySubspace/aa -> ''
/longCommonKeyForMySubspace/ab -> ''
/longCommonKeyForMySubspace/ac -> ''

The wire response would carry the common prefix /longCommonKeyForMySubspace/a plus the suffixes a, b, c, and the client could request to ignore /longCommonKeyForMySubspace/ so that it receives aa, ab, ac.
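
A sketch of that client-side concatenation step, assuming the server sends (commonPrefix, suffixes) per @SteavedHams’ format (the function name is mine):

def decode_with_ignored_prefix(common_prefix: bytes, suffixes, ignored: bytes):
    # The server factored out common_prefix; the client asked to drop
    # `ignored`, which this sketch assumes is itself a prefix of common_prefix.
    assert common_prefix.startswith(ignored)
    keep = common_prefix[len(ignored):]  # b'a' in the example above
    return [keep + s for s in suffixes]

# decode_with_ignored_prefix(b'/longCommonKeyForMySubspace/a',
#                            [b'a', b'b', b'c'],
#                            b'/longCommonKeyForMySubspace/')
# -> [b'aa', b'ab', b'ac']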

@SteavedHams In what cases are keys not sorted in FDB?

It seems like adapting @andrew.noyes’ PR to find the common prefix by checking first/last for sorted keys would be a good starting point. I won’t be able to get to that in the next couple of weeks, but I should be able to take a stab at it in the nearish future.

It’s a little more complicated than that: due to internally imposed byte limits, you usually can’t know the end of your result without iterating from the beginning until you hit the byte limit.

I like this idea

There’s an existing concept of a “VectorRefPreserializer”, which tries to keep track of the serialized size of a VectorRef incrementally so that we don’t need to iterate over the vector at serialization time to figure out how much memory to allocate. I think we can remove this concept entirely when implementing @SteavedHams’ idea, since it won’t really apply anymore, and it was specifically introduced to try to help GetKeyValuesReply messages. I don’t think we even hit the fast path anyway, due to some use of non-const methods.

I think we could instead try using serializing/deserializing hooks in GetKeyValuesReply::serialize, in the style of the new MutationRef implementation: https://github.com/apple/foundationdb/blob/2efbe7443b8e984abcfd4642c9b84776d35aebb6/fdbclient/CommitTransaction.h#L109. This would be a protocol change, so it’d need to go into the master branch.

This would be an example of where the creator of the set of things knows there is a common prefix to all of them and can just initialize it directly. The return data type for range read needs to be something that supports this, not just a vector of key value refs.

Due to item count and response size limits, in most cases you can’t seek directly to the last key that would be returned.

I like this, and I would also be OK with going a bit further to say that the wire protocol for RangeRead should explicitly exclude prefix bytes common to the begin and end of the range.

I’m not actually sure. Currently I don’t think any interfaces require sorted input; I just suspect that, as a side effect of the source data structures, most things arrive sorted. On the resolvers, if the key sets for each transaction ID are sent as separate objects, then within each set the keys are likely sorted if the client sent them in sorted order; otherwise they likely are not.

Actually, this approach uses more copies than necessary. I think specializing dynamic_size_traits for a special type that’s just like VectorRef but guaranteed to be sorted would be better.

Here’s a draft implementation: https://github.com/apple/foundationdb/pull/3429/files
