Best practice of storing structs. Should I pack or store fields separately

(Illarionov Oleg) #1

Hi, I work on something like Go document layer and I am not sure what is better.
Lets say I have an object, like:

type dbUser struct {
	ID   int64
	Name string
	Bio  string
}

Should I pack this object, for example with msgpack, and store it under a single key, or is it better to store each field under a separate key and fetch all the subkeys together?

It is clear to me that the second approach is far more flexible, but I am mostly interested in performance. Would packing give me any significant performance improvement?

(Ryan Worl) #2

Presumably you’ll need to index at least some of those fields for individual lookups rather than doing a full range scan. A good compromise would be to store the entire document serialized (whatever works for you) and then add indexes as other keys with a pointer back to the original serialized record as the value.

Additionally, if your workload involves frequent updates to individual properties of a record rather than re-writing the entire thing, it may be worth breaking out some of the fields into their own keys to make updates involve less IO.


Data: users/data/1 => name:john, age:34
Index: users/index/age/34 => 1

(Will Wilson) #3

It basically depends on whether you are frequently reading/writing small parts of your documents, and on how large your documents are.

If you’re making a generic document database, it’ll be hard for you to pick one strategy that works best for all of your customers. You might think about whether you can make the decision configurable, or even adaptive depending on the workload.

The FoundationDB document layer supported two modes: fully expanded, and packed into 4k blocks.

(Illarionov Oleg) #4

Thank you for this information. One more question, since I could not find the answer anywhere on my own: does FoundationDB provide compression (zip archiving) of data, or is it better to implement packing on the layer side?

(Christophe Chevalier) #5

If you decide to store the object as a single blob, you will need to pay the cost of reading everything back, even just to read a single field. If you always read the entire document anyway (e.g. to return it as JSON via a REST API), that does not matter much.

If you have a few fields that are mutated frequently (last-accessed timestamp, etc.), they could be stored in a different chunk (group fields by access pattern).

Regarding storage space, this can have some non-trivial effects:

With documents stored as a single compressed chunk:

With documents split into elementary fields:

the FIELD ids will probably also be stored in the single-chunk document, so their bytes have only moved from the value part to the key part, which does not change much (maybe 2 additional bytes for the tuple encoding if they are strings, …)

But the KEY will be repeated multiple times, once per field. Depending on the size of the key (are you using a 128-bit UUID for the documents? a single small int? a compound key?), this can take more space than the document itself.
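A back-of-the-envelope calculation makes the overhead concrete. The sizes below are assumptions (an 8-byte subspace prefix, a 128-bit UUID document id, 2-byte field ids, 20 fields per document), but the shape of the result holds for any similar layout:

```go
package main

import "fmt"

func main() {
	const (
		prefixLen  = 8  // subspace / directory prefix (assumed)
		docIDLen   = 16 // 128-bit UUID document id
		fieldIDLen = 2  // small integer field id in the tuple
		numFields  = 20
	)
	single := prefixLen + docIDLen                           // one key for the whole document
	split := (prefixLen + docIDLen + fieldIDLen) * numFields // one key per field
	fmt.Println(single, split)                               // 24 520
}
```

So with these assumptions the split layout spends 520 bytes on keys alone per document, versus 24 bytes for the single-chunk layout: easily more than a small document's actual content.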

This is an important question: no, FoundationDB does not handle compression of the keys or values, this is the responsibility of the Layer.

But if you split the documents into fragments, you will not be able to compress much of anything. There are some compression schemes for short strings, but this will not offset the loss in compression ratio, and you have to repeat the key on top of that.

If you don’t care about disk space, only about latency of randomly peeking or mutating fields in large documents, this could be ok.

If you want to compress as much as possible, then storing the document as a single chunk will offer better compression, BUT not that much either: it is very likely that there is not a lot of redundant data inside a single document. The best compression ratio would come from compressing batches of documents together, but that is even less efficient for reading small parts of random documents.

To help solve this, you may need to look at compression libraries that offer dictionary support. I use ZStandard for this task, because it is faster than zlib, and also supports dictionaries (and it comes with a dictionary trainer!). This article gives an overview (go to “Small data” section).

Without dictionaries, you may discover that zlib compresses a little bit better than zstd. Check this issue to understand why, and see how dictionaries can help.

The ideal solution would be to have a set of “dictionaries” each with an index (starting at 1). Each document is compressed with a dictionary, which is also stored in the database. Since dictionaries are immutable, they can be cached by front-end layers.

Every N insertions or mutations, you could have a background task that takes a random sample of the documents, trains a new dictionary, performs a test compression, and, if the result is 5%-10% better than with the previous dictionary, creates a new dictionary generation. This will help you avoid the scenario where a shift in the content of your data degrades the compression ratio over time.
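For that scheme, each stored value needs to record which dictionary generation it was compressed with. One way to do it (the layout and key names here are a hypothetical sketch, not part of any existing layer) is to prefix the compressed payload with the generation number, and keep the immutable dictionaries under their own keys so front-ends can cache them:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Value layout: 4-byte big-endian dictionary generation, then the
// compressed payload. Dictionaries live under e.g. "dicts/<generation>"
// and are immutable, so they can be cached by front-end layers forever.

func encode(gen uint32, compressed []byte) []byte {
	v := make([]byte, 4+len(compressed))
	binary.BigEndian.PutUint32(v, gen)
	copy(v[4:], compressed)
	return v
}

func decode(v []byte) (gen uint32, compressed []byte) {
	return binary.BigEndian.Uint32(v[:4]), v[4:]
}

func main() {
	v := encode(3, []byte("...zstd bytes..."))
	gen, payload := decode(v)
	fmt.Println(gen, string(payload)) // 3 ...zstd bytes...
}
```

Reads then become: decode the generation, fetch (or hit the cache for) `dicts/<gen>`, decompress. Old generations must be kept as long as any value still references them.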

(Will Wilson) #6

The point Christophe made about the lack of prefix compression in keys is especially important. I think this is a common enough layer concern that it should probably become an optional storage engine feature someday (as discussed here: Discussion thread for new storage engine ideas).