If you decide to store the object as a single blob, you pay the cost of reading everything back, even when you only need a single field. If you always read the entire document anyway (for example, to return it as JSON via a REST API), that does not matter much.
If you have a few fields that are mutated frequently (last accessed timestamp, etc.), they could be stored in a separate chunk: group fields by access pattern.
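As a rough sketch of grouping fields by access pattern: the snippet below uses a plain Python dict as a stand-in for the key-value store, and the key layout, the `HOT_FIELDS` set and the function names are all hypothetical. The point is only that mutating the hot fields rewrites a small value and leaves the large one untouched.

```python
import json

# A plain dict stands in for the key-value store (assumption, not a real client).
# Keys are tuples, values are bytes.
store = {}

# Hypothetical set of frequently mutated fields.
HOT_FIELDS = {"last_accessed", "hit_count"}

def save_document(doc_id, doc):
    """Split the document into a rarely-changing chunk and a hot chunk."""
    cold = {k: v for k, v in doc.items() if k not in HOT_FIELDS}
    hot = {k: v for k, v in doc.items() if k in HOT_FIELDS}
    store[("docs", doc_id, "cold")] = json.dumps(cold, sort_keys=True).encode()
    store[("docs", doc_id, "hot")] = json.dumps(hot, sort_keys=True).encode()

def touch(doc_id, timestamp):
    """Update only the hot chunk; the cold chunk is neither read nor rewritten."""
    key = ("docs", doc_id, "hot")
    hot = json.loads(store[key])
    hot["last_accessed"] = timestamp
    hot["hit_count"] = hot.get("hit_count", 0) + 1
    store[key] = json.dumps(hot, sort_keys=True).encode()

def load_document(doc_id):
    """Reassemble the full document from both chunks."""
    doc = json.loads(store[("docs", doc_id, "cold")])
    doc.update(json.loads(store[("docs", doc_id, "hot")]))
    return doc
```

Reading the whole document now costs two reads instead of one, so this only pays off when the hot fields really are updated much more often than the rest.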
Regarding storage space, the choice of layout can have a non-trivial effect:
With documents stored as a single compressed chunk:
With documents split into elementary fields:
FIELD ids will probably also be stored in the single-chunk document, so their bytes have only moved from the value part to the key part, which does not change much (maybe 2 additional bytes for the tuple encoding if they are strings,…)
KEY will be repeated multiple times, once per field. Depending on the size of the key (are you using a 128-bit UUID for the documents? a single small int? a compound key?), this can take more space than the document itself.
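To get a feel for the numbers, here is a back-of-the-envelope estimate. It ignores compression and any framing overhead of the document format, and the 2-byte tuple overhead per field name is the rough figure mentioned above, not a measured value. The function names are made up for the illustration.

```python
def single_chunk_bytes(key_len, field_names, values):
    # One row: the key appears once, field names and values live in the value.
    return key_len + sum(len(n) for n in field_names) + sum(len(v) for v in values)

def split_fields_bytes(key_len, field_names, values, tuple_overhead=2):
    # One row per field: the document key is repeated in every row, and each
    # field name moves into the key with a small encoding overhead.
    return sum(key_len + len(n) + tuple_overhead + len(v)
               for n, v in zip(field_names, values))

# Example: 16-byte UUID key, 10 fields with 8-character names and 12-byte values.
names = [f"field_{i:02d}" for i in range(10)]
values = [b"x" * 12] * 10
print(single_chunk_bytes(16, names, values))  # 216
print(split_fields_bytes(16, names, values))  # 380
```

In this example the split layout adds 164 bytes: 9 extra copies of the 16-byte key plus 2 bytes of overhead for each of the 10 field names. The extra copies of the key alone (144 bytes) already exceed the 120 bytes of actual values.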
This is an important question: no, FoundationDB does not handle compression of the keys or values, this is the responsibility of the Layer.
But if you split the documents into fragments, you will not be able to compress much of anything. There are some compression schemes for short strings, but this will not offset the loss in compression ratio, and you have to repeat the key on top of that.
If you don’t care about disk space, only about latency of randomly peeking or mutating fields in large documents, this could be ok.
If you want to compress as much as possible, then storing the document as a single chunk will offer better compression, BUT not that much either: it is very likely that there is not a lot of redundant data inside a single document. The best compression ratio would come from compressing batches of documents together, but this is even less efficient for reading small parts of random documents.
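A quick way to see the difference between compressing documents one by one and compressing them as a batch. This sketch uses the standard-library `zlib` and made-up JSON documents (`make_doc` is hypothetical), so the exact ratios will not match your data; only the direction of the effect is the point.

```python
import json
import zlib

def make_doc(i):
    # Hypothetical small document; real documents will differ.
    return {"id": i, "name": f"user-{i}", "email": f"user-{i}@example.com",
            "status": "active", "roles": ["reader", "writer"]}

docs = [json.dumps(make_doc(i)).encode() for i in range(200)]

raw_total = sum(len(d) for d in docs)
# Each document compressed on its own: little redundancy to exploit.
individual_total = sum(len(zlib.compress(d)) for d in docs)
# All documents compressed together: the shared structure is stored once.
batch_total = len(zlib.compress(b"\n".join(docs)))

print("raw:", raw_total)
print("compressed individually:", individual_total)
print("compressed as one batch:", batch_total)
```

The batch is much smaller because field names and common values repeat across documents, but reading a single document then means decompressing the whole batch.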
To help solve this, you may need to look at compression libraries that offer dictionary support. I use ZStandard for this task, because it is faster than zlib, and also supports dictionaries (and it comes with a dictionary trainer!). This article gives an overview (go to “Small data” section).
Without dictionaries, you may discover that zlib compresses a little bit better than zstd. Check this issue to understand why, and see how dictionaries can help.
The ideal solution would be to have a set of “dictionaries” each with an index (starting at 1). Each document is compressed with a dictionary, which is also stored in the database. Since dictionaries are immutable, they can be cached by front-end layers.
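A minimal sketch of the "indexed dictionaries" idea. To stay runnable with only the standard library, it uses zlib's preset-dictionary support (`zdict`) as a stand-in for ZStandard dictionaries; the in-memory `dictionaries` map, the 2-byte generation prefix and the function names are assumptions made for the illustration, not part of any existing layer.

```python
import json
import zlib

# Stand-in for the dictionaries stored in the database (and cached by front-ends).
# Generation id -> dictionary bytes; entries are never modified once written.
dictionaries = {}

def add_dictionary(dict_bytes):
    gen = len(dictionaries) + 1  # generations start at 1
    dictionaries[gen] = dict_bytes
    return gen

def compress_doc(doc_bytes, gen):
    # Prefix the payload with the generation id so readers know which
    # dictionary to use (2-byte big-endian prefix is an arbitrary choice).
    c = zlib.compressobj(level=9, zdict=dictionaries[gen])
    return gen.to_bytes(2, "big") + c.compress(doc_bytes) + c.flush()

def decompress_doc(blob):
    gen = int.from_bytes(blob[:2], "big")
    d = zlib.decompressobj(zdict=dictionaries[gen])
    return d.decompress(blob[2:]) + d.flush()

# Hand-written dictionary containing the skeleton shared by the documents.
gen1 = add_dictionary(
    b'{"id": , "name": "user-", "email": "@example.com", '
    b'"status": "active", "roles": ["reader", "writer"]}'
)
doc = json.dumps({"id": 42, "name": "user-42", "email": "user-42@example.com",
                  "status": "active", "roles": ["reader", "writer"]}).encode()

blob = compress_doc(doc, gen1)
print(len(doc), len(zlib.compress(doc, 9)), len(blob))
```

Because each stored value carries its generation id, old documents remain readable after a new dictionary is introduced, and they can be recompressed lazily the next time they are written.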
Every N insertions or mutations, you could have a background task that takes a random sample of the documents, trains a new dictionary, performs a test compression and, if the result is 5%-10% better than with the previous dictionary, creates a new dictionary generation. This will help you avoid the scenario where a shift in the content of your data degrades the compression ratio over time.
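The decision step of such a background task could look roughly like this. Again zlib preset dictionaries stand in for ZStandard, and `naive_dictionary` is NOT a real trainer: it just concatenates sample documents, where zstd's dictionary trainer would do a much better job. All names and the default thresholds are hypothetical.

```python
import random
import zlib

def compressed_size(docs, zdict):
    """Total compressed size of docs when each is compressed with zdict."""
    total = 0
    for d in docs:
        c = zlib.compressobj(zdict=zdict)
        total += len(c.compress(d) + c.flush())
    return total

def naive_dictionary(samples, max_size=32 * 1024):
    # Placeholder for a real trainer: concatenate samples and keep the
    # last 32 KiB, which is the largest window zlib can reference.
    return b"".join(samples)[-max_size:]

def maybe_new_generation(all_docs, current_dict, sample_size=100,
                         min_gain=0.05, rng=random):
    """Return a candidate dictionary if it beats current_dict by min_gain, else None."""
    sample = rng.sample(all_docs, min(sample_size, len(all_docs)))
    half = len(sample) // 2
    train, test = sample[:half], sample[half:]  # do not test on training data
    candidate = naive_dictionary(train)
    if not candidate:
        return None
    old_size = compressed_size(test, current_dict)
    new_size = compressed_size(test, candidate)
    return candidate if new_size <= old_size * (1 - min_gain) else None
```

Testing on documents that were not used for training avoids promoting a dictionary that merely memorised its own sample.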