Design document of internals & storage?

(Nikhil Bafna) #1

I went over the docs in

I wanted to understand the internals in more depth, specifically around storage. My guess is that scans are fast because of how the underlying storage is designed. I also wanted to understand where the 100 KB limit on value size comes from.

Any pointers to existing docs would also be really helpful.

(Dave Rosenthal) #2

The design constraints are:

  • There should be a simple documented limit rather than a slow descent into pain as values get larger
  • The limit should be small enough that reading a single KV doesn’t represent a noticeable latency blip for the entire server
  • The limit should be high enough to not be annoying to developers
  • The limit should be high enough that the cost of reading the bytes associated with the value is significantly more expensive than the per-request overhead (this is to enable a low abstraction penalty for applications storing large data across many key-value pairs)

Given FoundationDB’s use of SSDs, 100K was chosen years ago as fitting those constraints.

A layer can easily build a “large-value” abstraction that supports seeking and streaming of large values by storing, say, 64K at a time in multiple keys.
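To make the chunking idea concrete, here is a minimal sketch in Python. It is not FoundationDB code: a plain dict stands in for the key-value store, and the names `chunk_key`, `write_large` and `read_range` are made up for illustration. The only thing taken from the post above is the 64K chunk size.

```python
CHUNK_SIZE = 64 * 1024  # 64K per key, the figure suggested above


def chunk_key(name: bytes, index: int) -> bytes:
    # Hypothetical key layout: a fixed-width big-endian index keeps
    # the chunks of one value sorted in order under a common prefix.
    return name + b"\x00" + index.to_bytes(8, "big")


def write_large(store: dict, name: bytes, data: bytes) -> None:
    # Split the value into CHUNK_SIZE pieces, one key per piece.
    for offset in range(0, len(data), CHUNK_SIZE):
        key = chunk_key(name, offset // CHUNK_SIZE)
        store[key] = data[offset:offset + CHUNK_SIZE]


def read_range(store: dict, name: bytes, offset: int, length: int) -> bytes:
    # "Seek" by computing which chunks cover the requested byte range,
    # then read only those chunks instead of the whole value.
    if length <= 0:
        return b""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    parts = []
    for index in range(first, last + 1):
        chunk = store.get(chunk_key(name, index))
        if chunk is None:  # ran past the end of the value
            break
        parts.append(chunk)
    start = offset - first * CHUNK_SIZE
    return b"".join(parts)[start:start + length]
```

Against a real cluster, the loop in `read_range` would presumably become a single range read over the chunk keys, which is also what makes streaming natural.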

An interesting improvement to FDB would be to have its native API support streaming large values, but it would probably be a breaking API change.

(Nikhil Bafna) #3

Thanks for the super quick reply!
Makes sense and makes use-case definition much more explicit.

(Evan Tschannen) #4

Dave did a good job explaining the basic considerations behind our value size limit.

Although it does not go into too much detail about low level storage, I added some more information about our architecture in a different thread that might be interesting.

(Ben Collins) #5

I’ll also point you at the “blob” design recipe. It shows a very simple way to use more than one key-value pair to store a larger value. It does not, however, get a developer past the total transaction size limits. To accomplish this, and do it in a safe manner, one would have to:

  • choose an identifier for the object (see the high-contention allocator for a sophisticated example)
  • spread writes across more than one transaction and
  • use some level of indirection as a final step to register the written values as the content for the chosen identifier
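The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the blob recipe itself: a plain dict stands in for the database, each batch of writes stands in for one transaction, `uuid4` is a simple substitute for the high-contention allocator, and the `data/` and `name/` key prefixes, `CHUNKS_PER_TXN`, `upload` and `download` are all hypothetical.

```python
import uuid

CHUNK_SIZE = 64 * 1024   # bytes per key-value pair
CHUNKS_PER_TXN = 16      # hypothetical batch size, chosen to stay well under the transaction limit


def upload(store: dict, name: bytes, data: bytes) -> bytes:
    # Step 1: choose an identifier for the object (random 16 bytes here).
    object_id = uuid.uuid4().bytes

    # Step 2: spread the chunk writes across several "transactions".
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for start in range(0, len(chunks), CHUNKS_PER_TXN):
        batch = {}
        for index in range(start, min(start + CHUNKS_PER_TXN, len(chunks))):
            batch[b"data/" + object_id + index.to_bytes(8, "big")] = chunks[index]
        store.update(batch)  # stands in for committing one transaction

    # Step 3: one small final write registers the object under its name.
    # Readers never see a partially written object, because the pointer
    # only appears after every chunk is in place.
    store[b"name/" + name] = object_id + len(data).to_bytes(8, "big")
    return object_id


def download(store: dict, name: bytes):
    pointer = store.get(b"name/" + name)
    if pointer is None:
        return None  # nothing registered under this name
    object_id = pointer[:16]
    size = int.from_bytes(pointer[16:], "big")
    count = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return b"".join(
        store[b"data/" + object_id + i.to_bytes(8, "big")] for i in range(count)
    )
```

One consequence of this layout is that an upload which fails partway leaves orphaned chunks under an identifier nobody points to, so a real layer would likely want some cleanup pass for those.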