I know the FDB documentation clearly states that value sizes can't go beyond 100 KB, but I see a client knob exposed that allows me to tweak it. I grepped through the codebase expecting to find some kind of check enforcing that upper limit, but couldn't find any, which got me wondering why the 100 KB limit exists in the first place.
AFAIK, the SQLite storage engine shouldn't need to care about value sizes, since accessing a value goes through an IO op anyway.
There are two possible reasons I can come up with for limiting the value size, though neither of them explains the specific 100 KB figure to me:
First, since the various processes communicate through an RPC mechanism over the network, and network latencies are unpredictable, an upper bound lets you simulate worst-case scenarios and get a handle on the expected behaviour. Given the rigorous simulation/testing framework FDB has, this is my primary bet.
Second, if we allowed larger value sizes, we would presumably need what I think the MySQLs and DB2s of the world do to support BLOB-typed columns: when the contents can't fit in a single node, they use some kind of pointer indirection, which costs an extra IO to fetch the overflow page(s) from disk. That extra IO can add up significantly at the scale FDB intends to support: a 4-level BTree with a branching factor of ~1k can already store a billion keys, so an extra IO per lookup can be disastrous.
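For a sense of scale, here is the back-of-the-envelope arithmetic I'm basing that on (the numbers are illustrative assumptions on my part, not anything measured against FDB):

```python
# Illustrative only: a BTree with branching factor ~1000 addresses a billion
# keys after three fan-out steps, so an uncached point lookup touches ~4 pages.
# One extra overflow-page read per large value then adds ~25% more IO per
# lookup, before the overflow chain itself grows beyond a single page.
branching_factor = 1_000
levels = 4
keys_addressable = branching_factor ** (levels - 1)  # 1000**3 = 1_000_000_000
reads_per_lookup = levels                            # root down to leaf
extra_io_fraction = 1 / reads_per_lookup             # 0.25 -> +25% per lookup
print(keys_addressable, reads_per_lookup, extra_io_fraction)
```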
Firstly, I wanted to understand whether either of the above assumptions is correct with respect to the value size limit. I have seen this question come up repeatedly on various forums for different KV stores, but none provides a satisfactory answer.
Secondly, why exactly 100 KB is what befuddles me. Could it have been 50 KB, or 500 KB? Was it the result of simulations run with varied value sizes, in which 100 KB performed best? If so, can anyone share what the invariants of such an experiment were?
Thirdly, are there any issues with increasing the MAX_VALUE_SIZE knob, and correspondingly the MAX_TRANSACTION_SIZE knob, to allow for larger values, if I'm okay with taking a performance hit?
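For concreteness, this is roughly how I was planning to bump them from a client before the database is opened (just a sketch: the knob names are the ones mentioned above, and whether the generic knob network option is the right way to pass them is an assumption on my part):

```python
import fdb

fdb.api_version(630)

# Assumed knob names and values; the defaults are 100 KB per value and
# 10 MB per transaction. Network options have to be set before fdb.open()
# starts the client network.
fdb.options.set_knob("max_value_size=500000")
fdb.options.set_knob("max_transaction_size=20000000")

db = fdb.open()
```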
I think your observations are correct. At the storage level, one problem is that SQLite handles large values quite poorly (it just has one overflow page per value, which is not the best way of supporting these). I think @SteavedHams fixed this, so that should be better now.
Another problem with large values (and therefore large transactions) is tail latency: if you have a very large transaction, other transactions that run at roughly the same time will also suffer. This is because FDB needs to serialize transactions in several places.
IMO a better way of writing large values is to abstract this in the layer (the Record Layer might already provide this functionality somewhere, though I am not sure). You can break your value down into multiple KV pairs, and you can even write it across multiple transactions.
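Something along these lines, for example (a minimal sketch with the Python bindings; the chunk size and key scheme here are assumptions of mine, not an existing API):

```python
import fdb

fdb.api_version(630)

CHUNK = 90_000  # keep each KV pair comfortably under the 100 KB value limit

@fdb.transactional
def write_blob(tr, space, data):
    # Replace any previous chunks, then store the data as chunks keyed by
    # their byte offset so they sort back into order.
    tr.clear_range_startswith(space.key())
    for off in range(0, len(data), CHUNK):
        tr[space.pack((off,))] = data[off:off + CHUNK]

@fdb.transactional
def read_blob(tr, space):
    # Chunks come back in key (offset) order, so concatenation reassembles them.
    return b"".join(kv.value for kv in tr.get_range_startswith(space.key()))

db = fdb.open()
space = fdb.Subspace(("blobs", "some_large_value"))
write_blob(db, space, b"x" * 1_000_000)
assert read_blob(db, space) == b"x" * 1_000_000
```

If a single write would still push a transaction past its size limit, the same scheme works with a few chunks per transaction, at the cost of readers needing some way to tell when a blob is complete.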
If you need to do this often, though, I would not even do that. Instead I would use a blob store (for example S3). S3 gives you the useful guarantee that writing a new object is strongly consistent, so as long as you never update an object you won't lose consistency. You could therefore store pointers to S3 objects and write without any size limits. Obviously this is not quite trivial (you also need to handle failures, garbage collection, etc.), but it definitely works very well (this is roughly how Snowflake works, after all).
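A minimal sketch of that pointer approach (the bucket name, key scheme, and use of boto3 are assumptions here; real code would also need the failure handling and garbage collection mentioned above):

```python
import uuid

import boto3
import fdb

fdb.api_version(630)
db = fdb.open()
s3 = boto3.client("s3")
BUCKET = "my-blob-bucket"  # assumed bucket name

@fdb.transactional
def set_pointer(tr, name, object_key):
    # The small, mutable piece of state lives in FDB: a pointer to an
    # immutable S3 object.
    tr[fdb.tuple.pack(("blob_ptr", name))] = fdb.tuple.pack((object_key,))

@fdb.transactional
def get_pointer(tr, name):
    v = tr[fdb.tuple.pack(("blob_ptr", name))]
    if not v.present():
        return None
    return fdb.tuple.unpack(v)[0]

def write_large_value(name, data):
    # Write a brand-new object first, then commit the pointer. If the commit
    # fails, the orphaned object is harmless and can be swept up by a GC pass.
    object_key = "blobs/" + str(uuid.uuid4())
    s3.put_object(Bucket=BUCKET, Key=object_key, Body=data)
    set_pointer(db, name, object_key)

def read_large_value(name):
    object_key = get_pointer(db, name)
    if object_key is None:
        return None
    return s3.get_object(Bucket=BUCKET, Key=object_key)["Body"].read()
```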
I don't have strong memories of this, but I don't recall there being a lot of experimentation around the chosen size (which isn't necessarily to say it didn't happen). We knew we wanted a limit, and we may have just decided that 100 KB seemed reasonable enough to meet our needs and provide a useful amount of space for clients. Of course, no matter where we chose our limits, there would need to be some strategy for what to do when larger data needs to be stored.
For transaction size, we arguably chose a limit that was too high relative to what we could actually do. At least for a while, committing a transaction near the limit could stall all commits for many seconds. I think this has improved somewhat, but it is probably still a concern and you should be aware that a commit from one client could have negative impacts on the commits from others when transactions are large.
I only fixed the slack space issue that would result when many KV pairs were just large enough to overflow, requiring a dedicated overflow page and using a very small amount of it.
For very large values, SQLite stores them as a linked list of overflow pages, which in this case are 4 KB each, so a 100 KB value would add about 25 additional serial read latencies to an uncached read.
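The page math, for anyone who wants to plug in their own sizes (nominal figures that ignore per-page overhead and the portion of the value stored inline in the leaf cell):

```python
# ~100 KB spread over 4 KB overflow pages is ~25 pages, and they are read
# serially because each overflow page only points to the next one.
value_size = 100 * 1024
overflow_page_size = 4 * 1024
serial_overflow_reads = value_size // overflow_page_size  # 25
print(serial_overflow_reads)
```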
The Redwood storage engine will be better in this regard as the equivalent of those reads will be done in parallel, but for the other reasons given here it’s still usually not a good idea to use very large values.
Another problem with large values (and therefore large transactions) is tail latency: if you have a very large transaction, other transactions that run at roughly the same time will also suffer. This is because FDB needs to serialize transactions in several places.
Yes, completely agree; I'm expecting to take a performance hit, and quantifying it would probably be the first thing we'd do. Given that such values (and correspondingly such transactions) would be very infrequent, do you think values that do not breach the 100 KB limit would be impacted?
If you need to do this often, though, I would not even do that. Instead I would use a blob store (for example S3). S3 gives you the useful guarantee that writing a new object is strongly consistent.
Unfortunately, we're not there yet in terms of cloud adoption. This is an interesting direction though, and I have seen a lot of KV stores offer blob storage in the cloud. Our use case, as I mentioned earlier, doesn't encounter many such values, so we're thinking of getting away with tweaking the knobs and seeing how it goes.
The Redwood storage engine will be better in this regard as the equivalent of those reads will be done in parallel, but for the other reasons given here it’s still usually not a good idea to use very large values.
Eagerly awaiting that; the benefits will be pretty significant for us!