Hi, I am a software engineer at Snowflake Computing working on FoundationDB. My understanding is that the storage queue size is calculated from pessimistic estimates. When a mutation is added to a storage server’s mutation log, the size of the storage queue (bytesInput - bytesDurable, as reported to the ratekeeper) is incremented by mvccStorageBytes(m):
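(Reproduced roughly from fdbserver/storageserver.actor.cpp; the exact formatting may differ.)

```cpp
// Charges two PTree insertions per mutation, each at the full VersionedMap
// overheadPerItem (4 x 128-byte nodes), plus twice the mutation's own bytes.
static int mvccStorageBytes( MutationRef const& m ) {
	return VersionedMap<KeyRef, ValueOrClearToRef>::overheadPerItem * 2 +
	       (MutationRef::OVERHEAD_BYTES + m.param1.size() + m.param2.size()) * 2;
}
```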
This accounts for eight 128-byte PTree nodes being allocated for every mutation. However, this looks like a worst-case estimate: each PTree insertion does not necessarily allocate four 128-byte nodes (even though VersionedMap::overheadPerItem == 128*4), and applying a mutation sometimes requires only one insertion into the PTree (e.g. in the case of a ClearRange mutation), while the mvccStorageBytes calculation charges for two insertions.
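For illustration, a more exact charge might look something like the sketch below; actualMvccStorageBytes and nodesAllocated are hypothetical names I’m using here, not existing FoundationDB code.

```cpp
// Hypothetical sketch, not existing FDB code: charge only the PTree nodes that
// were actually allocated while applying the mutation, rather than a fixed
// 2 insertions x 4 nodes x 128 bytes.
int actualMvccStorageBytes( MutationRef const& m, int nodesAllocated ) {
	return nodesAllocated * 128 +                                                 // actual PTree node memory
	       (MutationRef::OVERHEAD_BYTES + m.param1.size() + m.param2.size()) * 2; // mutation bytes, doubled as today
}
```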
I ran some tests tracking the number of PTree node allocations in the storage queue under a variety of workloads, and the memory actually allocated was significantly less than what bytesInput reports. Is this a correct interpretation of how bytesInput is calculated, and if so, would it be safe to switch to reporting a more exact storage queue size?
I haven’t looked into the specifics of the mvccStorageBytes accounting, but I can say that bytesInput accounts for the memory used by each mutation as well as some extra overhead per version. You should be able to make the per-version overhead relatively insignificant in a test by having each commit contain many mutations; just make sure you account for it in your tests.
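As a rough illustration of that amortization (the numbers below are made up; the real per-version overhead constant may differ):

```cpp
#include <cstdint>
#include <cstdio>

// Made-up numbers purely to illustrate how per-version overhead amortizes as
// commits get larger; the actual FoundationDB constants may differ.
int main() {
	const int64_t perMutationCharge  = 8 * 128 + 2 * 64; // pessimistic charge for a ~64-byte mutation
	const int64_t perVersionOverhead = 1000;             // placeholder per-version overhead

	for (int64_t mutationsPerCommit : { 1, 10, 100, 1000 }) {
		printf("%4lld mutations/commit: +%lld bytes of version overhead per mutation (vs. %lld charged)\n",
		       (long long)mutationsPerCommit,
		       (long long)(perVersionOverhead / mutationsPerCommit),
		       (long long)perMutationCharge);
	}
	return 0;
}
```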
If you can provide a tighter upper bound on the actual memory being used that is cheap to compute, that sounds like a good change. I do recommend keeping it an upper bound, though, because otherwise you may find that the storage server behaves poorly under some types of workloads.
If you do make a change like that while keeping the same queue size limits, it would be somewhat similar to increasing the size of the queue in the current implementation. I don’t have much experience running larger storage queues, but I’m not aware of any reason why that would cause problems. If you wanted, you could experiment with it and make sure everything still performs well when the queues are full.
Thank you. The tests we ran still accounted for the version overhead and only changed how we track the number of versioned map nodes allocated. We will make sure that we continue to track an upper bound on the storage queue size and never underreport it.