FoundationDB

Shards are not splitted into smaller ones


(Daniil Gitelson) #1

I’m running FoundationDB on 3 machines with 5 processes per machine, SSD2 storage, with Fault Tolerance = 1. Nothing special in configuration.

The load is like this: two keys are updated, one is inserted, so database KV data grows due to inserts.

I would expect that as data grows FDB would split it’s storage-*.sqlite files, (according to docs, as I understand, they should not be greater than 500MB), but what I see, that each FDB process’s data just grows without really splitting: I had storage-*.sqlite files for ~13Gb and growing.

Is this expected behaviour? If so, how does it correlate with max shard size?


(A.J. Beamon) #2

All of the shards that a particular storage server is responsible for are stored in the same data files. For the sqlite engine, this means that you’ll have 1 .sqlite file and 1 .sqlite-wal file regardless of the number of shards that are assigned to that storage server.

I will note, however, that it is possible for one process to have multiple sets of storage server files for various reasons unrelated to the partitioning of key space. In that case, if the files are all valid, then there will be multiple storage servers running on the same process.


(Daniil Gitelson) #3

Thanx for the answer. My concern is about mass deletion of entries: I’m creating entries like 2018.10.09.xxx (prefixing key with current date) to be able delete all ‘old’ (let say older than 5 days) keys with single clearrange and I was hoping that it will be as fast as ‘truncate’ for rdbms, which will just drop entire partition.

Can you comment on effeciency of such range deletion?


(A.J. Beamon) #4

A range clear is typically a very quick operation to perform from a user perspective. In the ssd storage engine, it basically just detaches the affected data from the b-tree and defers the cleanup work. In the memory storage engine, it removes the data from the memory data structure and logs the mutation to disk with no deferred work.

The cleanup work in the ssd storage engine is actually a fairly slow process, and it consists of two phases. The first phase processes the detached pages and makes them available for reuse within the file. The second phase removes them from the file. As long as the first phase keeps up with your write load, then there shouldn’t be an issue. If it doesn’t, increasing the size of your cluster should help. See Used disk space dramatically increases while sum of key-value sizes is constant for some additional discussion on this.


(Daniil Gitelson) #5

Thanx for the explanation. Guy who asked that question is working right in front of me :slight_smile:
I was trying to elminate all that cleanup related tasks altogether by deleting big ranges at once (which would just delete files), but it seems impossible if shards are kept together in one file. Are there any plans at least to have an option to split shards into separate files?
This bothers me, because I store some user activity log which I need to keep ~half a year (so storage file going to be quite big), and then I can drop them. So I wanted to prefix each event with Year + Month and once a month drop all outdated events with a single delete and I wanted that operation to be as cheap as possible.

BTW, does this also mean that during repartitioning when shard is moved to another node storage file needs to be ‘cleaned up’ to be able to use space that was occupied by moved shard?


(A.J. Beamon) #6

On the surface, that sounds to me like it should work ok. Clearing a months worth of data can certainly result in a fair amount of deferred cleanup, but I would expect that the cleanup would take less time than a month. In other words, I think cleanup would be sufficiently fast to keep up with your write rate.

No plans for this specifically, but I think there is an interest in improving the speed of deferred cleanup work. I don’t know exactly what form that would take, though.

That’s correct, the same process is used to clean up the space evacuated by data movement.


(Sam Pullara) #7

I have been thinking about a meta-storage engine that would allow you to specify different storage engines for different ranges. I was thinking primarily that I would want that so I could have transactions that span a memory store and the SSD store but this might also be a use case for such a system.


(Christophe Chevalier) #8

There’s something I don’t understand here: What does “detach” and “cleanup” mean here in the context of the B-Tree?

In my implementation of ClearRange in a CoW B-Tree, the pages that are fully covered by the clear range don’t need any “cleaning” and can simply be returned to the free-list (since they can still be accessed by readers until commit, and would be cleared when reallocated later), and only the first/last leaf pages that would be partially covered would need to be mutated (keys removed or tombstoned). So for very large clear ranges, we would be be talking about thousands+ pages (probably contiguous) returned to the Free List at once, which can be very fast depending on how its implemented (poking some bits here and there is your favorite sparse bitmap encoding scheme).

The only thing that could require some async processing, would be moving pages around to reclaim the tail of a data file, or maybe try to defrag “hot” pages closer together after a burst of random writes/splits? Are we talking about this kind of operations?


(A.J. Beamon) #9

There are three different steps happening when you clear data from the B-tree.

The first step, which I described as detaching the data from the B-tree, is performed when the mutation is applied. I’m not super familiar with the code for this step, but I believe the idea is that it tries to grab the minimal number of pages that are needed to delete the data from the tree and store them elsewhere for later processing. In other words, a node whose entire sub-tree is going to be deleted only needs to touch the root node and not any of the children. You can see the code for this here: https://github.com/apple/foundationdb/blob/f418a85390734373b74f022ae6a61bed23eb92ee/fdbserver/sqlite/btree.c#L7201.

The remaining two steps are what I referred to when I mentioned deferred cleanup. The first of these is something we call lazy deletion, where we traverse the sub-trees removed in the first step and properly free each page. Among other things, this puts each page onto the free list and makes it available for reuse when new pages need to be allocated. See https://github.com/apple/foundationdb/blob/f418a85390734373b74f022ae6a61bed23eb92ee/fdbserver/sqlite/btree.c#L7073.

The last step is what we call vacuuming, and it’s doing what you describe; namely it fills in the holes in the file with pages from the end and then truncates off the empty pages. See https://github.com/apple/foundationdb/blob/f418a85390734373b74f022ae6a61bed23eb92ee/fdbserver/sqlite/btree.c#L3042.


(Senthil Ramamoorthy) #10

Could you detail the reason(s) why we end up with this situation, ie - multiple storage servers running on the same process?


(A.J. Beamon) #11

This probably won’t be an exhaustive list, but some ways this can happen:

  • Externally moving data files to the same directory
  • Switching storage engines (data files for both storage engines will be present until data from the old files has been re-replicated in the new files).
  • Some types of corruption (such as missing one of the files in the ssd storage engine) will still allow another storage server to be recruited on the same process. The first storage server should die though, meaning you’ll only have one active one in the process.
  • Some error conditions or races in the code have led to this behavior. We’ve usually tried to take steps to avoid it when we find ways it can happen, but I’m sure it’s still possible.