Hi, I had an incident with one of my fdb clusters. The cluster has 3 machines, each one running 5-6 processes, including 3 storage processes. It's on double replication. The storage servers came close to filling up, so we expanded our EBS drives and ran xfs_growfs. Then transactions started failing due to taking over 5s. status details showed that storage processes on all machines were failing with StorageServerFailed: io_error, and replication health reported UNHEALTHY: No replicas remain of some data. We tried restarting the whole cluster, and it would report as repairing or "Healthy (Removing storage server)" for a while, until storage processes once again began failing with io_error.
My understanding of the SS failures is that since we hit an io_error, the storage server role was terminated. And since storage servers on all machines were failing, we lost all replicas for some of our data.
Looking through the codebase it seems (?) that io_error being thrown is generally due to either IO performance or file corruption. IO load wasn’t too high, and we cut off incoming requests from the client application. We’ve expanded our drives many times and I find it unlikely that it would cause file corruption, much less on all machines. What could be the cause of hitting io_error on all machines?
io_error isn’t very informative unfortunately. Usually there will be something more informative in the trace logs. A good place to start would be searching for events with Severity >= 30, or grepping for UnixErrorCode.
It seems like ratekeeper isn’t detecting that we have <5% free space (correctly, since we actually have >15% free), but sqlite3 is reporting that it’s out of space. It looks like BtreeInsert only ever adds at most one key-value pair at a time. Does a sqlite BtreeInsert always grow the file by only about the size of the value written? (I’m not familiar with sqlite, but it seems like larger allocations would mainly be for a vacuum operation, which doesn’t seem to happen on the BtreeInsert codepath.)
One theory we have is that we’re hitting sqlite limits. The sqlite page size fdb uses is 4096, and I couldn’t find where we set the max page count, so I’m assuming we use the default of 1073741823. We have some massive storage nodes (we’re working on getting to a more distributed setup; we just want to root cause this). Is there an inherent limit to the size of a single storage server? The sqlite file is bigger than 4096 * 1073741823 bytes. I’m not sure if there is some metadata overhead that doesn’t contribute to page count but does contribute to file size. I couldn’t check the page count through PRAGMA page_count; because I would get the following error:
sqlite> PRAGMA page_count;
Error: file is encrypted or is not a database
I did this on a sqlite file on a functioning storage server, so it’s definitely not corrupted. I couldn’t find anything in fdb that encrypts the file either. Any thoughts?
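(For reference, the ceiling I’m comparing the file size against is just the product of the numbers above:

4,096 bytes/page * 1,073,741,823 pages = 4,398,046,507,008 bytes ≈ 4.0 TiB

and our .sqlite file is bigger than that.)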
Nice job investigating. Here’s the rest of the story:
BTreeInsert inserts a KV pair plus some overhead into the tree. If the KV pair does not fit in an existing page then the page is split into two and the new page comes from the free list. If the free list is empty, it extends the file to allocate new space. To make file extensions for SQLite’s purposes more efficient and avoid file extent fragmentation on disk, FoundationDB extends .sqlite files by a minimum chunk size which is currently 100MB. This is likely why your storage file is larger than 4096 * 1073741823 bytes.
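The rounding involved is roughly this (a minimal sketch of the arithmetic only; the real logic lives in FDB’s VFS layer, and I’m treating 100MB as 100 * 2^20 bytes here):

#include <cstdint>
#include <cstdio>

int main() {
    // Round a requested file size up to the next 100MB boundary; this is roughly
    // what chunk-based extension does when SQLite asks for one more page.
    const int64_t kChunk = 100LL * 1024 * 1024;        // 100MB extension granularity (units assumed)
    const int64_t kPage = 4096;                        // SQLite page size FDB uses
    int64_t requested = 4096LL * 1073741823 + kPage;   // one page past the default page-count limit
    int64_t extended = (requested + kChunk - 1) / kChunk * kChunk;
    printf("requested=%lld bytes, file extended to %lld bytes\n",
           (long long)requested, (long long)extended);
    return 0;
}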
FoundationDB is not using vanilla SQLite, so you cannot use SQLite tools on a .sqlite file and I’m sorry for how misleading that is.
FoundationDB does not change the default maximum page count.
Hey Steve, thanks for the info, this is super helpful for my understanding. Splitting the page and allocating a whole extra one makes sense, but that still limits the size of a single insert to a bit over 110,000 bytes: max value size plus max key size, rounded up to the next page, assuming a bit less than 4096 bytes of application data fit on each page (I know there is some per-page header, I’m just not sure of its size). So I’m guessing a single insert must be <120,000 bytes, and we have >2TB free. And it sounds like I can’t tell how much fdb has extended the file, and I also can’t use sqlite tools to get the page count. Do you have any ideas how I can tell if I hit the max page count? Or any other ideas why we would be getting this disk-full error when we have >2TB available?
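Spelling that estimate out (assuming roughly 4,000 usable bytes per 4,096-byte page after per-page overhead, and fdb’s documented limits of 10,000-byte keys and 100,000-byte values):

10,000 B key + 100,000 B value = 110,000 B of application data
110,000 B / ~4,000 usable B per page ≈ 27.5 → 28 pages
28 pages * 4,096 B = 114,688 B for a worst-case single insert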
What I’m saying is any time the file grows, it grows by 100MB. Then it has lots of free space to fill up, and if it fills that up it will grow again, by another 100MB. So the file will not expand by only the number of pages needed for the new record.
I’m not sure where you got your quoted max page count figure but according to https://www.sqlite.org/fileformat.html
the maximum page number is 4294967294 (2^32 - 2)
and the database header does not seem to store the max page count, though it could be stored elsewhere.
FoundationDB modifies the header string since there are other changes to the format that would break external SQLite tools, but the database header is otherwise intact.
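If you want to see that for yourself, you can dump the first 16 bytes of the file; vanilla SQLite expects the literal string "SQLite format 3\0" there per the file format doc above, and an unrecognized header string is what produces the "file is encrypted or is not a database" error. A quick sketch:

#include <cstdio>

int main(int argc, char** argv) {
    // Print the 16-byte header string of a database file. A vanilla SQLite database
    // has "SQLite format 3\0" here; FDB's fork writes a different string.
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char hdr[16] = {};
    fread(hdr, 1, sizeof(hdr), f);
    fclose(f);
    for (unsigned char c : hdr)
        printf(c >= 32 && c < 127 ? "%c" : "\\x%02x", c);
    printf("\n");
    return 0;
}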
4294967294 is the highest you can set for a max page count, but the default is 1073741823, and I don’t see us setting it anywhere for fdb’s usage of sqlite:
/*
** Maximum number of pages in one database file.
**
** This is really just the default value for the max_page_count pragma.
** This value can be lowered (or raised) at run-time using that the
** max_page_count macro.
*/
#ifndef SQLITE_MAX_PAGE_COUNT
# define SQLITE_MAX_PAGE_COUNT 1073741823
#endif
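For what it’s worth, the pragma that comment refers to is just run as a SQL statement against an open connection. Here is vanilla-SQLite usage for illustration ("example.sqlite" is a placeholder path; this of course won’t open fdb’s modified files):

#include <sqlite3.h>
#include <cstdio>

// Print each column of each result row (pragmas return their value as a row).
static int printRow(void*, int nCol, char** vals, char**) {
    for (int i = 0; i < nCol; i++) printf("%s\n", vals[i] ? vals[i] : "NULL");
    return 0;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("example.sqlite", &db) != SQLITE_OK) return 1;
    // Raise the per-connection limit, then read back the limit and the current page count.
    sqlite3_exec(db, "PRAGMA max_page_count = 4294967294;", printRow, nullptr, nullptr);
    sqlite3_exec(db, "PRAGMA max_page_count; PRAGMA page_count;", printRow, nullptr, nullptr);
    sqlite3_close(db);
    return 0;
}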
I’m not sure how your files got so large. I did find this in sqlite3.amalgamation.c
/* If the current number of pages in the file is greater than the
** configured maximum pager number, increase the allowed limit so
** that the file can be read.
*/
if( nPage>pPager->mxPgno ){
  pPager->mxPgno = (Pgno)nPage;
}
where nPage comes from the WAL file or, if that was not valid (which I think can happen if it was fully checkpointed and truncated and there were no further commits before the most recent close), from the size of the .sqlite file. So initially pPager->mxPgno is set to the macro you found, but it can be bumped up automatically in this scenario. Perhaps this, combined with chunk-based extension and a shutdown after checkpoint, allowed the files to gradually grow a little bit with each restart?
Are your files still growing at all? If you restart your fdbserver processes, do they grow then?
In any case, we should certainly set a higher max page size in KeyValueStoreSQLite.cpp during initialization, probably to a new server knob which defaults to maybe 8TB worth of pages.
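A minimal sketch of what that would amount to, assuming a hypothetical knob (the knob name and 8TB default below are illustrative, not actual FDB identifiers):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical knob: how many bytes of .sqlite file a storage server may address.
    const int64_t kSqliteFileMaxBytes = 8LL << 40;   // ~8TB worth of pages (assumed default)
    const int64_t kPageSize = 4096;                  // page size FDB already uses
    const int64_t kSqliteHardCap = 4294967294LL;     // largest max page count SQLite allows

    // What initialization could set instead of SQLite's default of 1073741823 pages (~4TB):
    int64_t maxPages = std::min(kSqliteFileMaxBytes / kPageSize, kSqliteHardCap);
    printf("max page count would be set to %lld\n", (long long)maxPages);  // 2147483648
    return 0;
}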
Just restarted it and it doesn’t look like they’re growing. The file sizes are the same even after the storage server has failed.
For the snippet you posted, my reading of it is that it’s in a bunch of places in the read path. Would it be possible that it is just there to ensure an existing sqlite file is readable? It feels appropriate that it throws SQLITE_FULL on write but not on read. Maybe that’s why we couldn’t find the max page count in the database header: it isn’t stored, and sqlite always assumes the file was written below whatever the max was?
I wonder if it sets the max page count to the last page in the sqlite file and then can’t do any more writes.
For higher max page size in KeyValueStoreSQLite.cpp, do you mean page count, as in you’d want max page count * max page size > 8TB? It would also be great if there was something that detected this case, like checking the page count against the max page count in KeyValueStoreSQLite and throwing a more specific error when we’re on the last page.
If I didn’t hit the max page count, I really have no idea why I’m getting SQLITE_FULL.
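To make that suggestion concrete, roughly the kind of check I mean (entirely hypothetical; the function, the error, and how the counts are obtained are made up for illustration):

#include <cstdint>
#include <stdexcept>

// Hypothetical check: compare pages in use against the configured limit and surface
// a dedicated error instead of a generic io_error / SQLITE_FULL. The counts could come
// from "PRAGMA page_count" / "PRAGMA max_page_count" in vanilla SQLite, or from the
// pager internals in fdb's fork.
void checkPageHeadroom(int64_t pageCount, int64_t maxPageCount) {
    const int64_t kHeadroomPages = (100LL << 20) / 4096;  // one 100MB extension chunk of pages
    if (maxPageCount - pageCount < kHeadroomPages) {
        // Placeholder; a real change would presumably add a specific error code.
        throw std::runtime_error("storage engine is near its sqlite max page count");
    }
}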
Yes, what I was suggesting is that when opening a file larger than the default max page count, it just increases the max page count. For example: if the default size limit were 110MB, and FDB is extending files by 100MB, then when SQLite first grows the file past 100MB via FDB’s filesystem adapter (VFSAsync), the file will be grown to 200MB. SQLite will still complain when it reaches 110MB, but then upon the next open it will see that the file is 200MB, so as long as the WAL was logically empty (at a clean checkpoint) on shutdown, the max page count will be raised to the equivalent of 200MB. This will then allow more insertions because there will be free space in the file.
I suspect something like this has happened but I don’t quite see a solid path to extending the file repeatedly to end up with the files that you have.
An error type other than io_error, such as perhaps out_of_space, would be more user-friendly, but it may not be trivial, as I think there are some specific reactions to io_error which may need to check for this new type as well.
I created two issues for what we’ve figured out here.
Hey, we deployed a custom build that raised the max page count to the max value just to see if this is really the issue, and the cluster was able to start up. Previously it would die about 5 minutes after starting up, but it’s been up for 30 minutes. I’m also not seeing any io_error or BtreeInsert error logs like before. A few of the previously broken files have grown in size (it’s HEALING, so the other files will probably grow in size as well later). I’m also not sure how the files got so far past the max size, but it’s pretty safe to say the sqlite max page count was the issue here.
Thanks a ton for your help! In addition to the two issues you filed it would also be great if this was mentioned on the Known Limitations page.