io_error on all FDB storage machines

Hi, I had an incident with one of my FDB clusters. The cluster has 3 machines, each running 5-6 processes, 3 of which are storage processes, and it’s on double replication. The storage servers came close to filling up, so we expanded our EBS drives and ran xfs_growfs. Then transactions started failing because they took over 5 seconds. status details showed that storage processes on all machines were failing with StorageServerFailed: io_error, and replication health reported UNHEALTHY: No replicas remain of some data. We tried restarting the whole cluster, and it would report as repairing or as Healthy (Removing storage server) for a while, until the storage processes once again began failing with io_error.

My understanding of why the SS fell behind is that once we hit an io_error, the storage server role was terminated. And since storage servers on all machines were failing, we lost all replicas of some of our data.

Looking through the codebase, it seems (?) that io_error is generally thrown due to either I/O performance issues or file corruption. I/O load wasn’t too high, and we had cut off incoming requests from the client application. We’ve expanded our drives many times before, and I find it unlikely that doing so would cause file corruption, much less on all machines at once. What could cause io_error on all machines?

io_error isn’t very informative, unfortunately. Usually the trace logs will contain something more specific. A good place to start would be searching for events with Severity >= 30, or grepping for UnixErrorCode.

Here’s an event from one of the trace logs on a process that has this error:

trace.10.67.246.24.4505.1598849143.xweXq9.0.1.xml:<Event Severity="40" Time="1598849497.011346" Type="StorageServerFailed" ID="630082c7067dbb30" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19938fc 0x1993218 0x19932e1 0x1039fe4 0x103a94d 0x103aa53 0x101e78e 0x101ea0e 0x6b5e79 0x10361a7 0x1036aee 0xb25d58 0xb25f7e 0x6b5e79 0x102e950 0x102eb22 0xb25d58 0xb25f7e 0x6b5e79 0xa5035f 0x7f2420 0x19d0da0 0x66d463 0x7fc2ab30c2e1" Machine="10.67.246.24:4505" LogGroup="default" Roles="SS" />

We’re running FDB 6.2.22.

So we found the error causing this:

<Event Severity="40" Time="1599095476.197744" Type="DiskError" ID="0000000000000000" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" In="BtreeInsert" File="/storage/var/data/4504/storage-d74abaec8af5c2bc78a1b321fa34987d.sqlite" SQLiteError="database or disk is full" SQLiteErrorCode="13" UnixErrorCode="b" UnixError="Resource temporarily unavailable" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19938fc 0x1993218 0x19932e1 0xb6d1fe 0xb6e54a 0xa4bfc2 0xa48d69 0x18a1126 0x7ff693bf5000" Machine="10.67.246.32:4504" LogGroup="default" Roles="SS,TL" />

It seems like ratekeeper isn’t detecting that we have <5% free space (which is correct, we have >15% free), but sqlite3 says it is out of space. It looks like BtreeInsert adds at most one key-value pair at a time. Does a sqlite BtreeInsert always add only the size of the value written? (I’m not familiar with sqlite, but it seems like anything bigger would mainly be for a vacuum operation, which doesn’t seem to happen on the BtreeInsert codepath.)

One theory we have is that we’re hitting a sqlite limit. The sqlite page size fdb uses is 4096, and I couldn’t find where we set the max page count, so I’m assuming we use the default of 1073741823. We have some massive storage nodes (we’re working on getting to a more distributed setup; we just want to root-cause this). Is there an inherent limit on the size of a single storage server? The sqlite file is bigger than 4096 * 1073741823 bytes. I’m not sure if there is some metadata overhead that doesn’t contribute to the page count but does contribute to file size. I couldn’t get the page count through PRAGMA page_count; because I would get the following error:

sqlite> PRAGMA page_count;
Error: file is encrypted or is not a database

I did this on a sqlite file on a functioning storage server, so it’s definitely not corrupted. I couldn’t find anything in fdb that encrypts the file either. Any thoughts?

Nice job investigating. Here’s the rest of the story:

BTreeInsert inserts a KV pair plus some overhead into the tree. If the KV pair does not fit in an existing page then the page is split into two and the new page comes from the free list. If the free list is empty, it extends the file to allocate new space. To make file extensions for SQLite’s purposes more efficient and avoid file extent fragmentation on disk, FoundationDB extends .sqlite files by a minimum chunk size which is currently 100MB. This is likely why your storage file is larger than 4096 * 1073741823 bytes.
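
If it helps to picture it, the chunked extension is essentially just rounding the requested file size up to the next chunk boundary; a rough sketch (illustration only, not the actual VFSAsync code):

#include <stdint.h>

/* Illustration of chunk-based extension: when the file needs to grow,
** round the new size up to the next 100MB chunk so the filesystem can
** allocate large contiguous extents instead of many tiny ones. */
#define EXTENSION_CHUNK (100ULL * 1024 * 1024)   /* 100MB minimum growth */

static uint64_t rounded_file_size(uint64_t requested_bytes) {
    uint64_t chunks = (requested_bytes + EXTENSION_CHUNK - 1) / EXTENSION_CHUNK;
    return chunks * EXTENSION_CHUNK;
}

/* e.g. growing an already-full file by a single 4096-byte page still
** extends it by a full 100MB chunk. */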

FoundationDB is not using vanilla SQLite, so you cannot use SQLite tools on its .sqlite files, and I’m sorry for how misleading that is.

FoundationDB does not change the default maximum page count.

Hey Steve, thanks for the info, this is super helpful for my understanding. Splitting the page and allocating a whole extra one makes sense, but that still caps a single insert at a bit over 110,000 bytes if you write the max value size and max key size, round up to the next page, and assume a bit less than 4096 bytes of application data fit on each page (I know there is some per-page header overhead, I’m just not sure of its size). So I’m guessing a single insert must be <120,000 bytes, and we have >2TB free. It also sounds like I can’t tell how much fdb has extended the file, and I can’t use sqlite tools to get the page count. Do you have any ideas how I can tell if I hit the max page count? Or any other ideas why we would be getting this disk-full error when we have >2TB available?
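
For concreteness, here’s the arithmetic I’m doing; the usable-bytes-per-page figure is just a rough guess, and 10,000 / 100,000 bytes are the documented FDB key and value size limits:

#include <stdio.h>

int main(void) {
    const int max_key = 10000;        /* FDB documented max key size, bytes */
    const int max_value = 100000;     /* FDB documented max value size, bytes */
    const int usable_per_page = 4000; /* rough guess: 4096 minus per-page overhead */
    int pages = (max_key + max_value + usable_per_page - 1) / usable_per_page;
    printf("worst-case single insert: ~%d pages, ~%d bytes of pages\n",
           pages, pages * 4096);      /* ~28 pages, ~115KB */
    return 0;
}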

BTW, for reference, we have 3 storage processes per 16TB volume. The sizes of the sqlite files that show these errors, in bytes, are:

fdb1 4505: 4963015065600
fdb1 4503: 4870530662400
fdb3 4504: 4816319283200
fdb2 4503: 4848195993600
fdb2 4505: 4865916928000
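
As a quick sanity check against the default limit discussed above (4096-byte pages times the default max page count of 1073741823, which is about 4.4TB), every one of these files is already past it:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t page_size = 4096;
    const uint64_t default_max_pages = 1073741823ULL;             /* SQLITE_MAX_PAGE_COUNT default */
    const uint64_t default_limit = page_size * default_max_pages; /* 4398046507008 bytes, ~4.4TB */
    const uint64_t files[] = {4963015065600ULL, 4870530662400ULL, 4816319283200ULL,
                              4848195993600ULL, 4865916928000ULL};
    for (int i = 0; i < 5; i++)
        printf("file %d: %llu pages, %llu bytes over the default limit\n", i,
               (unsigned long long)(files[i] / page_size),
               (unsigned long long)(files[i] - default_limit));
    return 0;
}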

What I’m saying is that any time the file grows, it grows by 100MB. Then it has lots of free space to fill up, and once that is used up it will grow again by another 100MB. So the file does not expand by only the number of pages needed for the new record.

I’m not sure where you got your quoted max page count figure, but according to
https://www.sqlite.org/fileformat.html
the maximum page number is 4294967294 (2^32 - 2), and the database header does not seem to store the max page count, though it could be stored elsewhere.

FoundationDB modifies the header string, since there are other changes to the format that would break external SQLite tools, but the database header is otherwise intact.
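
If you want to see that on one of your files, stock SQLite files begin with the 16 header bytes "SQLite format 3\0"; a throwaway check like this (not anything in FDB) will show whether the magic matches, which is exactly what the sqlite3 CLI is complaining about:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.sqlite\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    unsigned char hdr[16] = {0};
    size_t n = fread(hdr, 1, sizeof(hdr), f);
    fclose(f);
    /* Print the header bytes, replacing non-printable characters with '.'. */
    printf("header bytes: ");
    for (size_t i = 0; i < n; i++)
        putchar(hdr[i] >= 32 && hdr[i] < 127 ? hdr[i] : '.');
    printf("\n%s\n", n == 16 && memcmp(hdr, "SQLite format 3", 16) == 0
                         ? "stock SQLite header" : "modified/unknown header");
    return 0;
}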

What is your file size on disk, exactly?

4294967294 is the highest you can set the max page count to, but the default is 1073741823, and I don’t see us setting it anywhere in fdb’s usage of sqlite:

/*
** Maximum number of pages in one database file.
**
** This is really just the default value for the max_page_count pragma.
** This value can be lowered (or raised) at run-time using that the
** max_page_count macro.
*/
#ifndef SQLITE_MAX_PAGE_COUNT
# define SQLITE_MAX_PAGE_COUNT 1073741823
#endif

https://raw.githubusercontent.com/apple/foundationdb/master/fdbserver/sqlite/sqlite3.amalgamation.c

Jared’s comment above lists the file sizes, in bytes, of the 5 fdb sqlite files that we see the error from.

I’m not sure how your files got so large. I did find this in sqlite3.amalgamation.c

  /* If the current number of pages in the file is greater than the
  ** configured maximum pager number, increase the allowed limit so
  ** that the file can be read.
  */
  if( nPage>pPager->mxPgno ){
    pPager->mxPgno = (Pgno)nPage;
  }

where nPage comes from either the WAL file or, if that was not valid (which I think can happen if the WAL was fully checkpointed and truncated and there were no further commits before the most recent close), from the size of the .sqlite file. So initially pPager->mxPgno is set to the macro you found, but it can be bumped up automatically in this scenario. Perhaps this, combined with chunk-based extension and a shutdown after a checkpoint, allowed the files to gradually grow a little with each restart?

Are your files still growing at all? If you restart your fdbserver processes, do they grow then?

In any case, we should certainly set a higher max page size in KeyValueStoreSQLite.cpp during initialization, probably to a new server knob which defaults to maybe 8TB worth of pages.
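
Roughly along these lines; this is only a sketch, the knob name is made up, and whether a plain pragma is the right hook for how KeyValueStoreSQLite.cpp opens the database is an assumption:

#include <sqlite3.h>
#include <stdio.h>

/* Hypothetical knob: enough 4096-byte pages for ~8TB. */
#define SQLITE_MAX_PAGE_COUNT_KNOB (8ULL * 1024 * 1024 * 1024 * 1024 / 4096)

static int raise_max_page_count(sqlite3 *db) {
    char sql[64];
    snprintf(sql, sizeof(sql), "PRAGMA max_page_count=%llu;",
             (unsigned long long)SQLITE_MAX_PAGE_COUNT_KNOB);
    /* Standard SQLite pragma; the value is clamped to 4294967294 if set higher. */
    return sqlite3_exec(db, sql, NULL, NULL, NULL);
}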

Just restarted it and it doesn’t look like they’re growing. The file sizes are the same even after the storage server has failed

For the snippet you posted, my reading is that it’s hit from a bunch of places in the read path. Would it be possible that it is just there to ensure an existing sqlite file is readable? It feels appropriate that it throws SQLITE_FULL on write but not on read. Maybe that’s why we couldn’t find the max page count in the database header: it isn’t stored, and sqlite always assumes that the file was written while below whatever the max was?

I see this in sqlite3PagerAcquire:

    if( MEMDB || pPager->dbSize<pgno || noContent || !isOpen(pPager->fd) ){
      if( pgno>pPager->mxPgno ){
        rc = SQLITE_FULL;
        goto pager_acquire_err;
      }

I wonder if it sets the max page count to the last page in the sqlite file and then can’t do any more writes.

For the higher max page size in KeyValueStoreSQLite.cpp, do you mean page count, as in you’d want max page count * page size > 8TB? It would also be great if there were something that detected this case, like getting the page count and max page count in KeyValueStoreSQLite and throwing a more specific error when we’re on the last page.

If I didn’t hit the max page count, I really have no idea why I’m getting SQLITE_FULL.

Yes, what I was suggesting is that when opening a file larger than the default max page count, SQLite just increases the max page count. For example: if the default size limit were 110MB, and FDB is extending files by 100MB, then when SQLite first grows the file past 100MB via FDB’s filesystem adapter (VFSAsync), the file will be grown to 200MB. SQLite will still complain when it reaches 110MB, but upon the next open it will see that the file is 200MB, so as long as the WAL was logically empty (at a clean checkpoint) on shutdown, the max page count will be raised to the equivalent of 200MB. This will then allow more insertions because there will be free space in the file.

I suspect something like this has happened but I don’t quite see a solid path to extending the file repeatedly to end up with the files that you have.


An error type other than io_error, such as perhaps out_of_space, would be more user-friendly, but it may not be trivial, as I think there are some specific reactions to io_error which may need to check for this new type as well.

I created two issues for what we’ve figured out here.



Hey, we deployed a custom build that raised the max page count to the maximum value just to see if this was really the issue, and the cluster was able to start up. Previously it would die about 5 minutes after starting, but it’s now been up for 30 minutes. I’m also not seeing any io_error or BtreeInsert error logs like before. A few of the previously broken files have grown in size (it’s HEALING, so the other files will probably grow later as well). I’m still not sure how the files got so far past the max size, but it’s pretty safe to say the sqlite max page count was the issue here.

Thanks a ton for your help! In addition to the two issues you filed, it would also be great if this were mentioned on the Known Limitations page.