Internal_error in KeyValueStoreSQLite.actor.cpp

Hi FDB Team,

Today a storage server in one of our FDB clusters running 6.3.25 with the SQLite storage engine reported an internal error. Looking at KeyValueStoreSQLite.actor.cpp, it appears the code was expecting more than one fragment during a cursor getNext() call, but the root cause of this error is not clear to us.

We would really appreciate it if you could help us understand what may cause this error.

fdb status:

  10.0.46.245:4501       (  6% cpu;  7% machine; 0.000 Gbps;  6% disk IO; 4.3 GB / 8.0 GB RAM  )
    Last logged error: StorageServerFailed: internal_error at Fri Mar 17 07:35:52 2023

fdbmonitor log:

Time="1678695262.344335" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Started FoundationDB Process Monitor 6.3 (v6.3.25)
Time="1678695262.344583" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1678695262.344598" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (2)
Time="1678695262.344618" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1678695262.344818" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Starting fdbserver.1
Time="1678695262.345167" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": Launching /usr/bin/fdbserver (11) for fdbserver.1
Time="1678695262.417711" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": FDBD joined cluster.
Time="1678695263.968410" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Could not remove inotify conf file watch, continuing...
Time="1678695263.968461" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1678695263.968474" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (4)
Time="1678695263.968483" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1678695263.968617" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Updated configuration for fdbserver.1
Time="1679038552.040874" Severity="40" LogGroup="foundationdb-cluster" Process="fdbserver.1": Assertion fragments != 1 failed @ /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp 1076:
Time="1679038552.040915" Severity="40" LogGroup="foundationdb-cluster" Process="fdbserver.1":   addr2line -e fdbserver.debug -p -C -f -i 0x25aef74 0xcc890d 0xcc91fc 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7f287b896190

error stacktrace:

addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x25aef74 0xcc890d 0xcc91fc 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7f287b896190
internal_error_impl(char const*, char const*, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Error.cpp:66
RawCursor::DefragmentingReader::getNext() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1076
RawCursor::getRange(KeyRangeRef, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1173
KeyValueStoreSQLite::Reader::action(KeyValueStoreSQLite::Reader::ReadRangeAction&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1698
 (inlined by) TypedAction<KeyValueStoreSQLite::Reader, KeyValueStoreSQLite::Reader::ReadRangeAction>::operator()(IThreadPoolReceiver*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/IThreadPool.h:76
yield(TaskPriority) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1043
 (inlined by) WorkPool<Coroutine, ThreadUnsafeSpinLock, true>::Worker::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:145
Coroutine::wrapRun() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:82
 (inlined by) Coroutine::entry(void*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:86
Coro_StartWithArg at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/libcoroutine/Coro.c:250
?? ??:0

Looks like this is an assertion failure at KeyValueStoreSQLite.actor.cpp:1076. I’ve never seen this before and am not sure what’s wrong. It’s safe to exclude this server and let DD rebalance data off this storage server to solve the problem.
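
For example, via fdbcli, something along these lines (the address is from your status output; adjust the cluster file path for your deployment):

fdbcli -C /var/dynamic-conf/fdb.cluster --exec 'exclude 10.0.46.245:4501'

Once the exclude command reports that the exclusion is complete, it is safe to shut the process down and remove its data files.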

Thank you for the response.
Excluding the node was precisely our approach, but we have now seen this error on two nodes running on different hosts and different disks, so we would like to continue with the root cause analysis.
The nodes are running version 6.3.25 in a managed Kubernetes cluster with managed disks.
The only pattern we have been able to find so far is that both hosts had been running for more than 300 days and are under heavy read and write load.

We would really appreciate it if you could help us with the root cause analysis.

Maybe unrelated, but in another cluster with the same pattern we have seen 4 failures in the last week. These errors point to SQLite database corruption instead. The errors are different, but we observe the same pattern: same k8s provider, disks older than 300 days, and heavy traffic. If you think this should be discussed in a separate thread, let me know and I will gladly open a new one.

It is not clear to us whether these are caused by disk/filesystem errors leading to database corruption or by SQLite bugs. Since we are using managed disks on a cloud provider, we have limited visibility into the storage layer. Because FDB uses a non-vanilla SQLite, we would appreciate it if you could let us know how to troubleshoot the following errors on the SQLite side of things.

This error happened twice, on two different storage servers (SS) in the same cluster:

Time="1679109778.883363" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": database corruption line 5463
$ addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x26440fc 0x2645120 0x26454b1 0x1503171 0x150396d 0x1503ab5 0x14f00ea 0x14f03c2 0x82bdc9 0x14fbe27 0x14fc7d5 0xa40a10 0xa40ce2 0x82bdc9 0x14f0c40 0x14f0dec 0xa40a10 0xa40ce2 0x82bdc9 0xb4a0bf 0xc47a60 0x25ecee7 0x7cf95d 0x7f9a9a77a555
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1235
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1254
TraceEventFields::~TraceEventFields() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.h:70
 (inlined by) TraceEvent::~TraceEvent() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1292

This error was also reported twice, by two different SS:

Time="1679110567.026536" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": database corruption line 4002
$ addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x26440fc 0x2645120 0x26454b1 0xcc57e6 0xcc6191 0xcc632c 0xcc8ab1 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7fcf2f748190
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1235
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1254
TraceEventFields::~TraceEventFields() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.h:70
 (inlined by) TraceEvent::~TraceEvent() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1292
SQLiteDB::checkError(char const*, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:289
RawCursor::getEncodedRow(Arena&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:762
RawCursor::DefragmentingReader::parse() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:991
RawCursor::getRange(KeyRangeRef, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1156
KeyValueStoreSQLite::Reader::action(KeyValueStoreSQLite::Reader::ReadRangeAction&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1698
 (inlined by) TypedAction<KeyValueStoreSQLite::Reader, KeyValueStoreSQLite::Reader::ReadRangeAction>::operator()(IThreadPoolReceiver*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/IThreadPool.h:76
yield(TaskPriority) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1043
 (inlined by) WorkPool<Coroutine, ThreadUnsafeSpinLock, true>::Worker::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:145
Coroutine::wrapRun() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:82
 (inlined by) Coroutine::entry(void*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:86
Coro_StartWithArg at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/libcoroutine/Coro.c:250
?? ??:0

From the error SQLITE_CORRUPT_BKPT, this seems to indicate that data corruption was detected:

#define SQLITE_CORRUPT_BKPT sqlite3CorruptError(__LINE__)
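
For what it’s worth, the “line NNNN” in your traces is the __LINE__ baked into that macro; a tiny illustration of the pattern (not SQLite’s actual code):

// Minimal illustration of the SQLITE_CORRUPT_BKPT pattern: the logged line
// number is the source line of the internal consistency check that failed
// inside the bundled sqlite3 amalgamation, not a position in your data.
#include <cstdio>

static int corruptError(int lineno) {
    std::fprintf(stderr, "database corruption line %d\n", lineno);
    return 11; // SQLITE_CORRUPT
}
#define CORRUPT_BKPT corruptError(__LINE__)

static int readPage(bool looksSane) {
    if (!looksSane)
        return CORRUPT_BKPT; // logs the source line of this check
    return 0; // SQLITE_OK
}

So “line 5463” / “line 4002” identify which internal check failed inside FDB’s bundled SQLite source.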

Since you mentioned the disks are old and under heavy traffic, I’d suspect it’s probably a disk corruption problem (hardware errors?) and would just replace these disks (migrate to other storage servers by excluding them and adding new storage servers).

Do you have the consistencycheck role running? This role can detect data corruption by comparing data replicas at a regular interval.

Thank you. We are in the process of excluding old nodes. Hopefully that will make things better.

AFAIK consistency check is not currently supported by the operator. We will evaluate running it manually.
I will also open an issue in the operator repo to request the enhancement.

Again thank you very much for your time.

Hi

Unfortunately, after excluding the nodes running on older disks, we recently hit the same error (database corruption line 4002).

This is a relatively new disk that has been running for less than 2 months, and this is the third occurrence of the error within a week, all reported by different nodes running on different hosts. So a hardware malfunction is starting to look less likely; at least it doesn’t seem to be the only cause.

The consistencycheck documentation is not very informative. Could you provide some guidance on getting it up and running and on how to check the generated information?

If you have any other ideas, hypotheses, or strategies to progress the root cause analysis, we would really appreciate any help you could provide.

Thank you for your time and effort.

I found this issue and managed to run the process manually:
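
In short, the manual run boils down to starting an extra fdbserver process with the consistencycheck role pointed at the cluster file, something along these lines (our paths; exact flags may differ per version, see fdbserver --help):

/usr/bin/fdbserver -r consistencycheck -C /var/dynamic-conf/fdb.cluster --logdir /var/log/fdb-trace-logs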

It would be great if this info made it into the documentation :smiley:

Hi again,
The consistency check has been running for several hours now. It looks like it is still reading data, but it is hard to tell how much progress it has made.
:thinking: How does the check report a successful run?
The only activity I see so far is ConsistencyCheck_* events in the trace logs, but it is hard to tell from the logs alone whether the test has finished.

Thank you :bowing_man:

OK, I can see from the logs that the check finished.

<Event Severity="10" Time="1679284849.196455" DateTime="2023-03-20T04:00:49Z" Type="ConsistencyCheck_FinishedCheck" ID="0000000000000000" Repetitions="0" ThreadID="16350227908004932116" Machine="redacted" LogGroup="foundationdb-cluster" Roles="TS" />

My assumption is that it succeeded, since it logged the completion event and no failures were logged.

Based on this, I expect TestFailure Warning or Error events to be logged when the test fails:
ConsistencyCheck.actor.cpp#L143

If my understanding is not correct please correct me :sweat_smile:

If there is a better way of tracking completion and errors please let me know.

Thank you.

This is almost certainly caused by your disk breaking the disk durability contract.

Blocks on disk have checksums to prevent corruption, but this cannot protect against “lost writes”, where a page update that was fsync()'d is instead found at its previous state when later read back from disk. Checksums do not help here because the old state of the page had a valid checksum; the page was just never updated to the new content.

A lost write can cause B-tree invariant violations, which is the type of error you are seeing. A lost write can also cause stale data to be returned from the SQLite tree without it detecting an invariant violation internally. In the fragment error you saw, SQLite itself did not detect an issue and so it returned a bad result, but this broke an invariant in the FDB KV modeling layer on top of the SQLite tree, which adds record fragmenting to avoid a high slack space issue with SQLite’s structure. If the record had not been fragmented, the error would not have been detected by the KV modeling layer and a bad result would have been returned from the storage engine.
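
To make the fragment invariant concrete, here is a minimal, self-contained sketch (not FDB’s actual code) of why finding exactly one fragment for a key can only mean the underlying tree dropped a row:

// The KV modeling layer fragments some records (to avoid SQLite's high slack
// space issue); a fragmented value is always written as two or more fragment
// rows, and an unfragmented value as a single whole row. A reader that finds
// exactly one fragment for a key therefore knows a sibling row is missing and
// must fail loudly instead of returning a truncated value.
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct Row {
    std::string key;
    int fragmentIndex; // 0 = whole value, >0 = piece N of a fragmented value
    std::string bytes;
};

std::map<std::string, std::string> defragment(const std::vector<Row>& rows) {
    std::map<std::string, std::vector<const Row*>> byKey;
    for (const Row& r : rows)
        byKey[r.key].push_back(&r);

    std::map<std::string, std::string> result;
    for (auto& [key, pieces] : byKey) {
        int fragments = 0;
        std::string value;
        for (const Row* p : pieces) {
            if (p->fragmentIndex > 0)
                ++fragments;
            value += p->bytes;
        }
        // Mirrors the spirit of ASSERT(fragments != 1): a single orphaned
        // fragment means the B-tree returned an inconsistent set of rows.
        if (fragments == 1)
            throw std::runtime_error("internal_error: orphaned fragment for " + key);
        result[key] = value;
    }
    return result;
}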

There is an incredible amount of simulated and real-world testing and at-scale usage of the SQLite storage engine, and its code in FDB and the underlying SQLite code have not changed meaningfully in many years. A lost write can cause all of the issues you have seen, and it is the most likely cause.

The good news is that we’ve encountered badly behaving disks before, and so there is an FDB feature you can turn on which will prove that this is what is happening.

If you set this option in your FDB conf file, in the fdbserver section,

knob_page_write_checksum_history = 50000000

then FDB will track written block checksums within a process’s lifetime for the first 50,000,000 4k blocks of space in your data files, and use them to verify those blocks when they are read back from disk later after cache eviction. If a mismatch is found, it will log an AsyncFileLostWriteDetected event.

I suggest enabling this on your cluster and looking for these events. The memory overhead of this feature is approximately the history size setting * 8 bytes, so around 400MB of memory to track 50 million blocks, which would cover all blocks of a 200GB data file.
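
Conceptually (a sketch of the idea only, not the real implementation), the feature keeps an in-memory record of the checksum last written to each block and compares against it on read-back:

// Sketch of the lost-write detection behind knob_page_write_checksum_history:
// remember the checksum of each 4k block written during this process's
// lifetime, and compare when the block is read back after cache eviction.
// A mismatch means the device dropped a write it had already acknowledged.
#include <cstdint>
#include <vector>

class WriteChecksumHistory {
public:
    explicit WriteChecksumHistory(size_t maxBlocks)
        : history_(maxBlocks, kNoRecord) {} // ~8 bytes per tracked block

    void onWrite(size_t block, uint64_t checksum) {
        if (block < history_.size())
            history_[block] = checksum;
    }

    // False means a lost write was detected: the data on disk differs from
    // what was written. (This is where AsyncFileLostWriteDetected is logged.)
    bool onRead(size_t block, uint64_t checksum) const {
        if (block >= history_.size() || history_[block] == kNoRecord)
            return true; // block not tracked, or never written by this process
        return history_[block] == checksum;
    }

private:
    // Sentinel for "no write recorded yet"; a real implementation would not
    // reserve an otherwise-valid checksum value for this.
    static constexpr uint64_t kNoRecord = 0;
    std::vector<uint64_t> history_;
};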


Thank you for the thorough explanation :+1:

We have now rolled out the checksum history to a subset of our clusters, and we have seen an occurrence of AsyncFileLostWriteDetected. We are now investigating the issue together with our disk vendor.

When we do see these issues, is there any way to recover, other than wiping the disk completely and waiting for replication from other pods?

Is it relevant that we have changed the knob MAX_STORAGE_COMMIT_TIME from 120 to 3600 seconds?

No, and even wiping the disk does not guarantee recovery, because silently corrupted data may already have been read and replicated elsewhere as part of FDB data movement.

No.

I was wondering whether the storage fault model for the Redwood engine has been documented somewhere? I would be really interested to learn what types of storage faults Redwood might be able to recover from that SQLite might not.


Redwood’s corruption protection is essentially the same as the SQLite engine’s: pages on disk have an XXH3_64bits checksum, and the page’s location within the file is included as part of the checksum input. This means that:

  • A corrupted page should fail verification
  • If pages are swapped around somehow, such as by filesystem or disk metadata corruption, so that reading page X returns a page that was written as some other page Y, the read will fail verification

This scheme cannot protect against disk-contract-breaking lost writes (as in, omitting a write that was both write-ack’d and fsync-ack’d, leaving the prior block content unmodified), which are fundamentally hard to protect against for any data structure that individually reuses and updates pages.
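
A minimal sketch of that verification scheme (not Redwood’s actual code; a stand-in hash replaces XXH3_64bits to keep the example self-contained):

// The page's position in the file is mixed into the checksum input, so an
// intact page returned from the wrong offset still fails verification.
// A lost write is not caught: the previous contents of that offset still
// carry a valid checksum for that offset.
#include <cstdint>
#include <vector>

// Stand-in seeded 64-bit hash (FNV-1a); assume the real engine uses XXH3_64bits.
uint64_t hash64(const uint8_t* data, size_t len, uint64_t seed) {
    uint64_t h = 1469598103934665603ull ^ seed;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ull;
    }
    return h;
}

struct Page {
    std::vector<uint8_t> payload;
    uint64_t checksum = 0;
};

// On write: the checksum covers the payload plus the page's offset in the file.
void sealPage(Page& p, uint64_t pageOffset) {
    p.checksum = hash64(p.payload.data(), p.payload.size(), /*seed=*/pageOffset);
}

// On read: recompute using the offset that was requested; this fails both for
// a corrupted payload and for an intact page read back from the wrong place.
bool verifyPage(const Page& p, uint64_t pageOffset) {
    return hash64(p.payload.data(), p.payload.size(), pageOffset) == p.checksum;
}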
