internal_error in KeyValueStoreSQLite.actor.cpp

Hi FDB Team,

Today a storage server in one of our FDB clusters running 6.3.25 with the SQLite storage engine reported an internal error. Looking at KeyValueStoreSQLite.actor.cpp, the failed assertion fragments != 1 suggests the DefragmentingReader saw exactly one fragment during a getNext() call where it expected a different count. The root cause of this error is not clear to us.

We would really appreciate it if you could help us understand what may cause this error.

fdb status:

  10.0.46.245:4501       (  6% cpu;  7% machine; 0.000 Gbps;  6% disk IO; 4.3 GB / 8.0 GB RAM  )
    Last logged error: StorageServerFailed: internal_error at Fri Mar 17 07:35:52 2023

fdbmonitor log:

Time="1678695262.344335" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Started FoundationDB Process Monitor 6.3 (v6.3.25)
Time="1678695262.344583" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1678695262.344598" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (2)
Time="1678695262.344618" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1678695262.344818" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Starting fdbserver.1
Time="1678695262.345167" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": Launching /usr/bin/fdbserver (11) for fdbserver.1
Time="1678695262.417711" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": FDBD joined cluster.
Time="1678695263.968410" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Could not remove inotify conf file watch, continuing...
Time="1678695263.968461" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1678695263.968474" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (4)
Time="1678695263.968483" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1678695263.968617" Severity="10" LogGroup="foundationdb-cluster" Process="fdbmonitor": Updated configuration for fdbserver.1
Time="1679038552.040874" Severity="40" LogGroup="foundationdb-cluster" Process="fdbserver.1": Assertion fragments != 1 failed @ /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp 1076:
Time="1679038552.040915" Severity="40" LogGroup="foundationdb-cluster" Process="fdbserver.1":   addr2line -e fdbserver.debug -p -C -f -i 0x25aef74 0xcc890d 0xcc91fc 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7f287b896190

Error stack trace:

addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x25aef74 0xcc890d 0xcc91fc 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7f287b896190
internal_error_impl(char const*, char const*, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Error.cpp:66
RawCursor::DefragmentingReader::getNext() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1076
RawCursor::getRange(KeyRangeRef, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1173
KeyValueStoreSQLite::Reader::action(KeyValueStoreSQLite::Reader::ReadRangeAction&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1698
 (inlined by) TypedAction<KeyValueStoreSQLite::Reader, KeyValueStoreSQLite::Reader::ReadRangeAction>::operator()(IThreadPoolReceiver*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/IThreadPool.h:76
yield(TaskPriority) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1043
 (inlined by) WorkPool<Coroutine, ThreadUnsafeSpinLock, true>::Worker::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:145
Coroutine::wrapRun() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:82
 (inlined by) Coroutine::entry(void*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:86
Coro_StartWithArg at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/libcoroutine/Coro.c:250
?? ??:0

Looks like this is an assertion failure at KeyValueStoreSQLite.actor.cpp:1076. I’ve never seen this before and am not sure what’s wrong. It’s safe to exclude this server and let data distribution (DD) rebalance the data off this storage server to resolve the problem.
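
For reference, the exclusion can be driven from fdbcli; a minimal sketch, reusing the address from the status output above (adjust it to the affected process):

$ fdbcli --exec 'exclude 10.0.46.245:4501'   # mark the process excluded; DD moves its data to other storage servers
$ fdbcli --exec 'exclude'                    # with no arguments, list the currently excluded addresses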

Thank you for the response.
Excluding the node was precisely our approach, but we have now seen this error on two nodes running on different hosts with different disks, so we would like to continue with the root cause analysis.
The nodes are running version 6.3.25 in a managed Kubernetes cluster with managed disks.
The only pattern we have found so far is that both hosts had been running for more than 300 days and both carry heavy read and write load.

We would really appreciate it if you could help us with the root cause analysis.

Maybe unrelated, but in another cluster with the same pattern we have seen four failures in the last week. These errors point to SQLite database corruption instead. The errors are different, but we observe the same pattern: same Kubernetes provider, disks older than 300 days, and heavy traffic. If you think this should be discussed in a separate thread, let me know and I will gladly open a new one.

It is not clear to us whether these are caused by disk/filesystem errors leading to database corruption, or by SQLite bugs. Since we are using managed disks on a cloud provider, we have limited visibility into the storage layer. Because FDB uses a non-vanilla SQLite, we would appreciate it if you could let us know how to troubleshoot the following errors on the SQLite side of things.

This error happened twice, on two different storage servers (SS) in the same cluster:

Time="1679109778.883363" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": database corruption line 5463
$ addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x26440fc 0x2645120 0x26454b1 0x1503171 0x150396d 0x1503ab5 0x14f00ea 0x14f03c2 0x82bdc9 0x14fbe27 0x14fc7d5 0xa40a10 0xa40ce2 0x82bdc9 0x14f0c40 0x14f0dec 0xa40a10 0xa40ce2 0x82bdc9 0xb4a0bf 0xc47a60 0x25ecee7 0x7cf95d 0x7f9a9a77a555
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1235
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1254
TraceEventFields::~TraceEventFields() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.h:70
 (inlined by) TraceEvent::~TraceEvent() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1292

This error was also reported twice, by two different storage servers:

Time="1679110567.026536" Severity="10" LogGroup="foundationdb-cluster" Process="fdbserver.1": database corruption line 4002
$ addr2line -e fdbserver.debug.x86_64 -p -C -f -i 0x26440fc 0x2645120 0x26454b1 0xcc57e6 0xcc6191 0xcc632c 0xcc8ab1 0xcce387 0xb39b6c 0xb35e59 0x25a28a6 0x7fcf2f748190
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1235
std::string::_M_rep() const at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1254
TraceEventFields::~TraceEventFields() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.h:70
 (inlined by) TraceEvent::~TraceEvent() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1292
SQLiteDB::checkError(char const*, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:289
RawCursor::getEncodedRow(Arena&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:762
RawCursor::DefragmentingReader::parse() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:991
RawCursor::getRange(KeyRangeRef, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1156
KeyValueStoreSQLite::Reader::action(KeyValueStoreSQLite::Reader::ReadRangeAction&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1698
 (inlined by) TypedAction<KeyValueStoreSQLite::Reader, KeyValueStoreSQLite::Reader::ReadRangeAction>::operator()(IThreadPoolReceiver*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/IThreadPool.h:76
yield(TaskPriority) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1043
 (inlined by) WorkPool<Coroutine, ThreadUnsafeSpinLock, true>::Worker::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:145
Coroutine::wrapRun() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:82
 (inlined by) Coroutine::entry(void*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/CoroFlow.actor.cpp:86
Coro_StartWithArg at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/libcoroutine/Coro.c:250
?? ??:0

From the error SQLITE_CORRUPT_BKPT, this seems to indicate that data corruption was detected:

#define SQLITE_CORRUPT_BKPT sqlite3CorruptError(__LINE__)
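/* sqlite3CorruptError() records the SQLite source line at which the corruption was
   detected (the "line 5463" / "line 4002" numbers in the messages above) and
   returns SQLITE_CORRUPT. */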

Since you mentioned the disks are old and under heavy traffic, I’d suspect disk corruption problems (hardware errors?) and would just replace these disks (migrate the data to other storage servers by excluding the affected ones and adding new storage servers).

Do you have the consistencycheck role running? This role can detect data corruption by comparing data replicas at a regular interval.

Thank you. We are in the process of excluding old nodes. Hopefully that will make things better.

AFAIK consistency check is not currently supported by the operator. We will evaluate running it manually.
I will also open an issue in the operator repo to request the enhancement.

Again thank you very much for your time.

Hi

Unfortunately, after excluding the nodes running on older disks, we recently hit the same error (database corruption line 4002).

This is a relatively new disk that has been in use for less than two months, and this is the third occurrence of the error within a week, all reported by different nodes running on different hosts. So hardware malfunction is starting to look less likely; at least it does not seem to be the only cause.

The consistencycheck documentation is not very informative. Could you provide some guidance on getting it up and running and on how to check the generated information?

If you have any other ideas, hypotheses, or strategies to progress the root cause analysis, we would really appreciate any help you can provide.

Thank you for your time and effort.

I found this issue and managed to run the process manually:

It would be great if this info made it into the documentation :smiley:
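
In case it helps someone else landing on this thread, the rough shape of the manual run is an extra fdbserver process started with the consistencycheck role against the cluster file. The flags and paths below are illustrative of our setup and may differ per version and deployment, so check fdbserver --help for your build:

$ fdbserver -r consistencycheck -C /var/dynamic-conf/fdb.cluster -L /var/log/foundationdb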

Hi again,
The consistency check has been running for several hours now. It looks like it is still reading data, but it is hard to tell how much progress it has made.
:thinking: How does the check report a successful run?
The only activity I see so far is ConsistencyCheck_* events in the trace logs, but it is hard to tell from the logs alone whether the test has finished.
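
For now, a crude way to keep an eye on the activity is to scan the trace files for those events (the log directory below is illustrative):

$ grep -h 'Type="ConsistencyCheck_' /var/log/foundationdb/trace.*.xml | tail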

Thank you :bowing_man:

OK, I can see from the logs that the check finished.

<Event Severity="10" Time="1679284849.196455" DateTime="2023-03-20T04:00:49Z" Type="ConsistencyCheck_FinishedCheck" ID="0000000000000000" Repetitions="0" ThreadID="16350227908004932116" Machine="redacted" LogGroup="foundationdb-cluster" Roles="TS" />

My assumption is that it succeeded, since it logged the completion and no failures were logged.

Based on this, I expect TestFailure events with Warning or Error severity to be logged when the test fails:
ConsistencyCheck.actor.cpp#L143
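
So a rough way to watch for failures in the trace logs, besides the ConsistencyCheck_FinishedCheck event, would be something like the following (event names taken from the code linked above and the event earlier in this thread; the log directory is illustrative):

$ grep -h 'Type="TestFailure"' /var/log/foundationdb/trace.*.xml
$ grep -h 'ConsistencyCheck' /var/log/foundationdb/trace.*.xml | grep -v 'Severity="10"'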

If my understanding is not correct please correct me :sweat_smile:

If there is a better way of tracking completion and errors please let me know.

Thank you.