StorageServerFailed with unknown_error after 6.2.30 upgrade

Hi, I upgraded one of my large FoundationDB clusters from 6.2.25 to 6.2.30 yesterday. Around 12 hours after the upgrade, one of the storage fdbservers failed with:

"messages" : [
                    {
                        "description" : "StorageServerFailed: unknown_error at Tue Oct 19 07:17:21 2021",
                        "name" : "unknown_error",
                        "raw_log_message" : "\"Severity\"=\"40\", \"Time\"=\"1634627841.978797\", \"DateTime\"=\"2021-10-19T07:17:21Z\", \"Type\"=\"StorageServerFailed\", \"ID\"=\"666a86a655a76bbc\", \"Error\"=\"unknown_error\", \"ErrorDescription\"=\"An unknown error occurred\", \"ErrorCode\"=\"4000\", \"Reason\"=\"Error\", \"Backtrace\"=\"addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x1078f9a 0x107996d 0x1079ab3 0x105d2ae 0x105d52e 0x6c1ec9 0x1074fd7 0x107591e 0xb42558 0xb4277e 0x6c1ec9 0x106d7a3 0x106d81f 0x106d96c 0x8018f0 0xb962d3 0xb96767 0xb96e61 0x6c2728 0x1a22500 0x1a1f167 0x1a1f6b0 0xa59580 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b\", \"Machine\"=\"<IP>:4613\", \"LogGroup\"=\"default\", \"Roles\"=\"SS\"",
                        "time" : 1634630000,
                        "type" : "StorageServerFailed"
                    }
                ],

Ran addr2line against the fdbserver 6.2.30 debug symbols:

root@3471274ba32b:/src/storage# addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x1078f9a 0x107996d 0x1079ab3 0x105d2ae 0x105d52e 0x6c1ec9 0x1074fd7 0x107591e 0xb42558 0xb4277e 0x6c1ec9 0x106d7a3 0x106d81f 0x106d96c 0x8018f0 0xb962d3 0xb96767 0xb96e61 0x6c2728 0x1a22500 0x1a1f167 0x1a1f6b0 0xa59580 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b
TraceEvent::backtrace(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /opt/foundation/foundationdb/flow/Trace.cpp:1020
TraceEvent::log() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /opt/foundation/foundationdb/flow/Trace.cpp:1037
TraceEvent::~TraceEvent() at /opt/foundation/foundationdb/./flow/Trace.h:60
 (inlined by) TraceEvent::~TraceEvent() at /opt/foundation/foundationdb/flow/Trace.cpp:1069
endRole(Role const&, UID, std::string, bool, Error) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) endRole(Role const&, UID, std::string, bool, Error) at /opt/foundation/foundationdb/fdbserver/worker.actor.cpp:656
(anonymous namespace)::WorkerHandleErrorsActorState<(anonymous namespace)::WorkerHandleErrorsActor>::a_body1loopBody1when1(ErrorInfo const&, int) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) a_body1loopBody1when1 at /opt/foundation/foundationdb/fdbserver/worker.actor.cpp:186
ActorSingleCallback<(anonymous namespace)::WorkerHandleErrorsActor, 0, ErrorInfo>::fire(ErrorInfo const&) at /opt/foundation/foundationdb/./flow/flow.h:1010
(anonymous namespace)::ForwardErrorActorState<(anonymous namespace)::ForwardErrorActor>::a_body1Catch2(Error const&, int) [clone .isra.1783] at /opt/foundation/foundationdb/fdbserver/worker.actor.cpp:139
ActorCallback<(anonymous namespace)::ForwardErrorActor, 0, Void>::error(Error) at /opt/foundation/foundationdb/./flow/flow.h:1003
Killed

This doesn’t seem to say much about the error itself, only that an error was thrown and logged. I checked the trace logs; below are the events with severity > 10. I filtered out some SlowTask/Net2SlowTaskTrace events that didn’t seem helpful, as indicated by [...], but I’m happy to post them too. There were no severity > 10 events at all before the SystemError line, and the log file started about six minutes before the SystemError was hit.

<Event Severity="40" Time="1634627839.189913" DateTime="2021-10-19T07:17:19Z" Type="SystemError" ID="0000000000000000" Error="unknown_error" ErrorDescription="An unknown error occurred" ErrorCode="4000" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x19a3203 0x5103f5 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1634627839.189913" DateTime="2021-10-19T07:17:19Z" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="2192.39" Duration="0.876958" SampleRate="1" NumYields="0" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1634627839.189913" DateTime="2021-10-19T07:17:19Z" Type="Net2SlowTaskTrace" ID="0000000000000000" TraceTime="1634627839.414140" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fa4d8999730 0xd3d5fb 0xd3d624 0xd3d60b 0xd3d60b 0xd3d624 0xd3d624 0xd3d60b 0xd3d60b 0xd3d60b 0xd3d624 0xd3d624 0xd3d60b 0xd3d60b 0xd3d624 0xd3d624 0xd3d60b 0xd3d624 0xd3d624 0xd3d624 0xd3d624 0xd3d60b 0xd3d624 0xd3d60b 0x1005499 0xfbf610 0xfc90dc 0xfc957e 0x6c1ec9 0xfab332 0x1a2150a 0x1a1eeef 0x1a1f53e 0x1a1f7f2 0x6c1ec9 0x510401 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b 0x6b2385" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
[...]
<Event Severity="20" Time="1634627840.073183" DateTime="2021-10-19T07:17:20Z" Type="FailureMonitorClientSlow" ID="0000000000000000" Elapsed="0.959561" Expected="0.1" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
[...]
<Event Severity="20" Time="1634627841.977139" DateTime="2021-10-19T07:17:21Z" Type="FailureMonitorClientSlow" ID="0000000000000000" Elapsed="1.90303" Expected="0.1" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1634627841.978797" DateTime="2021-10-19T07:17:21Z" Type="StorageServerFailed" ID="666a86a655a76bbc" Error="unknown_error" ErrorDescription="An unknown error occurred" ErrorCode="4000" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x1078f9a 0x107996d 0x1079ab3 0x105d2ae 0x105d52e 0x6c1ec9 0x1074fd7 0x107591e 0xb42558 0xb4277e 0x6c1ec9 0x106d7a3 0x106d81f 0x106d96c 0x8018f0 0xb962d3 0xb96767 0xb96e61 0x6c2728 0x1a22500 0x1a1f167 0x1a1f6b0 0xa59580 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1634628019.851092" DateTime="2021-10-19T07:20:19Z" Type="N2_ReadError" ID="fb1fb0aa9e238539" SuppressedEventCount="1" ErrorCode="1" Message="stream truncated" Machine="<IP>:4613" LogGroup="default" Roles="SS" />
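For anyone following along: the Backtrace attribute in these events is already a complete addr2line command line, so pulling it out of a trace file can be scripted. This is a hypothetical helper (the function name and stub event are mine), assuming the XML trace format shown above:

```shell
# Hypothetical helper: extract the Backtrace attribute from trace-event
# lines so it can be piped to a shell running against the debug binary.
extract_backtrace() {
  sed -n 's/.*Backtrace="\([^"]*\)".*/\1/p'
}

# Demo on a stub event line (addresses shortened for brevity):
line='<Event Severity="40" Type="SystemError" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40" />'
echo "$line" | extract_backtrace
# → addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40
```

Running `extract_backtrace < trace.*.xml | sh` inside a container that has the matching fdbserver.debug would symbolicate every logged backtrace in one pass.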

I ran addr2line on the initial SystemError backtrace as well:

root@3471274ba32b:/src/storage# addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x19a3203 0x5103f5 0x8018f0 0x1a2b270 0x6784f9 0x7fa4d865d09b
TraceEvent::backtrace(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5967
 (inlined by) TraceEvent::backtrace(std::string const&) at /opt/foundation/foundationdb/flow/Trace.cpp:1020
TraceEvent::log() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3322
 (inlined by) ?? at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) TraceEvent::log() at /opt/foundation/foundationdb/flow/Trace.cpp:1037
TraceEvent::~TraceEvent() at /opt/foundation/foundationdb/./flow/Trace.h:60
 (inlined by) TraceEvent::~TraceEvent() at /opt/foundation/foundationdb/flow/Trace.cpp:1069
Error::Error(int) at /opt/foundation/foundationdb/flow/Error.cpp:67
ActorCallback<(anonymous namespace)::ReportStorageServerStateActor, 0, Void>::fire(Void const&) [clone .cold.21771] at /opt/foundation/foundationdb/.objs/fdbserver/storageserver.actor.g.cpp:13612
 (inlined by) ?? at /opt/foundation/foundationdb/.objs/fdbserver/storageserver.actor.g.cpp:13690
 (inlined by) fire at /opt/foundation/foundationdb/./flow/flow.h:999
Killed

I looked at the 6.2.30 code, and this backtrace also doesn’t tell us much beyond that an error was thrown and logged.

I bounced the fdbserver (killed the process manually; fdbmonitor spun it back up) and it was fine after that. But since it had failed and not come back up on its own, the cluster had started a large data distribution job to restore the replication factor, and the resulting I/O load caused the cluster to become batch performance limited.

It doesn’t seem like the cluster was under any unusual load (or other abnormal workload) before the fdbserver went down, and I don’t think it hit any system or process limits. Along with the 6.2.30 upgrade we also upgraded the base Docker image from debian:9.13-slim to debian:10.10-slim, but I wouldn’t expect that to be related(?)

I looked into unknown_error / SystemError in the source, which didn’t say much either:

// 4xxx Internal errors (those that should be generated only by bugs) are decimal 4xxx
ERROR( unknown_error, 4000, "An unknown error occurred" )  // C++ exception not of type Error
ERROR( internal_error, 4100, "An internal error occurred" )

(foundationdb/error_definitions.h at 6.2.30 · apple/foundationdb · GitHub)

if (error_code >= 3000 && error_code < 6000) {
	TraceEvent(SevError, "SystemError").error(*this).backtrace();
	if (g_crashOnError) {
		flushOutputStreams();
		flushTraceFileVoid();
		crashAndDie();
	}
}

(foundationdb/Error.cpp at 6.2.30 · apple/foundationdb · GitHub)

Any tips on how to debug this further, especially whether it’s related to the 6.2.30 upgrade? Were there any compatibility changes between 6.2.25 and 6.2.30? We’re holding off on upgrading our other production clusters as a result.

Thanks in advance! Happy to provide any other info.

This is probably the same as the bug fixed in 6.2.33; if I’m reading correctly, it was introduced in 6.2.28. You could try 6.2.27 instead to avoid the problem.


Ah yup, that looks like the issue we hit. We will downgrade. Thanks!