I have a 3 Node foundationdb test cluster running fdb 6.2.11 where I am observing fdbserver occasionally going Out Of Memory.
The backtrace during the crash looks likes below from 10.196.76.47
TraceEvent::backtrace(std::string const&) at ??:?
(inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.tcc:613
(inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:5958
(inlined by) TraceEvent::backtrace(std::string const&) at /opt/foundation/foundationdb/flow/Trace.cpp:938
TraceEvent::~TraceEvent() at ??:?
(inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3631
(inlined by) TraceEvent::~TraceEvent() at /opt/foundation/foundationdb/flow/Trace.cpp:947
criticalError at /opt/foundation/foundationdb/flow/Platform.cpp:2512
platform::outOfMemory() at /opt/foundation/foundationdb/flow/Platform.cpp:2501
operator new(unsigned long) at ??:?
ArenaBlock::addUsed(int) at /opt/foundation/foundationdb/./flow/FastRef.h:75
(inlined by) ArenaBlock::allocate(Reference<ArenaBlock>&, int) at /opt/foundation/foundationdb/./flow/Arena.h:216
(inlined by) operator new[](unsigned long, Arena&) at /opt/foundation/foundationdb/./flow/Arena.h:337
VectorRef<unsigned char, (VecSerStrategy)0>::reallocate(Arena&, int) at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) VectorRef<unsigned char, (VecSerStrategy)0>::reserve(Arena&, int) at /opt/foundation/foundationdb/./flow/Arena.h:1036
(inlined by) commitMessages(TLogData*, Reference<LogData>, long, std::vector<TagsAndMessage, std::allocator<TagsAndMessage> > const&) at /opt/foundation/foundationdb/fdbserver/TLogServer.actor.cpp:1091
Reference<LogData>::~Reference() at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) commitMessages(TLogData*, Reference<LogData>, long, Arena, StringRef) at /opt/foundation/foundationdb/fdbserver/TLogServer.actor.cpp:1171
Reference<LogData>::~Reference() at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) a_body1cont5 at /opt/foundation/foundationdb/fdbserver/TLogServer.actor.cpp:1766
(inlined by) a_body1cont3break1 at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8330
(inlined by) a_body1cont3loopBody1 at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8301
a_body1cont3loopHead1 at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) a_body1cont3 at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8159
(inlined by) a_body1cont1 at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8109
a_body1when1 at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) a_body1 at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8066
(inlined by) TLogCommitActor at /opt/foundation/foundationdb/.objs/fdbserver/TLogServer.actor.g.cpp:8494
(inlined by) tLogCommit(TLogData* const&, TLogCommitRequest const&, Reference<LogData> const&, PromiseStream<Void> const&) at /opt/foundation/foundationdb/fdbserver/TLogServer.actor.cpp:1722
(anonymous namespace)::ServeTLogInterfaceActorState<(anonymous namespace)::ServeTLogInterfaceActor>::a_body1loopBody1when4(TLogCommitRequest const&, int) at /opt/foundation/foundationdb/./flow/flow.h:529
ReplyPromise<TLogCommitReply>::~ReplyPromise() at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) TLogCommitRequest::~TLogCommitRequest() at /opt/foundation/foundationdb/./fdbserver/TLogInterface.h:232
(inlined by) fire at /opt/foundation/foundationdb/./flow/flow.h:1008
void NotifiedQueue<TLogCommitRequest>::send<TLogCommitRequest>(TLogCommitRequest&&) at /opt/boost_1_67_0/boost/variant/detail/forced_return.hpp:31
(inlined by) NetNotifiedQueue<TLogCommitRequest>::receive(ArenaObjectReader&) at /opt/foundation/foundationdb/./fdbrpc/fdbrpc.h:253
Reference<ArenaBlock>::~Reference() at /opt/foundation/foundationdb/./flow/flow.h:529
(inlined by) Arena::~Arena() at /opt/foundation/foundationdb/./flow/Arena.h:92
(inlined by) ArenaObjectReader::~ArenaObjectReader() at /opt/foundation/foundationdb/./flow/ObjectSerializer.h:117
(inlined by) a_body1cont1 at /opt/foundation/foundationdb/fdbrpc/FlowTransport.actor.cpp:651
ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) at /opt/foundation/foundationdb/./flow/flow.h:529
void SAV<Void>::send<Void>(Void&&) at /opt/boost_1_67_0/boost/variant/detail/forced_return.hpp:31
Promise<Void>::~Promise() at /opt/foundation/foundationdb/flow/TDMetric.actor.h:1269
(inlined by) N2::PromiseTask::~PromiseTask() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:475
(inlined by) N2::PromiseTask::operator()() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:482
(inlined by) N2::Net2::run() at /opt/foundation/foundationdb/flow/Net2.actor.cpp:657
main at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3346
The trace files of the fdbserver from 10.196.76.47 is here.
Cluster Details is like below
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 3
Desired Proxies - 1
Desired Logs - 2
Cluster:
FoundationDB processes - 6
Zones - 3
Machines - 3
Memory availability - 7.5 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - 1 machine
Server time - 04/20/20 14:53:54
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 448.312 GB
Disk space used - 1.594 TB
Operating space:
Storage server - 469.4 GB free on most full server
Log server - 333.7 GB free on most full server
Workload:
Read rate - 1861 Hz
Write rate - 578 Hz
Transactions started - 778 Hz
Transactions committed - 94 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.196.76.47:4500 ( 26% cpu; 14% machine; 0.062 Gbps; 29% disk IO; 6.9 GB / 11.0 GB RAM )
10.196.76.47:4501 ( 1% cpu; 14% machine; 0.062 Gbps; 29% disk IO; 0.2 GB / 11.0 GB RAM )
10.196.76.49:4500 ( 14% cpu; 15% machine; 0.067 Gbps; 22% disk IO; 6.8 GB / 7.5 GB RAM )
10.196.76.49:4501 ( 16% cpu; 15% machine; 0.067 Gbps; 18% disk IO; 0.4 GB / 7.5 GB RAM )
10.196.76.56:4500 ( 31% cpu; 18% machine; 0.065 Gbps; 39% disk IO; 6.8 GB / 10.4 GB RAM )
10.196.76.56:4501 ( 2% cpu; 18% machine; 0.065 Gbps; 44% disk IO; 0.2 GB / 10.4 GB RAM )
Coordination servers:
10.196.76.47:4500 (reachable)
10.196.76.49:4501 (reachable)
10.196.76.56:4500 (reachable)
Client time: 04/20/20 14:53:54
Just before the crash, I am observing few Slow Tasks like below
<Event Severity="30" Time="1587096940.191483" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="530.525" Duration="0.241148" SampleRate="1" NumYields="0" Machine="10.196.76.47:4500" LogGroup="default" Roles="CD,SS,TL" />
<Event Severity="30" Time="1587097227.580954" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="514.249" Duration="0.23375" SampleRate="1" NumYields="0" Machine="10.196.76.47:4500" LogGroup="default" Roles="CD,SS,TL" />
<Event Severity="30" Time="1587097279.286559" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="882.94" Duration="0.401337" SampleRate="1" NumYields="2" Machine="10.196.76.47:4500" LogGroup="default" Roles="CD,SS,TL" />
I am observing the similar crash in 10.196.76.49
server also. The trace file is here.
Can someone help me understand the backtrace above and what could be happening here?