Hi everyone. I have run into a deadlock while writing a large amount of data to a v6.3.25 FoundationDB cluster.
While data is being transmitted to the cluster, a few storage fdbservers stop responding to any requests and are then marked as offline. So I checked the stack traces of these servers. Here is one of them:
Thread 4 (Thread 0x7f8d809a1700 (LWP 375283)):
#0 0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000001f90c9e in ThreadPool::Thread::run() ()
#2 0x0000000001f91579 in ThreadPool::start(void*) ()
#3 0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f8d7bfff700 (LWP 375289)):
#0 0x00007f8d80d7ee9d in nanosleep () from /lib64/libpthread.so.0
#1 0x0000000001f15085 in threadSleep(double) ()
#2 0x0000000001f15159 in checkThread(void*) ()
#3 0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f8d8275f700 (LWP 375366)):
#0 0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000076bef7 in etp_proc ()
#2 0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8d8274c0c0 (LWP 375262)):
#0 0x00007f8d80d7e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f8d80d79e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f8d80d79d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4 0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#5 0x00007f8d80f9b74d in ?? () from /lib64/libgcc_s.so.1
#6 0x00007f8d80f9bfe8 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7 0x00007f8d80ab6cd6 in backtrace () from /lib64/libc.so.6
#8 0x0000000001f13f32 in profileHandler(int) ()
#9 <signal handler called>
#10 0x00007f8d80d79d62 in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#12 0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#13 0x00007f8d80f9bc33 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#14 0x00007f8d81502c76 in __cxa_throw () from /lib64/libstdc++.so.6
#15 0x0000000001c246bb in RequestData<GetKeyValuesRequest>::checkAndProcessResult(bool) ()
#16 0x0000000001bddf9b in (anonymous namespace)::LoadBalanceActorState<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface>, (anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> > >::a_body1loopBody1loopBody2when1(ErrorOr<GetKeyValuesReply> const&, int) ()
#17 0x0000000001bde1d4 in ActorCallback<(anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> >, 6, ErrorOr<GetKeyValuesReply> >::fire(ErrorOr<GetKeyValuesReply> const&) ()
#18 0x00000000014e77f0 in ActorCallback<(anonymous namespace)::WaitValueOrSignalActor<GetKeyValuesReply>, 0, GetKeyValuesReply>::fire(GetKeyValuesReply const&) ()
#19 0x00000000007cac78 in NetSAV<GetKeyValuesReply>::receive(ArenaObjectReader&) ()
#20 0x0000000001dbbe05 in (anonymous namespace)::DeliverActorState<(anonymous namespace)::DeliverActor>::a_body1cont1(int) [clone .isra.976] ()
#21 0x0000000001dbc13b in ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) ()
#22 0x0000000000818770 in void SAV<Void>::send<Void>(Void&&) ()
#23 0x0000000001eeb81e in N2::Net2::run() ()
#24 0x0000000000783a8f in main ()
As we can see, C++ exception handling is interrupted by a signal. The profileHandler frame indicates this signal is SIGPROF, raised by the Profiler Thread. Both the exception-unwinding path (frames #10-#14) and the signal handler's backtrace() call (frames #0-#8) lock the same non-reentrant mutex inside _Unwind_Find_FDE while walking the stack. Since the thread already holds that mutex when the signal arrives, it blocks on a lock it can never release, and the fdbserver deadlocks.
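For illustration, here is a minimal standalone sketch of the hazardous pattern (this is not FoundationDB code; the handler name and timer setup are just placeholders mirroring the trace): a SIGPROF handler calls glibc backtrace() while the interrupted thread is in the middle of throwing a C++ exception, and both paths funnel through _Unwind_Find_FDE, which guards its FDE lookup with a plain non-recursive mutex.

```cpp
// Minimal sketch of the self-deadlock pattern seen in the stack trace above.
// Not FoundationDB code; names are illustrative only.
#include <csignal>
#include <stdexcept>
#include <sys/time.h>
#include <execinfo.h>

static void profileHandler(int) {
    // backtrace() walks the stack via _Unwind_Backtrace -> _Unwind_Find_FDE,
    // taking the same internal libgcc mutex used by __cxa_throw's unwinder.
    // backtrace() is not async-signal-safe, so this is already undefined
    // behaviour in a signal handler.
    void* frames[64];
    backtrace(frames, 64);
}

int main() {
    // Deliver SIGPROF frequently, like a sampling profiler would.
    std::signal(SIGPROF, profileHandler);
    itimerval timer{};
    timer.it_interval.tv_usec = 1000;  // every ~1 ms
    timer.it_value.tv_usec = 1000;
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Keep throwing and catching: each throw runs _Unwind_RaiseException,
    // which locks the FDE mutex. If SIGPROF lands inside that critical
    // section, the handler's backtrace() blocks forever on the held mutex.
    for (;;) {
        try {
            throw std::runtime_error("probe");
        } catch (const std::runtime_error&) {
        }
    }
}
```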
My questions are:
(1) For now, I simply do not set up the profiler thread, which avoids this case. Is there any other potential problem if I disable the Profiler Thread?
(2) In FoundationDB 7.x, there is a better backtrace implemented with absl::GetStackTrace. However, profileHandler still invokes the original glibc backtrace, so the same deadlock may still occur there. Is there any plan to fix this invocation? And is there any possibility of backporting the better backtrace implementation to FoundationDB 6.x?
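For reference, here is a rough sketch (my own illustration, assuming Abseil's absl/debugging/stacktrace.h is available; this is not the actual 7.x profiler code) of what a signal-safe sampling handler could look like. absl::GetStackTrace walks frames without going through the libgcc unwinder's FDE mutex, so it cannot deadlock against a concurrent __cxa_throw.

```cpp
// Sketch only: a SIGPROF handler that records raw frame pointers with
// absl::GetStackTrace instead of glibc backtrace(). Buffer layout and
// names are hypothetical.
#include <atomic>
#include <cstddef>
#include "absl/debugging/stacktrace.h"

constexpr int kMaxFrames = 64;
constexpr std::size_t kMaxSamples = 1024;

struct Sample {
    void* frames[kMaxFrames];
    int depth = 0;
};

// Ring buffer written by the signal handler and drained elsewhere.
static Sample g_samples[kMaxSamples];
static std::atomic<std::size_t> g_next{0};

extern "C" void profileHandler(int /*signo*/) {
    std::size_t slot =
        g_next.fetch_add(1, std::memory_order_relaxed) % kMaxSamples;
    // absl::GetStackTrace does not take the libgcc FDE mutex, so it is safe
    // even if the interrupted thread is inside _Unwind_RaiseException.
    // skip_count = 1 drops this handler frame itself.
    g_samples[slot].depth =
        absl::GetStackTrace(g_samples[slot].frames, kMaxFrames, /*skip_count=*/1);
    // Symbolization (e.g. absl::Symbolize) is deferred to a non-signal context.
}
```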