Deadlock caused by Profiler Thread in storage server

Hi everyone. I have run into a deadlock while writing a large amount of data to a v6.3.25 FoundationDB cluster.
While data is being transmitted to the cluster, a few storage fdbservers stop responding to any requests and are eventually considered offline. I checked the stack traces of these servers; here is one of them.

Thread 4 (Thread 0x7f8d809a1700 (LWP 375283)):
#0  0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001f90c9e in ThreadPool::Thread::run() ()
#2  0x0000000001f91579 in ThreadPool::start(void*) ()
#3  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f8d7bfff700 (LWP 375289)):
#0  0x00007f8d80d7ee9d in nanosleep () from /lib64/libpthread.so.0
#1  0x0000000001f15085 in threadSleep(double) ()
#2  0x0000000001f15159 in checkThread(void*) ()
#3  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f8d8275f700 (LWP 375366)):
#0  0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000076bef7 in etp_proc ()
#2  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8d8274c0c0 (LWP 375262)):
#0  0x00007f8d80d7e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8d80d79e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f8d80d79d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4  0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#5  0x00007f8d80f9b74d in ?? () from /lib64/libgcc_s.so.1
#6  0x00007f8d80f9bfe8 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7  0x00007f8d80ab6cd6 in backtrace () from /lib64/libc.so.6
#8  0x0000000001f13f32 in profileHandler(int) ()
#9  <signal handler called>
#10 0x00007f8d80d79d62 in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#12 0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#13 0x00007f8d80f9bc33 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#14 0x00007f8d81502c76 in __cxa_throw () from /lib64/libstdc++.so.6
#15 0x0000000001c246bb in RequestData<GetKeyValuesRequest>::checkAndProcessResult(bool) ()
#16 0x0000000001bddf9b in (anonymous namespace)::LoadBalanceActorState<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface>, (anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> > >::a_body1loopBody1loopBody2when1(ErrorOr<GetKeyValuesReply> const&, int) ()
#17 0x0000000001bde1d4 in ActorCallback<(anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> >, 6, ErrorOr<GetKeyValuesReply> >::fire(ErrorOr<GetKeyValuesReply> const&) ()
#18 0x00000000014e77f0 in ActorCallback<(anonymous namespace)::WaitValueOrSignalActor<GetKeyValuesReply>, 0, GetKeyValuesReply>::fire(GetKeyValuesReply const&) ()
#19 0x00000000007cac78 in NetSAV<GetKeyValuesReply>::receive(ArenaObjectReader&) ()
#20 0x0000000001dbbe05 in (anonymous namespace)::DeliverActorState<(anonymous namespace)::DeliverActor>::a_body1cont1(int) [clone .isra.976] ()
#21 0x0000000001dbc13b in ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) ()
#22 0x0000000000818770 in void SAV<Void>::send<Void>(Void&&) ()
#23 0x0000000001eeb81e in N2::Net2::run() ()
#24 0x0000000000783a8f in main ()

As we can see, C++ exception handling is interrupted by a signal. profileHandler indicates this signal is SIGPROF raised by the Profiler Thread. Both the exception-handling path and the signal handler lock the same non-reentrant mutex in _Unwind_Find_FDE while unwinding the stack, so fdbserver deadlocks against itself.
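To make the failure mode concrete, here is a minimal, self-contained sketch (not FDB code) of the same pattern: a SIGPROF handler that calls glibc's backtrace() while the interrupted thread is inside __cxa_throw. On toolchains where both paths funnel through _Unwind_Find_FDE and its non-reentrant mutex (e.g. glibc 2.17 with libgcc), this can hang exactly like the trace above; whether it actually deadlocks depends on the toolchain.

// deadlock_sketch.cpp -- illustration only, not FoundationDB code.
// Assumes a toolchain where backtrace() and __cxa_throw both take the
// non-reentrant mutex guarding _Unwind_Find_FDE (e.g. glibc 2.17 + libgcc).
#include <signal.h>
#include <string.h>
#include <execinfo.h>   // backtrace()
#include <sys/time.h>   // setitimer()

// Mimics fdbserver's profileHandler: unwind the stack from a signal handler.
// backtrace() is not async-signal-safe; if SIGPROF lands while the
// interrupted thread already holds the unwinder's mutex, this blocks forever.
static void profHandler(int) {
    void* frames[64];
    backtrace(frames, 64);
}

int main() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = profHandler;
    sigaction(SIGPROF, &sa, nullptr);

    // Deliver SIGPROF frequently, like the slow-task profiler does.
    itimerval timer{{0, 1000}, {0, 1000}};   // every 1 ms of CPU time
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Each throw unwinds via _Unwind_RaiseException, which takes the same
    // mutex the handler needs -- this is the window SIGPROF has to hit.
    for (;;) {
        try { throw 42; } catch (int) { /* keep looping */ }
    }
}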
My questions are:
(1) For now I have disabled the profiler thread to avoid this case. Are there any other potential problems if I turn the Profiler Thread off?
(2) In FoundationDB 7.x there is a better backtrace implemented with absl::GetStackTrace, but profileHandler still invokes the original backtrace, so this deadlock could still occur there. Is there any plan to fix that invocation? And could the better backtrace implementation be backported to FoundationDB 6.x?

Good catch! Looks like profileHandler should switch to using the same absl::GetStackTrace if available.
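For reference, a hypothetical sketch of what such a switch might look like (this is not the actual FDB implementation): absl::GetStackTrace from absl/debugging/stacktrace.h is intended to be usable from signal handlers, unlike glibc's backtrace().

// Hypothetical sketch only -- the real profileHandler lives in flow and
// records frames into the slow-task profiler's buffer.
#include "absl/debugging/stacktrace.h"

static void profileHandlerSketch(int /*signum*/) {
    void* frames[64];
    // absl::GetStackTrace(result, max_depth, skip_count);
    // skip_count = 0 keeps the handler frame itself.
    int depth = absl::GetStackTrace(frames, 64, /*skip_count=*/0);
    (void)depth;  // a real handler would copy frames[0..depth) somewhere safe
}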

Were you by any chance using a sanitizer? There is a workaround to this problem employed by FDB, but it does not work in sanitizers. There may be other builds that it doesn’t work in either.

The absl stack trace has other problems that made it not work in this use case. The source of those issues hasn’t been discovered yet, and until then the existing backtrace works better in practice.

Something I just noticed while revisiting this is that this line originally referred to calling the libc backtrace implementation, but now it’s calling the absl implementation, which presumably does not have the originally-intended effect.

This shouldn’t be a problem. The slow task profiler is a diagnostic tool to help identify certain performance problems, and isn’t necessary for normal operation.

I agree with AJ - 6.3.25 is believed to have this problem mitigated (except in sanitizer builds). Can you share a bit more about whether or not you built from source or downloaded an official binary, and what environment you’re running in? Maybe what libc implementation?

Thanks a lot for this reply. When I disable the profiler, the problem seems to disappear.

Here is some more information I can share. I built FDB from source without sanitizers. The build and runtime environment is:

OS: CentOS 7
Compiler: gcc 8.3.1 in devtoolset-8
Glibc version: 2.17
Build type: RelWithDebInfo

I have looked at the glibc-2.17 source and found that backtrace in libgcc may deadlock when AddressSanitizer is in use. But my deadlock is not caused by a sanitizer. Are there any other ideas about this case? Or is there any other information I can provide?

No. I built FDB without sanitizers.

Thanks for your reply.
Disabling the profiler seems to work around this case for now, but I think raising this issue in the forum and trying to find a proper solution can help make FDB better.