Deadlock caused by Profiler Thread in storage server

Hi everyone. I have run into a deadlock while writing a large amount of data to a v6.3.25 FoundationDB cluster.
While data is being transmitted to the cluster, a few storage fdbservers stop responding to any requests and are eventually considered offline. I checked the stack traces of these servers; here is one of them.

Thread 4 (Thread 0x7f8d809a1700 (LWP 375283)):
#0  0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001f90c9e in ThreadPool::Thread::run() ()
#2  0x0000000001f91579 in ThreadPool::start(void*) ()
#3  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f8d7bfff700 (LWP 375289)):
#0  0x00007f8d80d7ee9d in nanosleep () from /lib64/libpthread.so.0
#1  0x0000000001f15085 in threadSleep(double) ()
#2  0x0000000001f15159 in checkThread(void*) ()
#3  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f8d8275f700 (LWP 375366)):
#0  0x00007f8d80d7ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000076bef7 in etp_proc ()
#2  0x00007f8d80d77ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f8d80aa0b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8d8274c0c0 (LWP 375262)):
#0  0x00007f8d80d7e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8d80d79e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f8d80d79d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4  0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#5  0x00007f8d80f9b74d in ?? () from /lib64/libgcc_s.so.1
#6  0x00007f8d80f9bfe8 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7  0x00007f8d80ab6cd6 in backtrace () from /lib64/libc.so.6
#8  0x0000000001f13f32 in profileHandler(int) ()
#9  <signal handler called>
#10 0x00007f8d80d79d62 in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x00007f8d80f9da8a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#12 0x00007f8d80f9ad8c in ?? () from /lib64/libgcc_s.so.1
#13 0x00007f8d80f9bc33 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#14 0x00007f8d81502c76 in __cxa_throw () from /lib64/libstdc++.so.6
#15 0x0000000001c246bb in RequestData<GetKeyValuesRequest>::checkAndProcessResult(bool) ()
#16 0x0000000001bddf9b in (anonymous namespace)::LoadBalanceActorState<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface>, (anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> > >::a_body1loopBody1loopBody2when1(ErrorOr<GetKeyValuesReply> const&, int) ()
#17 0x0000000001bde1d4 in ActorCallback<(anonymous namespace)::LoadBalanceActor<StorageServerInterface, GetKeyValuesRequest, ReferencedInterface<StorageServerInterface> >, 6, ErrorOr<GetKeyValuesReply> >::fire(ErrorOr<GetKeyValuesReply> const&) ()
#18 0x00000000014e77f0 in ActorCallback<(anonymous namespace)::WaitValueOrSignalActor<GetKeyValuesReply>, 0, GetKeyValuesReply>::fire(GetKeyValuesReply const&) ()
#19 0x00000000007cac78 in NetSAV<GetKeyValuesReply>::receive(ArenaObjectReader&) ()
#20 0x0000000001dbbe05 in (anonymous namespace)::DeliverActorState<(anonymous namespace)::DeliverActor>::a_body1cont1(int) [clone .isra.976] ()
#21 0x0000000001dbc13b in ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) ()
#22 0x0000000000818770 in void SAV<Void>::send<Void>(Void&&) ()
#23 0x0000000001eeb81e in N2::Net2::run() ()
#24 0x0000000000783a8f in main ()

As we can see, C++ exception handling is interrupted by a signal. profileHandler indicates this signal is SIGPROF raised by the Profiler Thread. Both the exception-handling path and the signal handler lock the same non-reentrant mutex in _Unwind_Find_FDE while unwinding the stack, so fdbserver deadlocks against itself.
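To make the failure mode concrete, here is a minimal, self-contained sketch (not FDB code) of the same pattern: a SIGPROF handler that calls glibc's backtrace() while the interrupted thread is inside __cxa_throw. On toolchains where both paths funnel through _Unwind_Find_FDE and its non-reentrant mutex (e.g. glibc 2.17 with libgcc), this can hang exactly like the trace above; whether it actually deadlocks depends on the toolchain.

// deadlock_sketch.cpp -- illustration only, not FoundationDB code.
// Assumes a toolchain where backtrace() and __cxa_throw both take the
// non-reentrant mutex guarding _Unwind_Find_FDE (e.g. glibc 2.17 + libgcc).
#include <signal.h>
#include <string.h>
#include <execinfo.h>   // backtrace()
#include <sys/time.h>   // setitimer()

// Mimics fdbserver's profileHandler: unwind the stack from a signal handler.
// backtrace() is not async-signal-safe; if SIGPROF lands while the
// interrupted thread already holds the unwinder's mutex, this blocks forever.
static void profHandler(int) {
    void* frames[64];
    backtrace(frames, 64);
}

int main() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = profHandler;
    sigaction(SIGPROF, &sa, nullptr);

    // Deliver SIGPROF frequently, like the slow-task profiler does.
    itimerval timer{{0, 1000}, {0, 1000}};   // every 1 ms of CPU time
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Each throw unwinds via _Unwind_RaiseException, which takes the same
    // mutex the handler needs -- this is the window SIGPROF has to hit.
    for (;;) {
        try { throw 42; } catch (int) { /* keep looping */ }
    }
}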
My questions are:
(1) For now I have disabled the profiler thread to avoid this case. Are there any other potential problems if I turn the Profiler Thread off?
(2) In FoundationDB 7.x there is a better backtrace implemented with absl::GetStackTrace, but profileHandler still invokes the original backtrace, so this deadlock could still occur there. Is there any plan to fix that invocation? And could the better backtrace implementation be backported to FoundationDB 6.x?

Good catch! Looks like profileHandler should switch to using the same absl::GetStackTrace if available.
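For reference, a hypothetical sketch of what such a switch might look like (this is not the actual FDB implementation): absl::GetStackTrace from absl/debugging/stacktrace.h is intended to be usable from signal handlers, unlike glibc's backtrace().

// Hypothetical sketch only -- the real profileHandler lives in flow and
// records frames into the slow-task profiler's buffer.
#include "absl/debugging/stacktrace.h"

static void profileHandlerSketch(int /*signum*/) {
    void* frames[64];
    // absl::GetStackTrace(result, max_depth, skip_count);
    // skip_count = 0 keeps the handler frame itself.
    int depth = absl::GetStackTrace(frames, 64, /*skip_count=*/0);
    (void)depth;  // a real handler would copy frames[0..depth) somewhere safe
}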

Were you by any chance using a sanitizer? There is a workaround to this problem employed by FDB, but it does not work in sanitizers. There may be other builds that it doesn’t work in either.

The absl stack trace has other problems that made it not work in this use case. The source of those issues hasn’t been discovered yet, and until then the existing backtrace works better in practice.

Something I just noticed while revisiting this is that this line originally referred to calling the libc backtrace implementation, but now it’s calling the absl implementation, which presumably does not have the originally-intended effect.

This shouldn’t be a problem. The slow task profiler is a diagnostic tool to help identify certain performance problems, and isn’t necessary for normal operation.

I agree with AJ - 6.3.25 is believed to have this problem mitigated (except in sanitizer builds). Can you share a bit more about whether or not you built from source or downloaded an official binary, and what environment you’re running in? Maybe what libc implementation?

Thanks a lot for this reply. When I disable the profiler, the problem seems to disappear.

Here is some more information I can share. I built FDB from source without sanitizers. The build and runtime environment is:

OS: CentOS 7
Compiler: gcc 8.3.1 in devtoolset-8
Glibc version: 2.17
Build type: RelWithDebInfo

I have looked at the glibc-2.17 source and found that backtrace in libgcc may deadlock when AddressSanitizer is in use. But my deadlock is not caused by a sanitizer. Are there any other ideas about this case? Or is there any other information I can provide?

No. I built FDB without sanitizers.

Thanks for your reply.
Disabling the profiler seems to work around this case for now, but I think raising this issue in the forum and trying to find a proper solution can help make FDB better.