Storage server spending large amount of CPU time in the network stack

mpatou_openai · November 26, 2024, 4:04pm

In production, on one of our clusters running 7.3.43 (our clusters handle different kinds of workload, so one is not comparable to another), we are seeing very high CPU usage (90%+) on some storage servers. I did some flamegraph profiling, and it seems that we spend quite a lot of CPU time sending and receiving network data: for instance, _libc_recv takes close to 10% of CPU and _sys_sendmsg takes 25%.

I’m wondering whether there is any tuning that would reduce the share of CPU spent on network operations. I haven’t really found any guidelines.

jzhou · December 7, 2024, 8:27pm

I don’t know if there are tunings we can do. It might be worth looking more closely at the number of requests and the bandwidth used by the storage server. The StorageMetrics event has information about the number of requests. If the storage server is receiving more traffic than the others, the problem may be related to a “hot” shard. This tool can help debug hot shards: foundationdb/contrib/transaction_profiling_analyzer/transaction_profiling_analyzer.py at main · apple/foundationdb · GitHub.
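As a rough illustration of comparing request counts across storage servers, here is a minimal Python sketch. It is not an official tool and makes several assumptions: that the trace logs are written as JSON (one event per line), that each StorageMetrics event carries a `Machine` field, and that counter fields such as `FinishedQueries` are logged as a "rate roughness total" string. The field name and the helper names `mean_rate_per_server` and `hot_servers` are hypothetical and should be checked against your own trace files.

```python
import json
from collections import defaultdict


def mean_rate_per_server(lines, field="FinishedQueries"):
    """Average the per-second rate of `field` for each storage server.

    Assumptions (verify against your own logs): JSON trace format,
    events tagged Type == "StorageMetrics", and counters formatted
    as "<rate> <roughness> <total>".
    """
    rates = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("Type") != "StorageMetrics" or field not in event:
            continue
        try:
            # The first token is assumed to be the rate.
            rate = float(str(event[field]).split()[0])
        except (ValueError, IndexError):
            continue
        rates[event.get("Machine", "unknown")].append(rate)
    return {server: sum(v) / len(v) for server, v in rates.items()}


def hot_servers(mean_rates, factor=2.0):
    """Return servers whose mean rate exceeds `factor` times the median."""
    if not mean_rates:
        return []
    values = sorted(mean_rates.values())
    median = values[len(values) // 2]
    if median <= 0:
        return []
    return sorted(s for s, r in mean_rates.items() if r > factor * median)
```

Feeding it the lines of the trace files from all storage processes and printing `hot_servers(mean_rate_per_server(lines))` gives a quick list of servers that stand out. A server on that list is only a hint that a hot shard may be involved, so the transaction profiling analyzer is still the next step.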
