Can you try to debug this? Would be interesting to know what bottleneck you’re hitting there. As SS is spiking, I assume the TLogs are fine. It would be awesome if you could try the following:
- Replace all clear and clearrange mutations with `set` mutations. If you hit the same issue, it will tell us that you simply run into memory pressure issues with the SS.
- Replace your many small clear range mutations with a few large ones. Not sure how easy that will be in your workload, but generally I would assume that this would increase performance significantly.
- Do you know what the CPU utilization looks like when SS goes up? And what about disk?
- If CPU is very high during these periods, you can attach `perf` to the storage process during one of these spikes. You can do `perf record -p PID -g`. You can then post the results here.
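
On the suggestion about issuing fewer, larger clear ranges: if your client already collects the ranges it wants to clear before committing, one way to try it is to merge the ranges that overlap or sit directly next to each other. Below is a rough sketch of that idea, not something taken from your code. The helper name `coalesce_clear_ranges` is made up, and it assumes ranges are `(begin, end)` byte-string pairs with an exclusive end. It only merges ranges that touch or overlap, so it never clears a key you did not already intend to clear.

```python
from typing import Iterable, List, Tuple

# A key range [begin, end): begin is inclusive, end is exclusive.
KeyRange = Tuple[bytes, bytes]


def coalesce_clear_ranges(ranges: Iterable[KeyRange]) -> List[KeyRange]:
    """Merge overlapping or directly adjacent ranges (hypothetical helper)."""
    merged: List[KeyRange] = []
    # Python compares bytes lexicographically, so sorting orders by begin key.
    for begin, end in sorted(ranges):
        if begin >= end:
            continue  # empty range, nothing to clear
        if merged and begin <= merged[-1][1]:
            # Overlaps or touches the previous range: extend that one.
            prev_begin, prev_end = merged[-1]
            merged[-1] = (prev_begin, max(prev_end, end))
        else:
            merged.append((begin, end))
    return merged
```

With the Python bindings you would then loop over the result and call `tr.clear_range(begin, end)` once per merged range instead of once per original range. How much this helps depends on how contiguous your cleared keys actually are; if they are scattered across the keyspace, very little will merge.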