How to troubleshoot throughput performance degrade?

You definitely need more than 4 prefixes. With only 4 prefixes and triple replication, only (4 x 3 replicas) 12 storage processes (about ~10% of the processes in your case) are writing at any given time. This means one will fall behind, the shard gets split because it is too hot, the next one falls behind, etc. Also you don’t get high enough queue depth to saturate the devices.

I would also try out the memory engine if you are going to be rapidly deleting data after you receive it.