I think the log queues are growing here because the storage queues are. Anything in flight on the storage servers is also in flight on the logs (the logs can’t remove data from their queues until the storage servers that need that data have made it durable).
My guess at the diverging behavior is that each of the processes is experiencing roughly the same problem, but with slight variations in rate and start time. Depending on your configuration, some number of storage queues will reach ~1GB and then level off as ratekeeper kicks in. Any storage server that’s doing a little bit better than the limiting storage server will likely then have its queue fall back down since it’s able to do just a bit more than is required of it.
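If it helps, here’s roughly how I’d watch that while the test runs — a minimal sketch that shells out to `fdbcli` and pulls queue sizes and ratekeeper’s limiting reason out of `status json`. The field names here are from memory and may differ a bit between versions, so check them against your cluster’s output:

```python
# Sketch: poll `status json` and print per-role queue sizes plus what
# ratekeeper says it is limited by. Field names are assumptions from my
# recollection of the status json schema -- verify against your version.
import json
import subprocess
import time

def status_json():
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    return json.loads(out)

while True:
    s = status_json()
    qos = s["cluster"].get("qos", {})
    print("limited by:", qos.get("performance_limited_by", {}).get("name"))
    for pid, proc in s["cluster"]["processes"].items():
        for role in proc.get("roles", []):
            if role.get("role") in ("storage", "log"):
                # queue size ~= bytes received but not yet made durable
                queued = role["input_bytes"]["counter"] - role["durable_bytes"]["counter"]
                print(pid[:8], role["role"], f"{queued / 1e9:.2f} GB queued")
    time.sleep(5)
```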
The remaining question is why your disks suddenly can’t keep up after several minutes of running. One possibility is that this is a property of the disk, for example due to SSD garbage collection starting after a while. We’ve seen many cases of disks behaving markedly worse after periods of sustained load, and that explanation would match the behavior here reasonably well. How busy are the disks during your test?
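The easiest way to answer that is to run `iostat -x 5` during the test and watch `%util`, but if it’s more convenient to log it from a script, something like this little sampler would do (Linux only; it just reads `/proc/diskstats` and reports the fraction of wall time each device had I/O in flight):

```python
# Rough disk-busyness sampler: diffs the "ms spent doing I/O" counter in
# /proc/diskstats over each interval and prints it as a percentage.
import time

IO_TICKS_FIELD = 12  # ms spent doing I/O (field 13 in the diskstats docs)

def io_ticks():
    ticks = {}
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            ticks[parts[2]] = int(parts[IO_TICKS_FIELD])
    return ticks

prev, interval = io_ticks(), 5.0
while True:
    time.sleep(interval)
    cur = io_ticks()
    for dev, t in cur.items():
        busy = (t - prev.get(dev, t)) / (interval * 1000) * 100
        if busy > 1:  # skip idle devices/partitions
            print(f"{dev}: {busy:.0f}% busy")
    prev = cur
```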
If you are starting the test from an empty or small database, it may also be possible that the performance regime changes as your b-tree gets larger than cache, resulting in an increased number of reads going to disk (I’m assuming the ssd storage engine). It feels a little bit like that should appear as a more gradual degradation, but maybe not. If you are running your tests from empty, you could potentially eliminate this effect by pre-loading your database to a decent size. You could also limit the writes in your test to existing keys, in which case the database shouldn’t change in size so long as your value sizes are similar.
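If you want to try the pre-load route, something along these lines with the Python bindings would work. The key count, value size, and key prefix here are placeholders — size them so the resulting b-tree comfortably exceeds the storage server cache:

```python
# Minimal pre-load sketch: write a fixed number of fixed-size keys up front
# so the database is already "large" before the test starts. All sizes and
# names below are made up for illustration -- match them to your workload.
import os
import fdb

fdb.api_version(620)  # use whatever API version matches your cluster
db = fdb.open()

NUM_KEYS = 100_000_000   # pick a count that exceeds the storage cache
VALUE_SIZE = 100         # bytes per value
BATCH = 1000             # keys per transaction

@fdb.transactional
def load_batch(tr, start):
    for i in range(start, start + BATCH):
        tr[fdb.tuple.pack(("preload", i))] = os.urandom(VALUE_SIZE)

for start in range(0, NUM_KEYS, BATCH):
    load_batch(db, start)
```

Then, if your test only writes to keys under that same ("preload", i) range with similar value sizes, the database should stay roughly the same size for the duration of the run.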
Nothing else immediately comes to mind to explain why performance would degrade over time. Assuming the problem isn’t something bad going on in the cluster, the resolution to this is likely going to be that you’ll either need to decrease your load or increase your computing resources. It seems like the storage servers do ok at 1/3 load (from the 10 instance 1x redundancy test), so running 30 instances at 3x redundancy seems likely to work.