Knobs/strategies to get around storage server write queue size error?

(gaurav) #1


In my data model, there are a couple of “hot” storage servers that are getting a lot of read requests, causing their CPU usage to get pegged at 100% for long durations. This, in turn, causes the storage queues on these servers to grow (due to reads having higher priority than writes), leading to the cluster reporting cluster.qos.performance_limited_by as storage_server_write_queue_size.

While I am working independently on changing the access patterns to resolve these hotspots, I wanted to understand a couple of things about this:

  • What is the effect of this cluster state? Will it cause the rate-keeper to slow down the rate at which transactions can be created (in other words: the rate at which read_version can be obtained)?
  • Are there any knobs to increase the priority of writes so that cluster can keep draining the queue at a faster rate (probably at the expense of slowing down reads/transactions uniformly over time)? Are there any other potential downsides or risks to playing around with read vs write request priority knobs?


(A.J. Beamon) #2

Yes, if storage servers in multiple fault domains have large queues (I think this starts around 900MB), then ratekeeper will limit the rate that transactions are started.

I’m not aware of any knobs to flip the prioritization of reads vs. writes on the storage servers. There are priorities set for the different activities which could in theory be changed with a recompile, though. Our rationale for not doing so is that it is our goal to concentrate all of the added latency of a transaction in the get read version request. If we succeed in that, then once a transaction is started, it should be able to do the rest of its operations with more normal latencies and finish in a reasonable amount of time. Otherwise, as read latencies start to creep up, you run the risk of your transactions being unable to complete within the 5 second window and timing out. In the case of a hot read shard, that could mean that your transactions which are reading from the hot shard get starved.

That said, I think Snowflake may have experimented with this (I’m not certain, though, you’d probably have to check with them). I don’t think we’ve ever run in that kind of regime, so they may have additional insights from experience.

(Alex Miller) #3

One of the items on Snowflake’s todo list was to see if this sort of thing still occurs on 6.0, and it wasn’t something that was fixed along the way from 3.0. We had discussed the “panic” reprioritization of writes to be higher than reads if queues build too high as a potential item to be merged in from their changes.

@gaurav, are you running on EBS (or another cloud provider’s IOPS-limited block storage solution)?

And let’s explicitly ping @markus.pilman @andrew.noyes , as this is likely relevant and interesting for them.

(gaurav) #4

Thanks, @ajbeamon for clarifications!

@alexmiller, yes, I am trying this out on a cluster of 7 m4.4xlarge EC2 instances, each having an attached EBS General Purpose SSD ( gp2) type volume with 3,000+ provisioned IOPS.

However, I can confirm that our write rate is much smaller (under 1000 IOPS - as checked from AWS monitoring graphs), and we are never throttled on disk IOPS by AWS.

From monitoring the fdb json output over long durations, I can clearly observe that as soon as SS CPU load comes down (for small intervals), it starts to make the queue durable, and reduces the queue size. But as soon as the read loads resume, the durable byte rate goes down to 0hz, and the queue starts to build up.

I also tried to temporarily reduce our read requests that I suspect to be the issue, and it immediately resolved the SS queue size issue (write rates were unchanged). So, I am positive here, that the buildup in SS queue is not due to excessive writes, but rather due to excessive read induced CPU load on these processes.