FoundationDB

Trouble Understanding Roughness / Hotspotting


(Ricky Saltzer) #1

Hey there -

I’ve currently got a continuous streaming pipeline writing to our FDB cluster, and after about 12 hours or so, I started to see some issues pop up.

Our workload is a fairly steady 3,000-6,000 Hz write rate and 300-500 Hz read rate. The vast majority of the operations I’m performing are atomic mutations (MIN, MAX, ADD, APPEND_IF_FITS). One operation instead performs a read, merge, and write, since its data is a custom Array[Byte] sequence.
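For anyone following along, here is a minimal sketch of what the atomic-mutation side of this looks like. It assumes the Python binding's transaction methods `add`, `min`, and `max`, which take little-endian integer operands. The `/sum`, `/min`, and `/max` key suffixes and the `RecordingTransaction` stand-in are hypothetical and exist only so the snippet runs without a cluster.

```python
import struct


def record_sample(tr, base_key: bytes, value: int) -> None:
    """Fold one non-negative sample into sum/min/max keys without reading them."""
    # Atomic integer operations expect a little-endian operand.
    operand = struct.pack("<Q", value)
    tr.add(base_key + b"/sum", operand)  # hypothetical key suffixes
    tr.min(base_key + b"/min", operand)
    tr.max(base_key + b"/max", operand)


class RecordingTransaction:
    """Demo-only stand-in for a transaction: it just logs each call."""

    def __init__(self):
        self.calls = []

    def add(self, key, param):
        self.calls.append(("add", key, param))

    def min(self, key, param):
        self.calls.append(("min", key, param))

    def max(self, key, param):
        self.calls.append(("max", key, param))


tr = RecordingTransaction()
record_sample(tr, b"metric", 5)
```

Against a real cluster, `tr` would be an actual transaction object rather than the stand-in. The point is that none of these calls read the key first, unlike the read-merge-write path.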

I’m confident that our cluster topology is suboptimal, as we’re still learning the ins and outs before rolling out to production.

The problem I’m seeing as of this morning is a very high roughness value (1453.33), which according to the Machine Stats documentation is an indicator of latency and storage queue problems.

Each time I run “status”, it takes a while to come back, and some servers are missing from the output. I believe this is because we naively put coordinator processes on storage servers :blush:. That being said, each time I look at the status, two or three (out of 16) storage servers have their disk IO pegged at 99%. They appear to be the same storage servers each time. However, their Gbps rate is the same as that of the other nodes, which sit at 1-5% disk IO.

Is this an indication that my workload is hotspotting?

Our key is a Tuple laid out as follows:

(string_literal(aggregations), string_literal(metrics), md5(metric_name), reverse(timestamp), string(metric_name), int(agg_type))

I was hoping that the third element, md5(metric_name), would give us an even distribution across the keyspace. The only thing I can think of is that specific metrics are very hot compared to others.
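To illustrate that last suspicion, here is a rough sketch of how a single hot metric can still concentrate writes even with an md5 prefix. It is a deliberate simplification: it assumes 16 fixed, equal-width key ranges chosen by the first byte of the digest, whereas a real cluster splits key ranges dynamically. The workload numbers and metric names are made up.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 16  # assumption: one fixed key range per storage server


def bucket_for(metric_name: str) -> int:
    """Pick a bucket from the first byte of md5(metric_name)."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    return digest[0] * NUM_BUCKETS // 256


def writes_per_bucket(workload: dict) -> Counter:
    """Sum the write counts that land in each bucket."""
    counts = Counter()
    for metric_name, writes in workload.items():
        counts[bucket_for(metric_name)] += writes
    return counts


# Hypothetical workload: 200 quiet metrics plus one very hot metric.
workload = {f"metric_{i}": 10 for i in range(200)}
workload["hot_metric"] = 5000

counts = writes_per_bucket(workload)
hot_bucket = bucket_for("hot_metric")
busiest_bucket = max(counts, key=counts.get)
```

Because every key for a given metric shares the same md5 prefix, all of its writes land in one contiguous range. The hash spreads different metrics apart, but it cannot spread a single metric's traffic.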

Any advice you guys might have would be super valuable to us.

Thanks in advance,
Ricky
Epic Games


(Ricky Saltzer) #2

To follow up on this: as it turns out, we were experiencing extremely poor performance using Amazon EBS volumes. We have since switched to instances with hardware-attached NVMe drives and have seen a significant improvement in performance and stability.