Trouble Understanding Roughness / Hotspotting

ricky.saltzer · October 16, 2018, 2:58pm

Hey there -

I’ve currently got a continuous streaming pipeline writing to our FDB cluster, and after about 12 hours or so, I started to see some issues pop up.

Our workload is a fairly consistent 3000-6000Hz write rate and 300-500Hz read rate. The vast majority of the operations I’m performing are Atomic Mutations (MIN, MAX, ADD, APPEND_IF_FITS). The one exception performs a read, merge, and write, since the data is a custom Array[Byte] sequence.
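For readers unfamiliar with the atomic mutations mentioned above, here is a minimal sketch of what ADD does to a value, written with the Python standard library only. It assumes the documented behavior that operands are treated as little-endian integers and that the existing value is zero-padded or truncated to the operand's length. The helper names (`encode_operand`, `simulate_add`) are hypothetical and are not part of the FoundationDB client API; this only simulates the server-side arithmetic locally.

```python
import struct


def encode_operand(n: int) -> bytes:
    """Pack a signed 64-bit integer as little-endian bytes (the usual ADD operand shape)."""
    return struct.pack("<q", n)


def simulate_add(existing: bytes, operand: bytes) -> bytes:
    """Local simulation of an atomic ADD: little-endian addition with wraparound.

    Assumption: the existing value is padded with zero bytes, or truncated,
    to match the operand length before adding.
    """
    width = len(operand)
    padded = existing.ljust(width, b"\x00")[:width]
    current = int.from_bytes(padded, "little")
    delta = int.from_bytes(operand, "little")
    total = (current + delta) % (1 << (8 * width))
    return total.to_bytes(width, "little")


# A missing key behaves like an empty value, so the first ADD stores the operand.
counter = simulate_add(b"", encode_operand(5))
counter = simulate_add(counter, encode_operand(3))
value = struct.unpack("<q", counter)[0]
```

Because the mutation is applied without the client reading the key first, many writers can update the same counter without conflicting with each other, which is why only the custom read, merge, and write path in this workload needs a read.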

I’m confident that our cluster topology is suboptimal, as we’re still learning the ins and outs before rolling out to production.

The problem I’m seeing as of this morning is a very high roughness value (1453.33), which, according to the Machine Stats documentation, is an indicator of latency and storage queue problems.

Each time I run a “status” it takes a while to come back, with some servers not included. I believe this is because we naively put coordinator processes on storage servers. That being said, each time I look at the status, there are two or three (out of 16) storage servers that have their disk IO pegged at 99%. It looks like the same storage servers each time. However, their Gbps rate is the same as that of the other nodes, which have 1-5% disk IO.

Is this an indication that my workload is hotspotting?

Our key is made up as follows (Tuple)

(string_literal(aggregations), string_literal(metrics), md5(metric_name), reverse(timestamp), string(metric_name), int(agg_type))

I was hoping that the third element md5(metric_name) would afford us some even distribution. The only thing I can think of is that we have specific metrics that are very hot compared to others.
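To make the concern above concrete, here is a rough sketch of how a hash prefix spreads distinct metrics but cannot spread writes to a single hot metric. It uses only the standard library. The fixed 16-way bucketing by the first hash byte is an assumption for illustration and is not how FoundationDB actually assigns key ranges to storage servers; the workload numbers and the names `bucket` and `writes_per_bucket` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict


def bucket(metric_name: str, buckets: int = 16) -> int:
    """Map a metric to a hypothetical key-range bucket using the first md5 byte."""
    first_byte = hashlib.md5(metric_name.encode("utf-8")).digest()[0]
    return first_byte * buckets // 256


def writes_per_bucket(write_counts: Dict[str, int], buckets: int = 16) -> Counter:
    """Sum the write counts of every metric that lands in each bucket."""
    totals: Counter = Counter()
    for name, count in write_counts.items():
        totals[bucket(name, buckets)] += count
    return totals


# Hypothetical workload: 1000 metrics with equal write volume...
uniform = {f"metric_{i}": 10 for i in range(1000)}
# ...versus the same set where one metric is far hotter than the rest.
skewed = dict(uniform)
skewed["metric_0"] = 50_000

uniform_totals = writes_per_bucket(uniform)
skewed_totals = writes_per_bucket(skewed)
hot_bucket = bucket("metric_0")

print("uniform:", sorted(uniform_totals.items()))
print("skewed: ", sorted(skewed_totals.items()))
print("hot bucket:", hot_bucket)
```

In the skewed case every write for the hot metric shares the same md5 prefix, so it all lands in one bucket no matter how well the hash distributes the other metrics. With the reversed timestamp as the next element, the newest writes for that metric are also adjacent in key order.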

Any advice you guys might have would be super valuable to us.

Thanks in advance,
Ricky
Epic Games

ricky.saltzer · October 24, 2018, 6:11pm

To follow up on this: as it turns out, we were experiencing extremely poor performance using Amazon EBS volumes. We have since switched to instances with locally attached NVMe drives and have seen a significant performance / stability improvement.
