Troubleshooting hot keys/prefixes (or, viewing the shard->storage process mapping)

We’ve got an issue at the moment where the durability lag on 2 of our storage processes is going very high (anything from 5 to 30 minutes) and disrupting the cluster’s ability to serve queries for our system.

I have a theory that this is because a specific prefix in our keyspace lives entirely in one shard, and that shard is currently hosted on those storage processes. We do a lot of reads against that keyspace, but only a ‘normal’ (for our cluster) amount of writes. Because we’re only on FDB 7.1, reads don’t count towards the decision to split shards, so I think the reads are being bottlenecked by the one process/host. The writes are also suffering a lot and durability lag is rising because they’re sequential (if I’m understanding Storage servers per disk · apple/foundationdb Wiki · GitHub correctly), and this is a cloud system where latency between instance and disk is quite high and is our bottleneck, even though we have plenty of ‘spare’ throughput and IOPS.

We’ve already used the FDB metric keys under /metrics/data_distribution_stats to work out that we probably have a ‘hot’ prefix: it looks like one shard holds the vast majority of the data we’d ever read (we write a lot of accounting information to other keys that are spread across shards, but the system rarely needs to read them in normal operation).
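
A minimal sketch of that check with the Python bindings, for anyone else trying the same thing. It assumes the 7.1 special-key layout, that the value JSON exposes a shard_bytes field, and that this module needs no extra transaction options; treat those as assumptions rather than documented behaviour.

import json
import fdb

fdb.api_version(710)
db = fdb.open()

BEGIN = b'\xff\xff/metrics/data_distribution_stats/'
END = b'\xff\xff/metrics/data_distribution_stats0'  # '0' is '/' + 1

@fdb.transactional
def shard_sizes(tr):
    # Each key in this module is the module prefix plus a shard's begin key;
    # the value is a small JSON blob that, as far as I can tell, contains
    # "shard_bytes".
    sizes = []
    for kv in tr.get_range(BEGIN, END):
        shard_begin = kv.key[len(BEGIN):]
        stats = json.loads(kv.value)
        sizes.append((shard_begin, stats.get('shard_bytes', 0)))
    return sizes

# Print the 20 biggest shards so a hot prefix stands out.
for begin, nbytes in sorted(shard_sizes(db), key=lambda s: -s[1])[:20]:
    print(nbytes, begin[:60])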

I’d like to be able to use the FDB metric keys (or whatever else is available) to:

  • map a shard to the process(es) currently serving it, so I can see whether it tracks with our overloaded storage nodes;
  • have a function where I provide a key and it tells me which shard/process that key is on;
  • ideally, get a breakdown of % reads and writes by shard.

Is any of that possible?

Additionally, we’re running in three_data_hall mode, so I’m confused about why we see specifically 2 storage processes getting overloaded when, as I understand it, we should have 3 replicas of our data.

If all reads were going to a ‘primary’ node (as with Kafka topic partitions), I’d expect 1 node to be much more heavily loaded and 2 others to be midway between baseline and that one (they get the update writes, but not the reads).

If reads can go to any replica, I’d expect 3 nodes all at about the same level of ‘overloading’, not just 2.

So why do we consistently see 2 nodes overloaded vastly more than any others in the cluster?

Hmm, it looks like the locality API (Python API — FoundationDB 7.1) might be what I want for working out which shard is on which process currently, using the first key of each shard?
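
Something like the following sketch seems to cover the first two bullets above, assuming fdb.locality.get_boundary_keys and fdb.locality.get_addresses_for_key behave as documented; the my-hot-prefix range is just a placeholder for the suspect prefix.

import fdb

fdb.api_version(710)
db = fdb.open()

@fdb.transactional
def processes_for_key(tr, key):
    # Network addresses of the storage servers currently responsible for `key`.
    return list(fdb.locality.get_addresses_for_key(tr, key).wait())

def shard_to_processes(db, begin, end):
    # get_boundary_keys yields the begin key of every shard intersecting
    # [begin, end); pairing each with processes_for_key gives the
    # shard -> storage process mapping.
    return [(shard_begin, processes_for_key(db, shard_begin))
            for shard_begin in fdb.locality.get_boundary_keys(db, begin, end)]

# 'my-hot-prefix' is a placeholder; '0' is '/' + 1, so this covers the prefix.
for shard_begin, procs in shard_to_processes(db, b'my-hot-prefix/', b'my-hot-prefix0'):
    print(shard_begin, procs)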

The locality API seems to be slow as hell, though. I can see there’s also the /serverKeys range in the system key space, which should give us a mapping of all of the shard start keys that a server is currently responsible for, and is much faster. But I don’t yet know how to decode the serverId part of that key. Anyone?
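
My (unverified) understanding is that the key layout is \xff/serverKeys/ followed by the storage server's 16-byte UID in binary, a '/' separator, and then the shard's begin key. A sketch under that assumption, which matches the hex UID against the storage role ids in status json (those ids appear to be a hex prefix of the same UID, but treat the exact encodings here as guesses):

import json
import fdb

fdb.api_version(710)
db = fdb.open()

SERVER_KEYS = b'\xff/serverKeys/'

@fdb.transactional
def shard_begins_by_server(tr):
    # Assumed layout: \xff/serverKeys/ + 16-byte UID + '/' + shard begin key.
    # Note: entries exist both where a server's responsibility starts and where
    # it ends; the value distinguishes the two (encoding not pinned down here).
    tr.options.set_read_system_keys()  # plain \xff keys need this option
    out = {}
    for kv in tr.get_range(SERVER_KEYS, b'\xff/serverKeys0'):
        rest = kv.key[len(SERVER_KEYS):]
        server_id, shard_begin = rest[:16].hex(), rest[17:]
        out.setdefault(server_id, []).append(shard_begin)
    return out

@fdb.transactional
def storage_addresses_by_role_id(tr):
    # status json lists each process's roles; storage roles carry an "id"
    # which I believe is a hex prefix of the same UID.
    status = json.loads(tr[b'\xff\xff/status/json'])
    out = {}
    for proc in status['cluster']['processes'].values():
        for role in proc.get('roles', []):
            if role.get('role') == 'storage' and 'id' in role:
                out[role['id']] = proc.get('address')
    return out

servers = shard_begins_by_server(db)
roles = storage_addresses_by_role_id(db)
for server_id, begins in servers.items():
    address = next((addr for role_id, addr in roles.items()
                    if server_id.startswith(role_id)), 'unknown')
    print(server_id, address, len(begins), 'serverKeys entries')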

“the writes are also suffering a lot and durability lag is rising because they’re sequential”

This is true for the default ssd-2 storage engine, but not for ssd-redwood-1, which can provide 10x or more write throughput on network-attached block devices due to parallelization of IOs on the update path.

Using Redwood won’t fix hotspotting fundamentally (data movement must do that), but it will give you far more write and read throughput per Storage Server.
