Troubleshooting hot keys/prefixes (or, viewing the shard->storage process mapping)

We’ve got an issue at the moment where the durability lag on 2 of our storage processes is going very high (anything from 5 to 30 minutes) and disrupting the cluster’s ability to serve queries for our system.

I have a theory that this is because a specific prefix in our keyspace lives entirely in one shard, and that shard is currently hosted on those storage processes. We do a lot of reads against that keyspace, but only a ‘normal’ (for our cluster) amount of writes. Because we’re only on FDB 7.1, reads don’t count towards the decision to split shards, so I think the reads are being bottlenecked by the one process/host. The writes are also suffering a lot and durability lag is rising because they’re sequential (if I’m understanding Storage servers per disk · apple/foundationdb Wiki · GitHub correctly), and this is a cloud system where latency between instance and disk is quite high and is our bottleneck, even though we have plenty of ‘spare’ throughput and IOPS.

We’ve already used the FDB metric keys under /metrics/data_distribution_stats to work out that we probably have a ‘hot’ prefix: it looks like one shard holds the vast majority of the data we’d ever read (we write a lot of accounting information to other keys that are spread across shards, but the system rarely needs to read them in normal operation).
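
A minimal sketch of that check with the Python bindings, for anyone else trying the same thing. It assumes the 7.1 special-key layout, that the value JSON exposes a shard_bytes field, and that this module needs no extra transaction options; treat those as assumptions rather than documented behaviour.

import json
import fdb

fdb.api_version(710)
db = fdb.open()

BEGIN = b'\xff\xff/metrics/data_distribution_stats/'
END = b'\xff\xff/metrics/data_distribution_stats0'  # '0' is '/' + 1

@fdb.transactional
def shard_sizes(tr):
    # Each key in this module is the module prefix plus a shard's begin key;
    # the value is a small JSON blob that, as far as I can tell, contains
    # "shard_bytes".
    sizes = []
    for kv in tr.get_range(BEGIN, END):
        shard_begin = kv.key[len(BEGIN):]
        stats = json.loads(kv.value)
        sizes.append((shard_begin, stats.get('shard_bytes', 0)))
    return sizes

# Print the 20 biggest shards so a hot prefix stands out.
for begin, nbytes in sorted(shard_sizes(db), key=lambda s: -s[1])[:20]:
    print(nbytes, begin[:60])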

I’d like to be able to use the FDB metric keys (or whatever else is available) to:

  • map a shard to the process(es) currently serving it, so I can see whether it tracks with our overloaded storage nodes;
  • have a function where I provide a key and it tells me which shard/process that key is on;
  • ideally, get a breakdown of % reads and writes by shard.

Is any of that possible?

Additionally, we’re running in three_data_hall mode, so I’m confused about why we see specifically 2 storage processes getting overloaded when, as I understand it, we should have 3 replicas of our data.

If all reads were going to a ‘primary’ node (as with Kafka topic partitions), I’d expect 1 node to be much more heavily loaded and 2 others to be midway between baseline and that one (they get the update writes, but not the reads).

If reads can go to any replica, I’d expect 3 nodes all at about the same level of ‘overloading’, not just 2.

So why do we consistently see 2 nodes overloaded vastly more than any others in the cluster?

Hmm, it looks like the locality API (Python API — FoundationDB 7.1) might be what I want for working out which shard is on which process currently, using the first key of each shard?
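
Something like the following sketch seems to cover the first two bullets above, assuming fdb.locality.get_boundary_keys and fdb.locality.get_addresses_for_key behave as documented; the my-hot-prefix range is just a placeholder for the suspect prefix.

import fdb

fdb.api_version(710)
db = fdb.open()

@fdb.transactional
def processes_for_key(tr, key):
    # Network addresses of the storage servers currently responsible for `key`.
    return list(fdb.locality.get_addresses_for_key(tr, key).wait())

def shard_to_processes(db, begin, end):
    # get_boundary_keys yields the begin key of every shard intersecting
    # [begin, end); pairing each with processes_for_key gives the
    # shard -> storage process mapping.
    return [(shard_begin, processes_for_key(db, shard_begin))
            for shard_begin in fdb.locality.get_boundary_keys(db, begin, end)]

# 'my-hot-prefix' is a placeholder; '0' is '/' + 1, so this covers the prefix.
for shard_begin, procs in shard_to_processes(db, b'my-hot-prefix/', b'my-hot-prefix0'):
    print(shard_begin, procs)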

The locality API seems to be slow as hell, though. I can see there’s also the /serverKeys range in the system key space, which should give us a mapping of all of the shard start keys that a server is currently responsible for, and is much faster. But I don’t yet know how to decode the serverId part of that key. Anyone?
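
My (unverified) understanding is that the key layout is \xff/serverKeys/ followed by the storage server's 16-byte UID in binary, a '/' separator, and then the shard's begin key. A sketch under that assumption, which matches the hex UID against the storage role ids in status json (those ids appear to be a hex prefix of the same UID, but treat the exact encodings here as guesses):

import json
import fdb

fdb.api_version(710)
db = fdb.open()

SERVER_KEYS = b'\xff/serverKeys/'

@fdb.transactional
def shard_begins_by_server(tr):
    # Assumed layout: \xff/serverKeys/ + 16-byte UID + '/' + shard begin key.
    # Note: entries exist both where a server's responsibility starts and where
    # it ends; the value distinguishes the two (encoding not pinned down here).
    tr.options.set_read_system_keys()  # plain \xff keys need this option
    out = {}
    for kv in tr.get_range(SERVER_KEYS, b'\xff/serverKeys0'):
        rest = kv.key[len(SERVER_KEYS):]
        server_id, shard_begin = rest[:16].hex(), rest[17:]
        out.setdefault(server_id, []).append(shard_begin)
    return out

@fdb.transactional
def storage_addresses_by_role_id(tr):
    # status json lists each process's roles; storage roles carry an "id"
    # which I believe is a hex prefix of the same UID.
    status = json.loads(tr[b'\xff\xff/status/json'])
    out = {}
    for proc in status['cluster']['processes'].values():
        for role in proc.get('roles', []):
            if role.get('role') == 'storage' and 'id' in role:
                out[role['id']] = proc.get('address')
    return out

servers = shard_begins_by_server(db)
roles = storage_addresses_by_role_id(db)
for server_id, begins in servers.items():
    address = next((addr for role_id, addr in roles.items()
                    if server_id.startswith(role_id)), 'unknown')
    print(server_id, address, len(begins), 'serverKeys entries')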

“the writes are also suffering a lot and durability lag is rising because they’re sequential”

This is true for the default ssd-2 storage engine, but not for ssd-redwood-1, which can provide 10x or more write throughput on network-attached block devices due to parallelization of IOs on the update path.

Using Redwood won’t fix hotspotting fundamentally (data movement must do that), but it will give you far more write and read throughput per Storage Server.
