Query hotspotting on Directory Layer's metadata subspace

Hey, I use FDB 6.2 and all requests go through the directory layer. I noticed I have pretty bad storage process query hotspotting due to the heavy usage of the directory layer. One of my storage processes is always CPU-bound, and its total_queries metric in status json is ~70x that of the other storage processes. I ran the transaction profiler analyzer and found that all the read requests were landing in the directory layer's nodeSS metadata subspace. From the data distributor internals, it seems like data distribution is based solely on shard size and write rate, not read rate, which is why the entire subspace ends up on one storage process. I'm curious if anyone has run into this and has any workarounds?

One thing I've had to rule out is caching the mapping in memory. I can't rely on this alone because a remote client might delete/move directories, and I haven't found an out-of-the-box way to validate a []byte -> []string directory prefix mapping without going through the directory layer. I also can't store my own reverse mapping (for example, writing the []string directory path at the start of each directory subspace) because:

  1. We have a lot of query patterns that range scan the entire directory, so we can’t write inside an existing directory
  2. We use the default AllKeys for contentSS so it’s unsafe to store keys outside a directory

Is it possible to manually move data? Then I could split the nodeSS shard manually for my most CPU-bound clusters. Also, would anyone ever consider making nodeSS replication a feature in the directory layer, so that when a directory is created/moved/removed it updates all nodeSS replicas, and when you fetch a directory subspace it randomly queries one of them?

Just wanted to give a quick update on what we ended up doing:

We checked that the initial key of every directory (equivalent to dirSubspace.Bytes()) is not being used, which makes sense as we encode everything in tuples. We added an in-memory cache and, at the initial key of every directory, wrote the directory path as a reverse mapping. We update the cache and the reverse-mapping key's value when a directory is moved or deleted, or when the reverse-mapping value doesn't match.

So, for example, the read path to get the directory subspace for my/dir is (a Go sketch follows the list):

  1. Check cache for my/dir, say the value is \x01\x02
  2. Read key \x01\x02 and confirm that the value is my/dir
  3. If the value does not match, get the subspace from the directory layer, say \x01\x03. Then update the in-memory cache and, if the tx is not read-only, write my/dir at key \x01\x03
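
For reference, here is a rough Go sketch of that flow. It is only illustrative: the resolveDir name, the process-local map, and the "/"-joined cache key are placeholders rather than our exact code, and error handling is trimmed down.

package dircache

import (
	"bytes"
	"strings"
	"sync"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/directory"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{} // "my/dir" -> directory prefix, e.g. \x01\x02
)

// resolveDir returns the prefix for path, validating any cached value
// against the reverse-mapping key stored at the directory's initial key.
func resolveDir(tr fdb.Transaction, path []string, readOnly bool) ([]byte, error) {
	joined := strings.Join(path, "/")

	// 1. Check the in-memory cache.
	mu.RLock()
	prefix, ok := cache[joined]
	mu.RUnlock()

	if ok {
		// 2. Read the directory's initial key and confirm it still maps to this path.
		val, err := tr.Get(fdb.Key(prefix)).Get()
		if err != nil {
			return nil, err
		}
		if bytes.Equal(val, []byte(joined)) {
			return prefix, nil
		}
	}

	// 3. Cache miss or stale entry: fall back to the directory layer.
	dir, err := directory.CreateOrOpen(tr, path, nil)
	if err != nil {
		return nil, err
	}
	prefix = dir.Bytes()

	mu.Lock()
	cache[joined] = prefix
	mu.Unlock()

	if !readOnly {
		// Refresh the reverse mapping at the directory's initial key.
		tr.Set(fdb.Key(prefix), []byte(joined))
	}
	return prefix, nil
}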

It is still the same number of reads, but they are now much better distributed since we aren't hitting the directory layer shard on every request. Our CPU hotspotting has greatly improved since we enabled this.

Also came across the Consistent Caching talk (Consistent Caching in FoundationDB - Xin Dong, Apple & Neelam Goyal, Snowflake - YouTube) which looks like it would totally take care of this. Super exciting stuff! Any idea when this will be released? Is it part of 6.3?


That's an interesting workaround.

I've had many issues over the years with the Directory Layer, related to the sequential read latency required to get a subspace prefix, and to caching (which is unsafe if not done very carefully). I've tried multiple approaches to fix this, including breaking changes to the DL API, but I ended up rolling back those changes because they made the DL unsafe to use in combination with other tools (that don't know about the API changes).

The best compromise I found was to defer the reads to the DL until the end of the transaction, so as to get rid of the extra up-front latency, and to only commit the transaction if the result matches what the cache expected (and if not, retry the transaction). But this still induces a hot spot in the DL's key subspace.
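
In the Go binding's terms, the shape of that is roughly the following. This is a sketch only: withCachedDir and errStaleDirectoryCache are made-up names, and a real version would also refresh the caller's cache before retrying.

package dircache

import (
	"bytes"
	"errors"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/directory"
)

var errStaleDirectoryCache = errors.New("cached directory prefix is stale")

// withCachedDir runs body against the cached prefix immediately, and only
// consults the directory layer at the end of the transaction. If the cached
// prefix turns out to be stale, an error is returned so nothing is committed.
func withCachedDir(db fdb.Database, path []string, cachedPrefix []byte,
	body func(tr fdb.Transaction, prefix []byte) error) error {

	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		// Do the real work up front, without waiting on the DL lookup.
		if err := body(tr, cachedPrefix); err != nil {
			return nil, err
		}
		// Deferred check: resolve the path through the DL and compare.
		dir, err := directory.CreateOrOpen(tr, path, nil)
		if err != nil {
			return nil, err
		}
		if !bytes.Equal(dir.Bytes(), cachedPrefix) {
			return nil, errStaleDirectoryCache
		}
		return nil, nil
	})
	return err
}

Returning errStaleDirectoryCache aborts the transaction without committing, so the caller can refresh its cached prefix from the DL and retry.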

Your idea of adding a reverse key into each subspace is technically a breaking change to the DL contract: any external client or tool that creates a directory subspace by itself would not insert this key, and your cache check would fail. Note that the DL API in most bindings usually allows the caller to implicitly create the directory if it is missing, so even if a tool only intended to "read", it may accidentally cause the subspace to be created if it runs before your application's deployment script.
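
To make that concrete in the Go binding (just an illustration; most bindings expose an equivalent pair of calls):

package main

import (
	"fmt"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/directory"
)

func main() {
	fdb.MustAPIVersion(620)
	db := fdb.MustOpenDefault()

	// Read-only intent: Open fails if my/dir does not exist yet.
	if _, err := directory.Open(db, []string{"my", "dir"}, nil); err != nil {
		fmt.Println("directory missing:", err)
	}

	// Convenience call: CreateOrOpen silently creates my/dir if it is
	// missing, allocating a prefix without writing any reverse-mapping key.
	if _, err := directory.CreateOrOpen(db, []string{"my", "dir"}, nil); err != nil {
		fmt.Println("create-or-open failed:", err)
	}
}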

Also, there are some layers that use the subspace prefix key itself to store some sort of metadata, so that would collide with your reverse mapping key:

(PREFIX,) = { layer stores some metadata here }   // <-- spot already occupied!
(PREFIX, (123,)) = SOME DATA
(PREFIX, (456,)) = SOME OTHER DATA
(PREFIX, (789,)) = SOME MORE DATA

Using (PREFIX,) + \xFF could also be an issue because, even though the tuple encoding does not use \xFF as a header byte, other key encodings might, like someone just appending raw UUIDs or using some other compact key encoding.

But if you have complete control over the content of your subspaces, and control or audit all tools/scripts that could touch them, then you should definitely combine your reverse-mapping key with deferring the read until the end of the transaction to reduce the latency even further!

Ah yes, I should've added that: we have a wrapper microservice around FDB, and we blocked access to the prefix key as a breaking change.