Troubleshooting queue build up

A few other things that could cause this sort of problem:

  • Your clients could be creating a read hotspot on that storage server. Reads are prioritized higher than writes, so too many reads can cause a storage server to lag behind. Be skeptical of this if you’re at 100% CPU usage.
  • You could have a noisy neighbor on that one machine.
  • FDB will split shards based on write bandwidth, but I don’t think it then tries to distribute those write-hot shards well. It’s possible that your one storage server is part of multiple teams that were assigned write-hot shards, and excluding it would force a shard re-assignment that would resolve the issue.
  • A cleaner job doing clear ranges could cause a lot of deferred work. I don’t think we’ve seen FDB6 have saturation issues from a large clear range, but it’s possible it’s AWS/EBS specific? (See the thread “Large clear range performance”.)
  • Data distribution can sometimes hurt performance if it’s too aggressive. I’ve heard of other people on EBS sometimes needing to dial it down.

One storage server topping out at ~6MB/s of writes sounds reasonable to me. (Lookin’ forward to that faster storage engine, @stevedhams)
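
If it helps to confirm that it really is just one storage server falling behind, and what write rate it's actually sustaining, here's a rough sketch that ranks storage roles by queue size from the output of `status json`. The field names (`input_bytes`, `durable_bytes`, `data_lag`) are what I'd expect to see on storage roles, but treat them as assumptions and check them against your own cluster's output; `storage_stats` and `fetch_status` are just names made up for this example.

```python
import json
import subprocess


def storage_stats(status):
    """Return one row per storage role in a parsed `status json` document,
    sorted so the storage server with the largest queue comes first.

    Assumed field names (verify against your FDB version):
      role["input_bytes"]   -> {"counter": total bytes received, "hz": bytes/s}
      role["durable_bytes"] -> {"counter": total bytes made durable}
      role["data_lag"]      -> {"seconds": how far behind the server is}
    """
    rows = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            input_bytes = role.get("input_bytes", {})
            durable_bytes = role.get("durable_bytes", {})
            # Queue size: bytes received but not yet made durable.
            queued = input_bytes.get("counter", 0) - durable_bytes.get("counter", 0)
            rows.append({
                "address": proc.get("address", "unknown"),
                "queue_mb": queued / 1e6,
                "write_mb_per_s": input_bytes.get("hz", 0.0) / 1e6,
                "data_lag_s": role.get("data_lag", {}).get("seconds", 0.0),
            })
    return sorted(rows, key=lambda r: r["queue_mb"], reverse=True)


def fetch_status():
    """Run `fdbcli --exec "status json"` and parse its output.

    Assumes fdbcli is on PATH, can find the cluster file, and prints
    nothing but the JSON document on stdout.
    """
    result = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Calling `storage_stats(fetch_status())` on a machine with cluster access gives a list with the worst storage server first. If one address has a queue far larger than the rest while its `write_mb_per_s` looks similar to its peers, that points more toward the machine itself (noisy neighbor, disk) than toward a write hotspot.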