Storage Server CPU bottleneck - Growing data lag

You seem to be running a very write-heavy workload. As the disk doesn’t seem to be saturated, my guess is that the write-queue within FDB is the problem.

The disk interface uses AIO with O_DIRECT, and we only allow 64 operations to be queued at a time, with reads getting higher priority. This means that the max throughput is somewhere around 4KB * 64 / disk-latency. In other words: your performance might be limited by disk latency instead of disk throughput.
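
To make that bound concrete, here is a rough back-of-the-envelope sketch. It assumes every operation is a single 4 KB page, that all 64 slots are always full, and that every slot is used for writes (in practice reads take priority, so the real write ceiling is lower). The function name and the sample latencies are just illustrative.

```python
def max_write_throughput(queue_depth: int, op_size_bytes: int, latency_s: float) -> float:
    """Upper bound in bytes/s when at most `queue_depth` operations are in flight.

    Each slot completes one operation per `latency_s`, so the whole queue
    completes queue_depth / latency_s operations per second.
    """
    return queue_depth * op_size_bytes / latency_s


QUEUE_DEPTH = 64   # limit mentioned above
PAGE_SIZE = 4096   # 4 KB per operation (assumption)

# Sample per-operation disk latencies in milliseconds (illustrative values).
for latency_ms in (0.1, 1.0, 5.0):
    bound = max_write_throughput(QUEUE_DEPTH, PAGE_SIZE, latency_ms / 1000)
    print(f"{latency_ms} ms latency -> {bound / 1e6:.1f} MB/s")
```

At 1 ms per operation the ceiling is roughly 262 MB/s, and it drops linearly as latency goes up, even if the disk could sustain far more sequential bandwidth.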

There are a few things you can do:

  1. If you have enough CPUs, I would suggest starting more storage servers per disk. You need one CPU core per storage server. You could also try to oversubscribe and have two storage servers run on one CPU, but that might give you weird read-latency behavior.
  2. There’s a knob called MAX_OUTSTANDING which controls how many operations fdb sends to the storage. You can try setting this to a higher value.
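
For the first option, a minimal sketch of what the relevant part of `foundationdb.conf` could look like with two storage processes on one disk. The ports and the data path are assumptions for illustration; adapt them to your layout and make sure both data directories end up on the same disk.

```ini
# Sketch only: ports and paths are illustrative assumptions.
[fdbserver]
# $ID expands to the process ID from the section name (4500, 4501),
# so each process gets its own directory under the same mount.
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4500]
class = storage

[fdbserver.4501]
class = storage
```

Each `[fdbserver.<port>]` section starts one process, so this only helps if there is a spare core for each of them.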

I think the first suggestion has a higher probability of success, so if you can, I would try it first.