Storage Server CPU bottleneck - Growing data lag

Continuing the discussion from here CPU limited storage processes, unfortunately I still have several problems with my FoundationDB cluster.

  1. The fdb cluster worked perfectly until at some point 3-5 days after the cluster was deployed and scaled up, one of the storage processes got progressively increasing data lag (now at ~1h) from which it never recovered.
  2. After several hours, another storage process seems to have sustained high CPU load without the data lag
  3. Disk IO wise the cluster seems healthy, with storage processes having about 30-40% load and the logs having about 60-80% load
  4. The CPU load on the storage processes is 50-85% with 2 processes stuck at 85%
  5. Because of the increased moving data in flight and moving data in queue metrics 1GB and 20GB respectively, I added two process knobs in order to speed up the repartitioning, which so far hasn’t made any noticeable difference:
customParameters:
      - knob_relocation_parallelism_per_source_server=4
      - knob_fetch_keys_parallelism_bytes=5e6

Questions:

  1. Does storage server performance degrades considerably as ssds store more and more data even though they are far from being full?
  2. Are there any other knobs that would help in a situation like that?
  3. What would be a recovery plan for a situation that I described above? Just add more storage servers and wait?
  4. Which are the most common pitfalls design wise which lead to high CPU load?

Here are some details:

  1. When the data lag started

  1. After one day of waiting

  1. The cluster now
Using cluster file `/var/dynamic-conf/fdb.cluster'.

The database is available, but has issues (type 'status' for more information).

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Unable to start batch priority transaction after 5 seconds.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Desired Proxies        - 3
  Desired Resolvers      - 1
  Desired Logs           - 3

Cluster:
  FoundationDB processes - 22
  Zones                  - 22
  Machines               - 22
  Memory availability    - 6.8 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 2 machines
  Server time            - 07/08/20 13:41:03

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 22.368 GB
  Sum of key-value sizes - 176.738 GB
  Disk space used        - 666.725 GB

Operating space:
  Storage server         - 271.3 GB free on most full server
  Log server             - 272.5 GB free on most full server

Workload:
  Read rate              - 787 Hz
  Write rate             - 46351 Hz
  Transactions started   - 58 Hz
  Transactions committed - 48 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.112.1.3:4501        ( 51% cpu; 34% machine; 0.009 Gbps; 25% disk IO; 3.3 GB / 7.2 GB RAM  )
  10.112.2.75:4501       (  1% cpu; 19% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 9.0 GB RAM  )
  10.112.2.76:4501       ( 12% cpu; 27% machine; 0.067 Gbps;  0% disk IO; 0.4 GB / 9.1 GB RAM  )
  10.112.3.3:4501        ( 86% cpu; 55% machine; 0.017 Gbps; 29% disk IO; 3.6 GB / 6.8 GB RAM  )
  10.112.17.6:4501       ( 54% cpu; 30% machine; 0.012 Gbps; 27% disk IO; 3.4 GB / 7.2 GB RAM  )
  10.112.18.6:4501       (  6% cpu;  6% machine; 0.036 Gbps; 60% disk IO; 2.0 GB / 7.2 GB RAM  )
  10.112.19.6:4501       ( 61% cpu; 32% machine; 0.013 Gbps; 33% disk IO; 3.4 GB / 7.2 GB RAM  )
  10.112.21.3:4501       (  6% cpu; 19% machine; 0.045 Gbps; 38% disk IO; 1.9 GB / 7.2 GB RAM  )
  10.112.22.6:4501       ( 54% cpu; 36% machine; 0.010 Gbps; 30% disk IO; 3.5 GB / 7.1 GB RAM  )
  10.112.23.4:4501       ( 88% cpu; 55% machine; 0.019 Gbps; 24% disk IO; 3.7 GB / 7.2 GB RAM  )
  10.112.24.7:4501       ( 80% cpu; 43% machine; 0.014 Gbps; 32% disk IO; 3.9 GB / 7.1 GB RAM  )
  10.112.26.3:4501       (  8% cpu; 20% machine; 0.036 Gbps; 63% disk IO; 2.0 GB / 7.2 GB RAM  )
  10.112.27.7:4501       ( 84% cpu; 51% machine; 0.015 Gbps; 29% disk IO; 3.6 GB / 7.1 GB RAM  )
  10.112.28.3:4501       (  1% cpu; 16% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 7.1 GB RAM  )
  10.112.97.10:4501      (  1% cpu; 18% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 7.2 GB RAM  )
  10.112.98.253:4501     (  2% cpu; 22% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 9.2 GB RAM  )
  10.112.98.254:4501     (  1% cpu; 21% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 9.2 GB RAM  )
  10.112.140.7:4501      ( 14% cpu; 29% machine; 0.054 Gbps;  0% disk IO; 0.3 GB / 8.3 GB RAM  )
  10.112.140.8:4501      (  3% cpu; 30% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 8.2 GB RAM  )
  10.112.140.64:4501     (  9% cpu; 27% machine; 0.032 Gbps;  1% disk IO; 0.3 GB / 8.2 GB RAM  )
  10.112.142.24:4501     ( 12% cpu; 19% machine; 0.065 Gbps;  0% disk IO; 0.3 GB / 9.2 GB RAM  )
  10.112.142.25:4501     (  3% cpu; 19% machine; 0.001 Gbps;  0% disk IO; 0.3 GB / 9.2 GB RAM  )

Coordination servers:
  10.112.1.3:4501  (reachable)
  10.112.17.6:4501  (reachable)
  10.112.23.4:4501  (reachable)
  10.112.26.3:4501  (reachable)
  10.112.28.3:4501  (reachable)

Client time: 07/08/20 13:40:58

fdb> setclass
There are currently 22 processes in the database:
  10.112.1.3:4501: storage (command_line)
  10.112.2.75:4501: stateless (command_line)
  10.112.2.76:4501: stateless (command_line)
  10.112.3.3:4501: storage (command_line)
  10.112.17.6:4501: storage (command_line)
  10.112.18.6:4501: log (command_line)
  10.112.19.6:4501: storage (command_line)
  10.112.21.3:4501: log (command_line)
  10.112.22.6:4501: storage (command_line)
  10.112.23.4:4501: storage (command_line)
  10.112.24.7:4501: storage (command_line)
  10.112.26.3:4501: log (command_line)
  10.112.27.7:4501: storage (command_line)
  10.112.28.3:4501: log (command_line)
  10.112.97.10:4501: stateless (command_line)
  10.112.98.253:4501: stateless (command_line)
  10.112.98.254:4501: stateless (command_line)
  10.112.140.7:4501: stateless (command_line)
  10.112.140.8:4501: stateless (command_line)
  10.112.140.64:4501: stateless (command_line)
  10.112.142.24:4501: stateless (command_line)
  10.112.142.25:4501: stateless (command_line)

Is it possible that the write load is saturating one of the SS just above the edge and causing it to fall behind?

While high cpu% could indicate that the SS is not getting free cycles to perform the writes, but at 85% cpu, there should be enough remaining for SS to not be starved.

How does the disk io activity look like (iostat -x 2)? It might be good to rule out the disk saturation (use fio to benchmark disk iops) before going to other aspects like read/write hotspots and cpu saturation.

Here is the load of the SS which is lagging behind with iostat -x 2:

Linux 4.19.112+ (gke-dogfood-io5-1ff82003-tw3d)         07/08/20        _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.24    0.01    5.40    3.12    0.00   59.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.01 13698.38   49.96 6772.36   199.90 81881.90    24.06     2.30    0.35    0.19    0.35   0.04  24.23
sda               0.03     0.90    0.07    1.94     1.98    22.14    23.94     0.01    4.92    3.05    4.99   0.89   0.18
dm-0              0.00     0.00    0.01    0.00     0.54     0.00    88.37     0.00    5.74    5.74    0.00   1.96   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          40.97    0.00    6.11    2.29    0.00   50.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00 18843.00   20.50 9046.50    82.00 111544.00    24.62     3.32    0.40    0.20    0.40   0.03  29.25
sda               0.00     0.00    0.00    1.50     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          36.99    0.00    5.36    3.06    0.00   54.59

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00 19435.00   15.00 9852.50    60.00 117126.00    23.75     3.34    0.40    0.20    0.40   0.03  32.55
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Which I don’t think is straining the GCE local SCSI SSD at all.
Here is a list of the GCE specs for the drives:

Disk io looks fine. But is this storage server constantly writing more than 100MB/s and still behind? That seems like a lot of write throughput. Are the other healthy SS doing similar amount of writes?

Can you check the iostat output over longer duration?

I left iostat running for 10 min on the lagging storage server and the results were similar with around 95-120 MB/s writes. I then checked one of the storage servers with the least amount of CPU load at the time and the results were lower on average, somewhere around 80MB/s, with a range from 60 to 140MB/s writes. If I remember correctly in another thread, it was mentioned that fdb had an artificial IOPS limit in order not to saturate SATA SSDs, maybe this limit does interfere?

AFAIK, FDB does not apply any artificial limit to throttle down the iops.

If there is any throttling, it could be due to the disk itself - from the specs you’ve attached, there seems to be a limit of sustained WIOPS of 4680. It could be that on the lagging SS, it may just be going over the limits.

In any case, this seems like a lot of write per server. Are you really doing so much logical writes (aggregated across all SS), from application point of view?

You seem to be running a very write-heavy workload. As the disk doesn’t seem to be saturated, my guess is that the write-queue within FDB is the problem.

The disk interface uses AIO with O_DIRECT, and we only allow to queue 64 operations at a time and reads will get a higher priority. This means that the max throughput is somewhere around 4KB * 64 * disk-latency - or in other words: your performance might be limited by disk latency instead of disk throughput…

There are a few things you can do:

  1. If you have enough CPUs, I would suggest to start more storage servers per disk. You need one CPU core per disk - you could also try to oversubscribe and have two storages run on one CPU, but that might give you weird read-latency behavior, which is also weird.
  2. There’s a knob called MAX_OUTSTANDING which controls how many operations fdb sends to the storage. You can try to set this to a higher value.

I think the first suggestion has a higher probability for success, so if you can I would try this first.

@v-pap can you post the iostat output over a longer duration? The field avgqu-sz, which measures the outstanding queue-depth on the disk, seems quite low (and does not look realistic to support such high iops being observed).

@gaurav This is again for the lagging storage server, I have grep’d the sdb to save space.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00 16709.50   13.00 7995.50    52.00 98814.00    24.69     2.90    0.39    0.23    0.39   0.03  26.35
sdb               0.00 16805.00   13.50 7764.50    54.00 98272.00    25.28     3.07    0.40    0.26    0.40   0.03  26.95
sdb               0.00 19371.50   10.50 8771.50    42.00 112566.00    25.65     3.23    0.40    0.19    0.40   0.03  30.20
sdb               0.00 17859.50    8.00 8540.00    32.00 105592.00    24.71     3.32    0.39    0.25    0.39   0.03  29.35
sdb               0.00 19178.50    8.00 8919.00    32.00 112384.00    25.19     3.46    0.38    0.19    0.38   0.04  31.90
sdb               0.00 17851.50   16.50 8285.50    66.00 104542.00    25.20     3.60    0.40    0.21    0.40   0.04  29.45
sdb               0.00 19606.50   15.00 9467.00    60.00 116288.00    24.54     4.04    0.40    0.23    0.40   0.03  32.85
sdb               0.00 18475.00   22.00 8760.50    88.00 108936.00    24.83     3.33    0.40    0.18    0.40   0.04  30.80
sdb               0.00 18283.50   25.00 8752.00   100.00 108136.00    24.66     2.62    0.39    0.18    0.39   0.03  29.90
sdb               0.00 17794.00   17.00 8216.00    68.00 104034.00    25.29     3.11    0.40    0.21    0.40   0.04  29.40
sdb               0.00 19139.00   15.50 8764.50    62.00 111608.00    25.44     3.47    0.40    0.19    0.40   0.04  30.85
sdb               0.00 18295.50   16.00 8691.00    64.00 107940.00    24.81     3.57    0.42    0.22    0.42   0.03  30.25
sdb               0.00 18305.50    9.50 8471.00    38.00 107100.00    25.27     3.57    0.41    0.21    0.41   0.04  30.10
sdb               0.00 18562.50    9.50 9088.00    38.00 110596.00    24.32     3.27    0.42    0.26    0.42   0.03  30.80
sdb               0.00 18951.50   19.00 8798.50    76.00 110994.00    25.19     3.42    0.39    0.18    0.40   0.03  30.40
sdb               0.00 18863.50   13.50 8917.00    54.00 111116.00    24.90     3.03    0.38    0.22    0.38   0.03  31.00
sdb               0.00 19127.00   25.50 8947.50   102.00 112292.00    25.05     3.60    0.41    0.18    0.41   0.03  31.05
sdb               0.00 18379.00    4.00 8441.50    16.00 107276.00    25.41     3.53    0.39    0.25    0.39   0.04  30.20
sdb               0.00 17855.00   10.00 8278.00    40.00 104526.00    25.23     3.35    0.40    0.20    0.40   0.03  28.80
sdb               0.00 18404.00   14.50 8439.50    58.00 107368.00    25.41     3.34    0.40    0.21    0.40   0.04  30.15
sdb               0.00 21784.50    5.00 8731.50    20.00 121978.00    27.93     3.81    0.40    0.20    0.40   0.04  33.10
sdb               0.00 17068.00   40.00 8871.00   160.00 103830.00    23.34     3.47    0.42    0.19    0.42   0.03  30.15
sdb               0.00 20704.50   12.00 8975.00    48.00 118650.00    26.42     3.55    0.42    0.21    0.42   0.04  32.60
sdb               0.00 19638.00   28.50 8335.00   114.00 111946.00    26.80     3.33    0.41    0.18    0.41   0.04  31.30
sdb               0.00 19215.50   61.00 9510.00   244.00 114898.00    24.06     4.17    0.40    0.17    0.41   0.03  31.40
sdb               0.00 18551.00   21.00 10212.50    84.00 115024.00    22.50     4.24    0.40    0.19    0.40   0.03  31.20
sdb               0.00 19656.00   14.00 9940.00    56.00 118400.00    23.80     3.75    0.41    0.21    0.41   0.03  32.50
sdb               0.00 18036.00    6.50 8312.00    26.00 105318.00    25.33     3.56    0.42    0.15    0.42   0.04  29.90
sdb               0.00 18500.50    4.00 8895.00    16.00 109644.00    24.65     3.86    0.43    0.25    0.43   0.04  31.75
sdb               0.00 18298.50   18.50 8814.50    74.00 108446.00    24.57     3.51    0.38    0.19    0.38   0.03  29.40
sdb               0.00 17564.00   18.50 7841.50    74.00 101616.00    25.88     3.50    0.41    0.19    0.41   0.04  29.25
sdb               0.00 17852.00   11.50 8499.00    46.00 105398.00    24.78     3.33    0.39    0.17    0.39   0.03  29.20
sdb               0.00 18272.50   12.50 8433.00    50.00 106816.00    25.31     3.05    0.40    0.20    0.40   0.04  29.85
sdb               0.00 21082.00   11.00 8662.00    44.00 118908.00    27.43     3.54    0.41    0.23    0.41   0.04  32.95
sdb               0.00 16902.00   32.00 8746.00   128.00 102586.00    23.40     3.51    0.39    0.17    0.39   0.03  28.55
sdb               0.00 21131.00    9.00 9686.00    36.00 123202.00    25.42     3.35    0.40    0.17    0.40   0.03  32.80
sdb               0.00 19789.00   10.00 8959.00    40.00 115104.00    25.68     3.86    0.41    0.25    0.41   0.04  32.55
sdb               0.00 20015.00   26.50 11077.00   106.00 124364.00    22.42     4.44    0.42    0.19    0.42   0.03  33.90
sdb               0.00 17723.00   10.00 8621.50    40.00 105372.00    24.42     3.11    0.41    0.20    0.41   0.04  30.50
sdb               0.00 18684.00   14.50 8719.00    58.00 109606.00    25.11     3.37    0.41    0.21    0.41   0.03  29.90
sdb               0.00 17420.00    8.00 8196.50    32.00 102460.00    24.98     3.55    0.42    0.19    0.42   0.04  28.75
sdb               0.00 19008.00   12.00 8269.50    48.00 109104.00    26.36     3.09    0.40    0.21    0.40   0.04  29.95

Indeed my use case requires a lot of write throughput since I am storing cloud machine metrics.

Unfortunately, it seems like the 1st option wont work in my case, since I am deploying this cluster with the Kubernetes operator which doesn’t seem to support this functionality yet. But I will try the 2nd option.

There are 8 SS. With each SS doing 100 MB/s, this amounts to 3TB/hr. But the posted key-Val size is nowhere near that. Can you think what could explain this discrepancy? How large is each key-val?

You are comparing disk writes to FDB key/value writes. They are very different, and this difference is often referred to as “write amplification”. FDB deals with KV pairs, but those KV pairs are stored in blocks on disk, and to update or insert a KV pair means to write the entire block it resides in or will reside in based on sorted order.

In the worst case, random writes of small values, each FDB kv pair written requires one 4KB disk block to be updated on disk. In the storage engine, on order to update that block atomically with other blocks in the commit, there are actually two total 4KB writes that must happen to change that single record, and then there are other writes that can happen was well for internal record keeping and slack space management.

To put some numbers on a worst-case scenario, writing 16 byte keys with 8 byte values where the keys are uniformly random across the key space and total 1 megabyte worth of 24-byte KV pairs per second would result in about 350MB of actual disk writes per second on the storage servers with the ssd storage engine (or basically any other btree with 4k pages). In triple replication, that number triples as well.

Agreed. I was trying to allude to this possibility of a very high write amplification factor due to many small-sized kv writes spread randomly around. I did miss to factor in replication factor.
If this is the limiting problem, then a change in data model will be needed at application side.

Thanks for adding theses details!

I think I have mentioned my data model already in another thread but I will mention it again just in case you have any idea or recommendation on how to improve it. Each monitored resource has its own directory path: monitored_resource_id/resolution where resource_id is its own unique id and resolution is a string: second, minute, hour or day. Under these directories exist key-value pairs of the form key: (metric, timeinfo) with value: a 64bit int float etc. Of course for other resolutions other than second, I use atomic operations as a way to calculate aggregations on the fly. So each metric is essentially written one time in the seconds resolution and for each other resolution 4 values are updated: 2 counters (count, sum) and 2 other values for max and min (all with atomic operations), in total 1 + 3*4 = 13 writes/counter updates. Each write transaction to the fdb is essentially scraped metric data coming every 5 seconds for each monitored resource. So each write transaction contains around 200 metrics which are to be written under the directory path of the appropriate resource. I guess that these inserts as the data is growing are getting further apart since more key-values are inserted between metrics. I had thought of changing the key to something like (timeinfo, metric) so inserts would possibly require less disk writes, but in this case querying specific metrics would get very expensive.

This is an interesting modeling problem. As a quick modification, is it possible to buffer 5-sec values in memory for a minute, before writing them to fdb. You can probably create one transaction per resource/resolution/metric to write all the raw/derived values.

While this may put more complexity and ram requirement on your layer, you will know if it helps in reducing your write load by ~10x due to the locality of kvs per transaction.

Just curious what really is the amount of data in bytes/sec you are writing, as measured by the application layer? It will be interesting to estimate current write application factor.

That would be an interesting approach but I am afraid at least for my case this would require something like a distributed memory system, since there are many writer processes which may have different bundles of metrics etc.

As far as bytes/sec I don’t have any numbers at the moment but I will try to measure it.

I was suggesting that you try out the buffering approach as a test only in some kind of test workload to see if it helps. That may give some pointers for future.