CPU limited storage processes

I am running a cluster with 10 storage processes and 4 log processes (one of which remains idle), each on its own machine. Each machine in the cluster has 4 cores, 12GB of RAM, and one local NVMe SSD. This setup can't handle more than 90K writes per second. All the storage processes in the cluster seem to be CPU limited: the load on the core that fdbserver is assigned to is close to 100%, especially when I increase the write rate, while disk IO stays relatively low, never going beyond 30%. The log processes seem fine, at around 10% CPU load and 70-80% disk IO. After getting feedback from several discussions here (Scalable time series database / Improving write throughput), I have tried to optimize my writes by eliminating the need for reads in the same transaction (by using atomic ops), and at the same time I keep my write transactions in the 10KB range. Do you have any recommendations on where I should focus my optimization efforts?
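For anyone curious about the atomic-op trick mentioned above: FoundationDB's ADD mutation lets you bump a counter without reading it first, which removes both the read round-trip and the read conflict range from the transaction (with the Python bindings this is roughly `tr.add(key, struct.pack('<q', 1))` instead of a get followed by a set). Below is a stdlib-only sketch of the server-side ADD semantics as I understand them (operand and stored value treated as little-endian integers, added with wraparound, result truncated/extended to the operand's width); this is an illustration, not the actual fdbserver code.

```python
import struct

def atomic_add(stored: bytes, operand: bytes) -> bytes:
    """Simulate FoundationDB's ADD atomic op (sketch).

    Both inputs are little-endian integers; the stored value is
    zero-extended or truncated to the operand's width, and the sum
    wraps around at that width.
    """
    width = len(operand)
    a = int.from_bytes(stored.ljust(width, b'\x00')[:width], 'little')
    b = int.from_bytes(operand, 'little')
    total = (a + b) % (1 << (8 * width))
    return total.to_bytes(width, 'little')

# An 8-byte counter, matching the 8-byte values in this workload:
v = struct.pack('<q', 41)
v = atomic_add(v, struct.pack('<q', 1))
print(struct.unpack('<q', v)[0])  # -> 42
```

Because the client never reads the key, concurrent increments to the same counter also stop conflicting with each other.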

Do your storage server processes have any other roles recruited on them? (see Roles details in log files)

How large are your keys and values?

What is your replication factor?

  • Yes, they have other roles recruited on them. In a previous deployment I had set the stateless process count to some value in the Kubernetes operator, but those machines were mostly idle, so in the next deployment I set it to -1.

  • The keys are 60-100 bytes in size, the values are 8 bytes

  • The replication factor is 2
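Some rough back-of-envelope arithmetic with these numbers (a sketch, using the upper end of the stated key size) shows why the disks would be mostly idle: the logical key-value write bandwidth at 90K writes/s is tiny, so the per-operation CPU cost on the storage servers, not disk bandwidth, is the plausible limit.

```python
# Rough arithmetic on the numbers from this thread (sketch):
writes_per_sec = 90_000
key_bytes = 100      # upper end of the reported 60-100 byte range
value_bytes = 8
replication = 2      # redundancy_mode: double

bytes_per_sec = writes_per_sec * (key_bytes + value_bytes) * replication
print(bytes_per_sec / 1e6)  # -> 19.44 (MB/s of logical key-value writes)
```

Even with write amplification from the storage engine, that is far below what a local NVMe SSD can sustain.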

Each fdbserver process is essentially single-threaded, so a typical deployment runs two storage server processes (two cores) against one disk for better disk IOPS utilization.

apiVersion: v1
items:
- apiVersion: apps.foundationdb.org/v1beta1
  kind: FoundationDBCluster
  metadata:
    creationTimestamp: "2020-05-26T13:53:06Z"
    generation: 11
    labels:
      app: fdb
      chart: mist-0.1.0
      controller-tools.k8s.io: "1.0"
      heritage: Tiller
      release: dogfood2
    name: dogfood2-mist-fdb
    namespace: mist2
    resourceVersion: "361948695"
    selfLink: /apis/apps.foundationdb.org/v1beta1/namespaces/mist2/foundationdbclusters/dogfood2-mist-fdb
    uid: 5a69cdfd-3c91-44aa-9164-ec39134efcec
  spec:
    automationOptions: {}
    configured: true
    connectionString: dogfood2_mist_fdb:08zDDL09MjHx6c5FT8jZn6yV0Z4kdoBg@10.112.51.8:4501,10.112.60.4:4501,10.112.55.5:4501
    customParameters:
    - knob_disable_posix_kernel_aio=1
    databaseConfiguration:
      redundancy_mode: double
      storage: 14
      storage_engine: ssd
      usable_regions: 1
    faultDomain:
      key: foundationdb.org/none
    mainContainer: {}
    podTemplate:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - name: foundationdb
          resources:
            limits:
              cpu: 3500m
              memory: 8Gi
            requests:
              cpu: 2000m
              memory: 4Gi
          securityContext:
            runAsUser: 0
        - name: foundationdb-kubernetes-sidecar
          securityContext:
            runAsUser: 0
        initContainers:
        - name: foundationdb-kubernetes-init
          resources: {}
          securityContext:
            runAsUser: 0
        securityContext:
          fsGroup: 0
        tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: fdb
    processCounts:
      stateless: -1
    runningVersion: 6.2.19
    sidecarContainer: {}
    storageClass: local-scsi
    version: 6.2.19
  status:
    databaseConfiguration:
      log_routers: -1
      log_spill: 2
      logs: 3
      proxies: 3
      redundancy_mode: double
      remote_logs: -1
      resolvers: 1
      storage_engine: ssd-2
      usable_regions: 1
    generations:
      reconciled: 11
    health:
      available: true
      fullReplication: true
    processCounts:
      log: 4
      storage: 14
    requiredAddresses:
      nonTLS: true
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

As you can see, the recruited processes have plenty of CPU cores to work with (2-4 cores), but from my measurements only one of those cores is being utilized. This is true even when that core's load is close to or at 100%.

I did an experiment in which I allowed the FoundationDB cluster to recruit as many stateless processes as it wants. But as you can see, the storage processes are still CPU limited. I don't really know why that's the case, so any input would be valuable.

  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 3
  Desired Resolvers      - 1
  Desired Logs           - 3

Cluster:
  FoundationDB processes - 26
  Zones                  - 26
  Machines               - 26
  Memory availability    - 8.3 GB per process on machine with least available
  Retransmissions rate   - 30 Hz
  Fault Tolerance        - 1 machine
  Server time            - 06/10/20 17:12:06

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 168.178 GB
  Sum of key-value sizes - 625.783 GB
  Disk space used        - 1.575 TB

Operating space:
  Storage server         - 242.2 GB free on most full server
  Log server             - 354.4 GB free on most full server

Workload:
  Read rate              - 4430 Hz
  Write rate             - 73889 Hz
  Transactions started   - 692 Hz
  Transactions committed - 646 Hz
  Conflict rate          - 1 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.112.12.222:4501     (  5% cpu; 47% machine; 0.001 Gbps;  0% disk IO; 0.3 GB / 8.3 GB RAM  )
  10.112.25.85:4501      ( 29% cpu; 28% machine; 0.078 Gbps;  0% disk IO; 0.4 GB / 8.8 GB RAM  )
  10.112.28.202:4501     ( 22% cpu; 40% machine; 0.079 Gbps;  0% disk IO; 0.4 GB / 10.4 GB RAM  )
  10.112.42.223:4501     ( 18% cpu; 11% machine; 0.080 Gbps;  0% disk IO; 0.3 GB / 9.9 GB RAM  )
  10.112.49.139:4501     (  5% cpu; 50% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 10.1 GB RAM  )
  10.112.51.8:4501       ( 65% cpu; 17% machine; 0.013 Gbps; 27% disk IO; 3.3 GB / 14.1 GB RAM  )
  10.112.52.28:4501      ( 83% cpu; 23% machine; 0.023 Gbps; 28% disk IO; 2.7 GB / 14.1 GB RAM  )
  10.112.53.5:4501       ( 18% cpu;  6% machine; 0.064 Gbps; 85% disk IO; 1.1 GB / 14.1 GB RAM  )
  10.112.54.5:4501       ( 18% cpu;  6% machine; 0.057 Gbps; 90% disk IO; 1.0 GB / 14.1 GB RAM  )
  10.112.55.10:4501      (111% cpu; 30% machine; 0.026 Gbps; 27% disk IO; 3.5 GB / 14.1 GB RAM  )
  10.112.56.10:4501      (  0% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.3 GB / 13.9 GB RAM  )
  10.112.57.20:4501      (114% cpu; 31% machine; 0.037 Gbps; 27% disk IO; 3.6 GB / 14.1 GB RAM  )
  10.112.58.21:4501      ( 91% cpu; 24% machine; 0.009 Gbps; 31% disk IO; 3.5 GB / 14.1 GB RAM  )
  10.112.59.6:4501       ( 15% cpu;  5% machine; 0.040 Gbps; 94% disk IO; 0.6 GB / 14.1 GB RAM  )
  10.112.60.13:4501      (110% cpu; 30% machine; 0.030 Gbps; 27% disk IO; 4.6 GB / 14.1 GB RAM  )
  10.112.61.4:4501       (100% cpu; 28% machine; 0.142 Gbps; 32% disk IO; 3.5 GB / 14.1 GB RAM  )
  10.112.62.3:4501       ( 96% cpu; 26% machine; 0.082 Gbps; 32% disk IO; 3.5 GB / 14.1 GB RAM  )
  10.112.63.4:4501       ( 57% cpu; 16% machine; 0.005 Gbps; 24% disk IO; 3.6 GB / 14.1 GB RAM  )
  10.112.64.3:4501       (109% cpu; 30% machine; 0.010 Gbps; 32% disk IO; 3.7 GB / 14.2 GB RAM  )
  10.112.65.3:4501       (103% cpu; 28% machine; 0.011 Gbps; 32% disk IO; 3.6 GB / 14.1 GB RAM  )
  10.112.70.37:4501      ( 90% cpu; 49% machine; 0.012 Gbps; 24% disk IO; 3.6 GB / 13.8 GB RAM  )
  10.112.71.3:4501       ( 73% cpu; 42% machine; 0.017 Gbps; 29% disk IO; 3.4 GB / 13.8 GB RAM  )
  10.112.72.3:4501       (106% cpu; 70% machine; 0.092 Gbps; 24% disk IO; 3.7 GB / 13.7 GB RAM  )
  10.112.76.65:4501      ( 18% cpu; 45% machine; 0.010 Gbps;  0% disk IO; 0.3 GB / 10.5 GB RAM  )
  10.112.82.20:4501      ( 22% cpu; 57% machine; 0.070 Gbps;  0% disk IO; 0.4 GB / 9.3 GB RAM  )
  10.112.83.7:4501       (  3% cpu; 11% machine; 0.001 Gbps;  0% disk IO; 0.4 GB / 10.6 GB RAM  )

Coordination servers:
  10.112.52.28:4501  (reachable)
  10.112.60.13:4501  (reachable)
  10.112.64.3:4501  (reachable)

Client time: 06/10/20 17:12:05

fdb> setclass
There are currently 26 processes in the database:
  10.112.12.222:4501: stateless (command_line)
  10.112.25.85:4501: stateless (command_line)
  10.112.28.202:4501: stateless (command_line)
  10.112.42.223:4501: stateless (command_line)
  10.112.49.139:4501: stateless (command_line)
  10.112.51.8:4501: storage (command_line)
  10.112.52.28:4501: storage (command_line)
  10.112.53.5:4501: log (command_line)
  10.112.54.5:4501: log (command_line)
  10.112.55.10:4501: storage (command_line)
  10.112.56.10:4501: stateless (command_line)
  10.112.57.20:4501: storage (command_line)
  10.112.58.21:4501: storage (command_line)
  10.112.59.6:4501: log (command_line)
  10.112.60.13:4501: storage (command_line)
  10.112.61.4:4501: storage (command_line)
  10.112.62.3:4501: storage (command_line)
  10.112.63.4:4501: storage (command_line)
  10.112.64.3:4501: storage (command_line)
  10.112.65.3:4501: storage (command_line)
  10.112.70.37:4501: storage (command_line)
  10.112.71.3:4501: storage (command_line)
  10.112.72.3:4501: storage (command_line)
  10.112.76.65:4501: stateless (command_line)
  10.112.82.20:4501: stateless (command_line)
  10.112.83.7:4501: stateless (command_line)

In your most recent numbers, you have 14 storage servers. Most are saturated, and all are busy. That’s good news, since it means your workload is distributed across many parts of the key space, and the client is offering sufficient concurrency to keep the cluster busy.

I suspect that your NVMe drives and network are basically idle in this setup; iostat and nload should be able to confirm this.

Can you try running with 2 and 4 storage servers per drive? If I’m right, that will increase your throughput.
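Outside the Kubernetes operator, running two storage servers on one drive in a plain foundationdb.conf would look something like the sketch below (ports and paths are illustrative): each fdbserver section gets its own port and its own data directory on the same disk.

```ini
## Sketch: two storage processes sharing one NVMe drive.
## Paths and ports are examples, not taken from this cluster.
[fdbserver.4500]
class = storage
datadir = /mnt/nvme0/fdb/4500

[fdbserver.4501]
class = storage
datadir = /mnt/nvme0/fdb/4501
```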

Well, thank you for your input, but as far as I remember the Kubernetes FoundationDB operator doesn't support multiple storage processes on the same drive. One thing that did help with this situation was removing the knob_disable_posix_kernel_aio=1 parameter: the CPU load on the storage processes is now around 30-50% with the same workload.