I noticed that on my cluster some of the stateless processes (all of the affected ones have the proxy role) are at 100% CPU usage. All the others (tlog and storage classes) average between 20% and 50% CPU.

So my first thought was: “I have plenty of standby stateless processes, so I am going to roughly double the number of proxies and resolve the issue!” After changing the number of proxies, the only effect I observed was increased network upload and download traffic. The aim, lower CPU usage on the proxies and better cluster performance, was not achieved.

The proxies have been at 100% CPU usage for a long time, and I have not found a solution.

My questions:

  • Do proxies really scale?
  • How can I decrease the proxies' CPU usage and get better performance (lower latency)?

My 8-node FDB cluster, installed on bare metal, has the following configuration:

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Exclusions             - 208 (type `exclude' for details)
  Desired Proxies        - 3
  Desired Resolvers      - 1
  Desired Logs           - 24
  Usable Regions         - 1

  FoundationDB processes - 192
  Zones                  - 8
  Machines               - 8
  Memory availability    - 5.0 GB per process on machine with least available
  Retransmissions rate   - 10 Hz
  Fault Tolerance        - 2 zones
  Server time            - 10/10/22 18:42:27

  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 654.778 GB
  Disk space used        - 2.556 TB

Operating space:
  Storage server         - 1503.8 GB free on most full server
  Log server             - 885.8 GB free on most full server

  Read rate              - 48058 Hz
  Write rate             - 30617 Hz
  Transactions started   - 56685 Hz
  Transactions committed - 15567 Hz
  Conflict rate          - 2 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Server hardware configuration (every single server has the same configuration and the same number of fdb processes):

  • CPU: 32 HT (16 cores x 2.10GHz)
  • Memory: 128GB RAM
  • Storage: 1x NVMe disk for tlog processes, 1x NVMe disk for storage server processes
  • Network: 10G
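
For reference, the figures above can be turned into a few per-machine and per-proxy ratios. This is only a back-of-the-envelope sketch: it assumes processes and tlogs are spread evenly across the 8 machines and that the "Desired Proxies - 3" value in the status output is the one in effect when the rates were sampled. The variable names are made up for illustration.

```python
# Figures copied from the status output and the hardware list above.
machines = 8
fdb_processes = 192
desired_logs = 24
desired_proxies = 3            # assumption: the value in effect when sampled
hw_threads_per_machine = 32    # 16 cores with hyper-threading
tx_started_hz = 56685
tx_committed_hz = 15567

# Assumes an even spread of processes and tlogs across machines.
procs_per_machine = fdb_processes / machines
hw_threads_per_proc = hw_threads_per_machine / procs_per_machine
tlogs_per_tlog_disk = desired_logs / machines

# Average load per proxy if work is shared evenly between proxies.
started_per_proxy_hz = tx_started_hz / desired_proxies
committed_per_proxy_hz = tx_committed_hz / desired_proxies

print(f"processes per machine:       {procs_per_machine:.0f}")
print(f"hardware threads per process: {hw_threads_per_proc:.2f}")
print(f"tlogs per tlog NVMe disk:    {tlogs_per_tlog_disk:.0f}")
print(f"tx started per proxy:        {started_per_proxy_hz:.0f} Hz")
print(f"tx committed per proxy:      {committed_per_proxy_hz:.0f} Hz")
```

Since an fdbserver process does most of its work on a single thread, a proxy at 100% CPU is effectively using one full core, and with 24 processes on 32 hardware threads there is little headroom per process.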

A couple things come to mind:

It’s hard to say exactly why this is happening, but it does make sense that adding proxies might not reduce CPU usage. I’m not sure what to recommend. I would consider upgrading the cluster to 7.0+ and/or experimenting with fewer tlogs (and, somewhat counterintuitively, fewer proxies).
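
When experimenting with different tlog and proxy counts, it helps to compare CPU usage per role before and after each change. The sketch below groups per-process CPU by role from the machine-readable status. It assumes the layout `cluster.processes.<id>.roles[].role` and `cluster.processes.<id>.cpu.usage_cores`, which is how I understand the `status json` output to be structured; check it against your version. The sample data and function names are made up for illustration.

```python
from collections import defaultdict


def cpu_by_role(status):
    """Group per-process CPU usage (in cores) by role name."""
    by_role = defaultdict(list)
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        usage = proc.get("cpu", {}).get("usage_cores")
        if usage is None:
            continue  # process did not report CPU usage
        for role in proc.get("roles", []):
            by_role[role.get("role", "unknown")].append(usage)
    return dict(by_role)


def summarize(by_role):
    """Return {role: (process count, max cores, mean cores)}."""
    return {
        role: (len(vals), max(vals), sum(vals) / len(vals))
        for role, vals in by_role.items()
    }


# Hypothetical sample shaped like the assumed status layout.
sample_status = {
    "cluster": {
        "processes": {
            "p1": {"cpu": {"usage_cores": 1.0}, "roles": [{"role": "proxy"}]},
            "p2": {"cpu": {"usage_cores": 0.5}, "roles": [{"role": "proxy"}]},
            "p3": {"cpu": {"usage_cores": 0.25}, "roles": [{"role": "log"}]},
            "p4": {"cpu": {"usage_cores": 0.5}, "roles": [{"role": "storage"}]},
            "p5": {"roles": []},  # idle stateless process, no CPU reported
        }
    }
}

summary = summarize(cpu_by_role(sample_status))
for role, (count, peak, mean) in sorted(summary.items()):
    print(f"{role}: n={count} max={peak:.2f} mean={mean:.2f} cores")
```

To run it against a real cluster, save the output of `fdbcli --exec 'status json'` to a file, load it with `json.load`, and pass the resulting dict to `cpu_by_role`.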