ssd-rocksdb-v1 storage engine runs out of memory

Hi, while testing writes with the YCSB tool against the ssd-rocksdb-v1 storage engine, an OOM occurred.
The cluster deployment is as follows:

  • FDB version 7.1.25
  • Cluster size: 18 nodes in total, configured in three_datacenter mode (3 DCs), with 6 nodes per DC.
  • Each node has 12 SSDs: 8 are used for storage services, 1 for log services, and 3 for stateless services.
  • memory is set to 10GiB and cache-memory to 4GiB.
  • The cluster is deployed on K8S; each fdbserver process runs in its own pod with a memory limit of 18G.

When the OOM occurred, the OS log showed that the fdbserver process had used 18GB of memory, which exceeds memory + cache-memory.

fdbcli status:

fdb> status details 

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - three_datacenter
  Storage engine         - ssd-rocksdb-v1
  Coordinators           - 7
  Desired Logs           - 12
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 216
  Zones                  - 18
  Machines               - 18
  Memory availability    - 10.0 GB per process on machine with least available
  Retransmissions rate   - 57 Hz
  Fault Tolerance        - 3 machines
  Server time            - 01/30/23 05:43:34

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.409 GB
  Sum of key-value sizes - 182.220 GB
  Disk space used        - 1.211 TB

Operating space:
  Storage server         - 840.2 GB free on most full server
  Log server             - 849.5 GB free on most full server

Workload:
  Read rate              - 1599 Hz
  Write rate             - 20 Hz
  Transactions started   - 5 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.181.159.41:5500     (  2% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.41:5504     (  4% cpu;  3% machine; 0.081 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.41:5508     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.181.159.41:5512     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.41:5516     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.41:5520     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.41:5524     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.41:5528     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.41:6500     (  2% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.41:7500     (  0% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.41:7501     (  1% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.41:7502     (  2% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.46:5500     (  1% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.46:5504     (  9% cpu;  6% machine; 0.152 Gbps; 13% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.46:5508     (  2% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.181.159.46:5512     (  4% cpu;  6% machine; 0.152 Gbps;  1% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.46:5516     (  2% cpu;  6% machine; 0.152 Gbps;  1% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.46:5520     ( 30% cpu;  6% machine; 0.152 Gbps; 31% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.46:5524     (  9% cpu;  6% machine; 0.152 Gbps; 13% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.46:5528     (  1% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.46:6500     (  2% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.46:7500     (  2% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.46:7501     (  0% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.46:7502     (  0% cpu;  6% machine; 0.152 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.47:5500     (  6% cpu;  6% machine; 0.006 Gbps;  6% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.47:5504     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.47:5508     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.47:5512     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.47:5516     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.47:5520     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.47:5524     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.47:5528     (  1% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 2.8 GB / 10.0 GB RAM  )
  10.181.159.47:6500     (  2% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.47:7500     (  3% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.47:7501     (  0% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.47:7502     (  0% cpu;  6% machine; 0.006 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.48:5500     ( 49% cpu;  6% machine; 0.207 Gbps; 55% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.48:5504     (  5% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.48:5508     (  1% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.181.159.48:5512     (  1% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.48:5516     (  1% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.48:5520     (  2% cpu;  6% machine; 0.207 Gbps;  2% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.48:5524     (  1% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.48:5528     (  1% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.48:6500     (  2% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.48:7500     (  8% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.48:7501     (  0% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.48:7502     (  0% cpu;  6% machine; 0.207 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.64:5500     ( 33% cpu;  7% machine; 0.172 Gbps; 35% disk IO; 0.6 GB / 10.0 GB RAM  )
  10.181.159.64:5504     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.181.159.64:5508     ( 36% cpu;  7% machine; 0.172 Gbps; 33% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.181.159.64:5512     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.181.159.64:5516     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.64:5520     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.64:5524     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.64:5528     (  1% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.64:6500     (  0% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.64:7500     (  2% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.64:7501     (  0% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.64:7502     (  0% cpu;  7% machine; 0.172 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.65:5500     ( 12% cpu; 11% machine; 0.292 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.65:5504     (  2% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.65:5508     (  3% cpu; 11% machine; 0.292 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.65:5512     (  2% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.65:5516     (  9% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.65:5520     (  2% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.65:5524     (  2% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.65:5528     (  2% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.181.159.65:6500     (  0% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.181.159.65:7500     (  5% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.65:7501     (  1% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.181.159.65:7502     (  0% cpu; 11% machine; 0.292 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.44:5500     (  8% cpu;  9% machine; 0.224 Gbps;  3% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.195.152.44:5504     (  1% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.44:5508     (  1% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.44:5512     (  1% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.44:5516     (  1% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.44:5520     (  3% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.44:5524     (  4% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.44:5528     (  2% cpu;  9% machine; 0.224 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.44:6500     (  2% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.44:7500     (  0% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.44:7501     (  0% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.44:7502     (  0% cpu;  9% machine; 0.224 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.45:5500     ( 34% cpu; 10% machine; 0.136 Gbps; 36% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.45:5504     (  1% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.45:5508     (  1% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.45:5512     (  1% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.45:5516     (  2% cpu; 10% machine; 0.136 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.45:5520     (  1% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.45:5524     (  4% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.45:5528     (  1% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.45:6500     (  2% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.45:7500     (  0% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.45:7501     (  0% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.45:7502     (  0% cpu; 10% machine; 0.136 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.47:5500     (  2% cpu;  6% machine; 0.144 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.47:5504     (  4% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.47:5508     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.47:5512     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.47:5516     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.47:5520     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.47:5524     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.47:5528     (  1% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.47:6500     (  0% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.47:7500     (  0% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.47:7501     (  0% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.47:7502     (  0% cpu;  6% machine; 0.144 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.48:5500     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.48:5504     (  2% cpu; 10% machine; 0.229 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.48:5508     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.48:5512     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.48:5516     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.48:5520     (  4% cpu; 10% machine; 0.229 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.48:5524     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.195.152.48:5528     (  1% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.48:6500     (  0% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.48:7500     (  0% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.48:7501     (  0% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.48:7502     (  0% cpu; 10% machine; 0.229 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.50:5500     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.50:5504     (  4% cpu; 11% machine; 0.149 Gbps;  1% disk IO; 2.8 GB / 10.0 GB RAM  )
  10.195.152.50:5508     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.50:5512     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.50:5516     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.50:5520     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.50:5524     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.152.50:5528     (  1% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.50:6500     (  2% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.50:7500     (  0% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.50:7501     (  0% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.50:7502     (  0% cpu; 11% machine; 0.149 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.51:5500     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.51:5504     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.51:5508     (  2% cpu; 13% machine; 0.014 Gbps;  1% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.195.152.51:5512     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.51:5516     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.51:5520     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.51:5524     (  2% cpu; 13% machine; 0.014 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.152.51:5528     (  4% cpu; 13% machine; 0.014 Gbps;  3% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.152.51:6500     (  2% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.51:7500     (  0% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.152.51:7501     (  1% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.152.51:7502     (  0% cpu; 13% machine; 0.014 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.44:5500     ( 49% cpu;  9% machine; 0.158 Gbps; 56% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:5504     (  4% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:5508     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.154.44:5512     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.8 GB / 10.0 GB RAM  )
  10.195.154.44:5516     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:5520     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:5524     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:5528     (  1% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.44:6500     (  2% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.154.44:7500     (  0% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.44:7501     (  0% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.44:7502     (  0% cpu;  9% machine; 0.158 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.45:5500     (  1% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.45:5504     (  7% cpu;  7% machine; 0.157 Gbps;  3% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.45:5508     (  4% cpu;  7% machine; 0.157 Gbps;  1% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.195.154.45:5512     (  1% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.45:5516     (  1% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.45:5520     (  3% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.45:5524     (  2% cpu;  7% machine; 0.157 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.45:5528     ( 13% cpu;  7% machine; 0.157 Gbps; 12% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.45:6500     (  2% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.154.45:7500     (  0% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.45:7501     (  0% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.45:7502     (  0% cpu;  7% machine; 0.157 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.46:5500     (  1% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.46:5504     (  1% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.3 GB / 10.0 GB RAM  )
  10.195.154.46:5508     (  3% cpu;  4% machine; 0.270 Gbps;  6% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.46:5512     (  1% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.46:5516     (  1% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.46:5520     (  1% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.46:5524     (  4% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.46:5528     (  4% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 2.4 GB / 10.0 GB RAM  )
  10.195.154.46:6500     (  0% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.154.46:7500     (  0% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.46:7501     (  0% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.46:7502     (  0% cpu;  4% machine; 0.270 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.47:5500     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.47:5504     ( 14% cpu;  9% machine; 0.013 Gbps; 13% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.47:5508     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.47:5512     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.47:5516     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.47:5520     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.47:5524     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.47:5528     (  1% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 2.7 GB / 10.0 GB RAM  )
  10.195.154.47:6500     (  2% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.154.47:7500     (  0% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.47:7501     (  0% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.47:7502     (  0% cpu;  9% machine; 0.013 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.48:5500     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.48:5504     (  2% cpu;  8% machine; 0.003 Gbps;  1% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.48:5508     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.48:5512     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.48:5516     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.48:5520     (  7% cpu;  8% machine; 0.003 Gbps; 10% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.48:5524     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.48:5528     (  1% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.48:6500     (  2% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
  10.195.154.48:7500     (  0% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.48:7501     (  0% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.48:7502     (  0% cpu;  8% machine; 0.003 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.50:5500     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.50:5504     (  4% cpu;  7% machine; 0.288 Gbps;  1% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.50:5508     ( 36% cpu;  7% machine; 0.288 Gbps; 32% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.50:5512     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.50:5516     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.50:5520     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.50:5524     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.195.154.50:5528     (  1% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 2.5 GB / 10.0 GB RAM  )
  10.195.154.50:6500     (  0% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.50:7500     (  0% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.50:7501     (  0% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )
  10.195.154.50:7502     (  0% cpu;  7% machine; 0.288 Gbps;  0% disk IO; 0.1 GB / 10.0 GB RAM  )

Coordination servers:
  10.181.159.41:7502  (reachable)
  10.181.159.47:6500  (reachable)
  10.181.159.65:7501  (reachable)
  10.195.152.48:5500  (reachable)
  10.195.152.50:5508  (reachable)
  10.195.152.51:7501  (reachable)
  10.195.154.46:5520  (reachable)

Client time: 01/30/23 05:44:29

How the storage server is run:

UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  9 09:10 ?        00:04:29 fdbserver --memory 10GiB --cache-memory 4GiB --seed-connection-string docker:docker@10.181.159.41:5500 --cluster-file /etc/foundationdb/fdb.cluster --listen-address 0.0.0.0:5506 --public-address 10.195.152.51:5506 --locality-diskid sdb --datadir /var/fdb/data --logdir /var/fdb/logs --locality-machineid hostname-01 --locality-zoneid hostname-01 --class storage --locality-dcid ningbo1

OS log:

Jan 30 15:55:51 hostname-01 kernel: rocksdb:low invoked oom-killer: gfp_mask=0x6201ca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), nodemask=(null), order=0, oom_score_adj=999
Jan 30 15:55:51 hostname-01 kernel: rocksdb:low cpuset=0e78f7a7e07b1325f491dd7f2e39462ba3775d5cf9585d1a2961e828ce0a27f5 mems_allowed=0-1
Jan 30 15:55:51 hostname-01 kernel: CPU: 58 PID: 454919 Comm: rocksdb:low Kdump: loaded Not tainted 4.19.25-206.el7_6.bclinux.x86_64 #1
Jan 30 15:55:51 hostname-01 kernel: Hardware name: ZTE R5500 G4/R5500G4, BIOS 03.15.0100_70562 03/04/2020
Jan 30 15:55:51 hostname-01 kernel: Call Trace:
Jan 30 15:55:51 hostname-01 kernel:  dump_stack+0x5a/0x73
Jan 30 15:55:51 hostname-01 kernel:  dump_header+0x77/0x29c
Jan 30 15:55:51 hostname-01 kernel:  ? mem_cgroup_scan_tasks+0x8f/0xe0
Jan 30 15:55:51 hostname-01 kernel:  oom_kill_process+0x25e/0x290
Jan 30 15:55:51 hostname-01 kernel:  out_of_memory+0x134/0x4b0
Jan 30 15:55:51 hostname-01 kernel:  mem_cgroup_out_of_memory+0x49/0x80
Jan 30 15:55:51 hostname-01 kernel:  try_charge+0x6f2/0x760
Jan 30 15:55:51 hostname-01 kernel:  mem_cgroup_try_charge+0x6f/0x220
Jan 30 15:55:51 hostname-01 kernel:  __add_to_page_cache_locked+0x146/0x260
Jan 30 15:55:51 hostname-01 kernel:  add_to_page_cache_lru+0x49/0xd0
Jan 30 15:55:51 hostname-01 kernel:  pagecache_get_page+0x7e/0x270
Jan 30 15:55:51 hostname-01 kernel:  grab_cache_page_write_begin+0x1f/0x40
Jan 30 15:55:51 hostname-01 kernel:  ext4_da_write_begin+0xdf/0x4f0 [ext4]
Jan 30 15:55:51 hostname-01 kernel:  generic_perform_write+0xc2/0x1c0
Jan 30 15:55:51 hostname-01 kernel:  __generic_file_write_iter+0x184/0x1c0
Jan 30 15:55:51 hostname-01 kernel:  ext4_file_write_iter+0xc6/0x410 [ext4]
Jan 30 15:55:51 hostname-01 kernel:  ? __switch_to_asm+0x40/0x70
Jan 30 15:55:51 hostname-01 kernel:  __vfs_write+0x112/0x1a0
Jan 30 15:55:51 hostname-01 kernel:  vfs_write+0xad/0x1a0
Jan 30 15:55:51 hostname-01 kernel:  ksys_write+0x52/0xc0
Jan 30 15:55:51 hostname-01 kernel:  do_syscall_64+0x5b/0x170
Jan 30 15:55:51 hostname-01 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 30 15:55:51 hostname-01 kernel: RIP: 0033:0x7fe8700726fd
Jan 30 15:55:51 hostname-01 kernel: Code: cd 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 4e fd ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 97 fd ff ff 48 89 d0 48 83 c4 08 48 3d 01
Jan 30 15:55:51 hostname-01 kernel: RSP: 002b:00007fe86a1f53a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jan 30 15:55:51 hostname-01 kernel: RAX: ffffffffffffffda RBX: 00007fe86a1f54b0 RCX: 00007fe8700726fd
Jan 30 15:55:51 hostname-01 kernel: RDX: 00000000000ffa8d RSI: 00007fe7ba334000 RDI: 000000000000001d
Jan 30 15:55:51 hostname-01 kernel: RBP: 00007fe86a1f5400 R08: 0000000000000000 R09: 0000000000000000
Jan 30 15:55:51 hostname-01 kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007fe7ba334000
Jan 30 15:55:51 hostname-01 kernel: R13: 00000000000ffa8d R14: 00000000000ffa8d R15: 00007fe856aa4c50
Jan 30 15:55:51 hostname-01 kernel: Task in /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11/0e78f7a7e07b1325f491dd7f2e39462ba3775d5cf9585d1a2961e828ce0a27f5 killed as a result of limit of /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11
Jan 30 15:55:51 hostname-01 kernel: memory: usage 17578124kB, limit 17578124kB, failcnt 74
Jan 30 15:55:51 hostname-01 kernel: memory+swap: usage 17578124kB, limit 9007199254740988kB, failcnt 0
Jan 30 15:55:51 hostname-01 kernel: kmem: usage 468896kB, limit 9007199254740988kB, failcnt 0
Jan 30 15:55:51 hostname-01 kernel: Memory cgroup stats for /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jan 30 15:55:51 hostname-01 kernel: Memory cgroup stats for /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11/b01c03eb030730627899d164596534e1bdebeb39c0ce1f67d6bb97861ae4f10f: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jan 30 15:55:51 hostname-01 kernel: Memory cgroup stats for /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11/0d6a6ea3b8882339088e3b23159b54352315385c2b57b12e5fbaf41d81d7e677: cache:6408KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:616KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jan 30 15:55:51 hostname-01 kernel: Memory cgroup stats for /kubepods/burstable/podfaf6d591-a4a4-4cd1-aa8d-d4906f29ec11/0e78f7a7e07b1325f491dd7f2e39462ba3775d5cf9585d1a2961e828ce0a27f5: cache:15723104KB rss:1369772KB rss_huge:0KB shmem:72KB mapped_file:0KB dirty:14388KB writeback:1452KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jan 30 15:55:51 hostname-01 kernel: Tasks state (memory values in pages):
Jan 30 15:55:51 hostname-01 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jan 30 15:55:51 hostname-01 kernel: [ 492467]     0 492467      242        1    28672        0          -998 pause
Jan 30 15:55:51 hostname-01 kernel: [ 454463]     0 454463   769526   354219  5349376        0           999 fdbserver
Jan 30 15:55:51 hostname-01 kernel: Memory cgroup out of memory: Kill process 454463 (fdbserver) score 1079 or sacrifice child
Jan 30 15:55:51 hostname-01 kernel: Killed process 454463 (fdbserver) total-vm:3078104kB, anon-rss:1371748kB, file-rss:45128kB, shmem-rss:0kB
Jan 30 15:55:51 hostname-01 kernel: oom_reaper: reaped process 454463 (fdbserver), now anon-rss:0kB, file-rss:20kB, shmem-rss:0kB
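
To make the log above easier to read, here is a minimal Python sketch that splits the container's cgroup stats line into its counters and compares the page cache and RSS figures with the cgroup limit. The function name `parse_cgroup_stats` is made up for illustration; the numbers are copied from the log lines above (only the long cgroup path is left out).

```python
import re

# Counters copied from the "Memory cgroup stats" line of the fdbserver
# container in the OOM log above (the cgroup path itself is omitted).
LINE = (
    "cache:15723104KB rss:1369772KB rss_huge:0KB shmem:72KB mapped_file:0KB "
    "dirty:14388KB writeback:1452KB swap:0KB"
)

def parse_cgroup_stats(line):
    """Return {counter_name: kilobytes} for every `name:<n>KB` pair in the line."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)KB", line)}

stats = parse_cgroup_stats(LINE)
limit_kb = 17578124  # from "memory: usage 17578124kB, limit 17578124kB"

for name in ("cache", "rss"):
    gib = stats[name] / 1024**2          # KB -> GiB
    share = 100 * stats[name] / limit_kb  # percentage of the cgroup limit
    print(f"{name:>5}: {gib:.2f} GiB ({share:.1f}% of limit)")
```

For this log, the split comes out to roughly 15 GiB of page cache and about 1.3 GiB of RSS charged to the container.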

I know the rocksdb storage engine is still an experimental feature, so what can we do to improve it?

@jzhou I have the same issue, any advice?

It’s hard to tell what happened without detailed information. Generally we should look at traffic patterns and FDB logs to figure out the reason. For instance, is there a lot of read traffic or write traffic? How much memory is RocksDB using? Is there a core file for debugging?

From the FDB logs, we can look at ProcessMetrics and MemoryMetrics events, e.g., in Splunk: Type=MemoryMetrics | timechart max(TotalMemory*). We may also look at GetMagazineSample and HugeArenaSample for memory allocation traces. An OOM can happen quite quickly, within a few milliseconds, so the logged events only give a rough idea.
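
If Splunk is not available, something similar can be done directly on the trace files. The sketch below is only an illustration: it assumes the XML trace format with one `<Event .../>` element per line, it keeps every attribute whose name starts with `TotalMemory`, and the function name `max_total_memory` and the sample events are made up.

```python
import re
from collections import defaultdict

# Matches name="value" attribute pairs inside an <Event .../> element.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def max_total_memory(lines):
    """Return {attribute: max value} over all MemoryMetrics events.

    Assumption: one XML <Event .../> element per line. Only attributes
    whose names start with "TotalMemory" are tracked.
    """
    peaks = defaultdict(int)
    for line in lines:
        attrs = dict(ATTR.findall(line))
        if attrs.get("Type") != "MemoryMetrics":
            continue
        for name, value in attrs.items():
            if name.startswith("TotalMemory") and value.isdigit():
                peaks[name] = max(peaks[name], int(value))
    return dict(peaks)

# Made-up sample events, only to show the expected shape of the input.
sample = [
    '<Event Severity="10" Time="1.0" Type="MemoryMetrics" TotalMemory64="1024" TotalMemory4096="8192" />',
    '<Event Severity="10" Time="2.0" Type="ProcessMetrics" Memory="123" />',
    '<Event Severity="10" Time="6.0" Type="MemoryMetrics" TotalMemory64="4096" TotalMemory4096="8192" />',
]
print(max_total_memory(sample))
```

For a real trace file, pass the open file object instead of `sample`, since the function only needs an iterable of lines.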

Finally, Yao discovered recently that RocksDB can OOM with about 2k clearRange operations, which may or may not be the same as your workload.

FoundationDB 7.1.31 ships with a newer RocksDB: 7.10.2 instead of 7.7.3. So I’d advise upgrading FoundationDB before investigating further: this issue may already have been fixed.

Stateless services usually do not use disk storage at all (or they use a little for coordinators), so allocating 3 SSDs to them seems unnecessary.

BTW, the RocksDB engine in 7.1 and 7.2 is not production-ready yet. The team has been actively fixing an OOM issue related to ClearRange and another corruption bug.

I observe similar behaviour in 7.1.33: after a few days, storage servers using the ssd-rocksdb-v1 storage engine consume more memory than memory + cache-memory.
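
One rough way to track this kind of growth over several days is to capture `status details` periodically and pull out the per-process RAM figures. The sketch below assumes the "Process performance details" line format shown earlier in this thread; the function names and the threshold are made up for illustration, and the two sample lines are copied from that output.

```python
import re

# Matches lines such as:
#   10.181.159.41:5500 ( 2% cpu; ...; 2.6 GB / 10.0 GB RAM )
PROC = re.compile(
    r"^\s*(?P<addr>\d+\.\d+\.\d+\.\d+:\d+)\s+\(.*?"
    r"(?P<used>[\d.]+) GB / (?P<limit>[\d.]+) GB RAM"
)

def memory_by_process(status_text):
    """Return {address: (used_gb, limit_gb)} parsed from `status details` text."""
    result = {}
    for line in status_text.splitlines():
        match = PROC.match(line)
        if match:
            result[match["addr"]] = (float(match["used"]), float(match["limit"]))
    return result

def over_threshold(usage, fraction):
    """Return the processes whose used memory is at least `fraction` of the limit."""
    return {addr: used for addr, (used, limit) in usage.items() if used >= fraction * limit}

# Two lines copied from the status output earlier in this thread.
sample = """\
  10.181.159.41:5500     (  2% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 2.6 GB / 10.0 GB RAM  )
  10.181.159.41:6500     (  2% cpu;  3% machine; 0.081 Gbps;  0% disk IO; 0.2 GB / 10.0 GB RAM  )
"""
usage = memory_by_process(sample)
print(over_threshold(usage, 0.25))
```

Keep in mind that these figures are what fdbserver reports for itself, which is not necessarily the same as what the pod's cgroup is charged.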

Can I ask whether this issue has been resolved? What are your recommendations on using the ssd-rocksdb-v1 storage engine?

Thanks