What can cause proxy commit batch memory to be exceeded?

tuk · August 19, 2020, 2:04pm

In one of our 10-node clusters running FoundationDB 6.2.20, the database has become unavailable. In the proxy logs I am seeing a lot of errors like the one below:

trace.10.196.78.225.4501.1597232984.Qwemy7.1.72.xml:<Event Severity="30" Time="1597826631.057304" Type="ProxyCommitBatchMemoryThresholdExceeded" ID="0000000000000000" SuppressedEventCount="5093" MemBytesCount="429482900" MemLimit="429496729" Machine="10.196.78.225:4501" LogGroup="default" Roles="CD,MP" />
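To make the numbers in that event easier to read, here is a minimal Python sketch that parses the quoted line and reports how close MemBytesCount is to MemLimit. It assumes the line came from grep, so the trace file name and a colon precede the XML. The helper name parse_event is made up for illustration and is not part of any FoundationDB tooling.

```python
import xml.etree.ElementTree as ET

# Trace line exactly as quoted above; grep prepends "<file name>:" to the event.
line = ('trace.10.196.78.225.4501.1597232984.Qwemy7.1.72.xml:'
        '<Event Severity="30" Time="1597826631.057304" '
        'Type="ProxyCommitBatchMemoryThresholdExceeded" ID="0000000000000000" '
        'SuppressedEventCount="5093" MemBytesCount="429482900" '
        'MemLimit="429496729" Machine="10.196.78.225:4501" '
        'LogGroup="default" Roles="CD,MP" />')


def parse_event(grep_line):
    """Hypothetical helper: drop the text before the first '<' and
    return the event's XML attributes as a dict of strings."""
    xml_part = grep_line[grep_line.index('<'):]
    return ET.fromstring(xml_part).attrib


event = parse_event(line)
used = int(event["MemBytesCount"])
limit = int(event["MemLimit"])
headroom = limit - used  # bytes left before the limit is reached

print(f"type={event['Type']}")
print(f"used={used} limit={limit} headroom={headroom} bytes")
print(f"limit is about {limit / 2**20:.1f} MiB")
```

For this event, the reported limit is roughly 410 MiB and the batch memory count sits only a few kilobytes below it.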

I have placed all the logs here.

Our foundationdb.conf is as follows:

[fdbmonitor]
user = ubuntu
group = ubuntu

[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb

[fdbserver.4500]

[fdbserver.4501]
class = stateless

[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

[backup_agent.1]

The fdbcli status details output is below:

ubuntu@platform1:~$ fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to commit after 5 seconds.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 1
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 20
  Zones                  - 10
  Machines               - 10
  Memory availability    - 10.8 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 1 machine
  Server time            - 08/19/20 09:37:27

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 0.161 GB
  Sum of key-value sizes - 349.031 GB
  Disk space used        - 842.336 GB

Operating space:
  Storage server         - 84.4 GB free on most full server
  Log server             - 91.2 GB free on most full server

Workload:
  Read rate              - 842 Hz
  Write rate             - 0 Hz
  Transactions started   - 668 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 87 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.196.78.151:4500     (  2% cpu;  9% machine; 0.009 Gbps;  0% disk IO; 5.1 GB / 10.8 GB RAM  )
  10.196.78.151:4501     (  0% cpu;  9% machine; 0.009 Gbps;  0% disk IO; 0.1 GB / 10.8 GB RAM  )
  10.196.78.152:4500     (  7% cpu;  4% machine; 0.008 Gbps;  6% disk IO; 4.2 GB / 11.4 GB RAM  )
  10.196.78.152:4501     (  2% cpu;  4% machine; 0.008 Gbps;  6% disk IO; 0.1 GB / 11.4 GB RAM  )
  10.196.78.153:4500     (  3% cpu; 10% machine; 0.014 Gbps;  0% disk IO; 3.7 GB / 11.8 GB RAM  )
  10.196.78.153:4501     (  2% cpu; 10% machine; 0.014 Gbps;  0% disk IO; 0.1 GB / 11.8 GB RAM  )
  10.196.78.154:4500     (  2% cpu;  5% machine; 0.012 Gbps;  0% disk IO; 3.6 GB / 10.9 GB RAM  )
  10.196.78.154:4501     (  1% cpu;  5% machine; 0.012 Gbps;  0% disk IO; 0.3 GB / 10.9 GB RAM  )
  10.196.78.155:4500     (  2% cpu;  3% machine; 0.011 Gbps;  0% disk IO; 5.0 GB / 11.0 GB RAM  )
  10.196.78.155:4501     (  0% cpu;  3% machine; 0.011 Gbps;  0% disk IO; 0.1 GB / 11.0 GB RAM  )
  10.196.78.160:4500     (  3% cpu;  2% machine; 0.010 Gbps;  0% disk IO; 4.6 GB / 12.0 GB RAM  )
  10.196.78.160:4501     (  1% cpu;  2% machine; 0.010 Gbps;  0% disk IO; 0.2 GB / 12.0 GB RAM  )
  10.196.78.161:4500     (  2% cpu;  5% machine; 0.006 Gbps;  1% disk IO; 4.6 GB / 12.0 GB RAM  )
  10.196.78.161:4501     (  0% cpu;  5% machine; 0.006 Gbps;  1% disk IO; 0.1 GB / 12.0 GB RAM  )
  10.196.78.162:4500     (  3% cpu;  5% machine; 0.014 Gbps;  0% disk IO; 3.7 GB / 11.0 GB RAM  )
  10.196.78.162:4501     (  1% cpu;  5% machine; 0.014 Gbps;  0% disk IO; 0.2 GB / 11.0 GB RAM  )
  10.196.78.225:4500     (  8% cpu;  6% machine; 0.018 Gbps;  3% disk IO; 4.9 GB / 11.7 GB RAM  )
  10.196.78.225:4501     ( 14% cpu;  6% machine; 0.018 Gbps;  3% disk IO; 2.6 GB / 11.7 GB RAM  )
  10.196.78.226:4500     (  2% cpu;  2% machine; 0.003 Gbps;  0% disk IO; 3.3 GB / 11.7 GB RAM  )
  10.196.78.226:4501     (  1% cpu;  2% machine; 0.003 Gbps;  0% disk IO; 0.3 GB / 11.7 GB RAM  )

Coordination servers:
  10.196.78.154:4501  (reachable)
  10.196.78.225:4501  (reachable)
  10.196.78.226:4501  (reachable)

Client time: 08/19/20 09:37:18

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

fdb>

From the logs, the proxy process appears to be using about 4 GB of memory. What could be causing the proxy to use this much memory?
The logs say the proxy is using about 4 GB, but Linux shows it using about 2.6 GB. Is this expected?

OS - Ubuntu 16.04.6

@alexmiller @ajbeamon - Can you suggest something that could be causing this?

gaurav · August 19, 2020, 2:36pm

Just checking: could this be related in some way to a recent change made in this area here? This PR was merged in 6.2.19, and I believe we have not encountered this error on earlier FDB versions.

tuk · August 25, 2020, 12:31pm

After restarting the FoundationDB processes on all the nodes, I have not seen the proxy use much memory for the last 6 days (it is under 200 MB now), with almost the same load.

I searched the forums and did not find any discussion of ProxyCommitBatchMemoryThresholdExceeded. Given that each transaction has a limit of < 100KB, can someone suggest what can cause memory pressure in the proxy?
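For scale, a back-of-the-envelope calculation in Python: how many transactions of the size mentioned above would fit under the MemLimit reported in the trace event. This is only arithmetic under stated assumptions. It treats 100KB as 100,000 bytes, assumes every transaction is at that maximum, and ignores any per-transaction overhead inside the proxy.

```python
# MemLimit value taken from the ProxyCommitBatchMemoryThresholdExceeded event.
MEM_LIMIT_BYTES = 429_496_729

# Assumption: "100KB" read as 100,000 bytes, every transaction at that size.
ASSUMED_TXN_BYTES = 100_000

# Number of such transactions that fit under the limit at the same time.
txns_to_fill_limit = MEM_LIMIT_BYTES // ASSUMED_TXN_BYTES

print(f"about {txns_to_fill_limit} transactions of "
      f"{ASSUMED_TXN_BYTES} bytes fit under the limit")
```

Under those assumptions, it would take on the order of a few thousand maximum-size transactions held in commit batches at once to reach the limit.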

status json of current cluster state.
