Storage server Out of memory error

We are running version 6.2.30 in triple replication mode with about 2000 clients. Three storage servers (always the same process IDs) grow to 16 GB of memory, which is the limit set in our config, while the other 33 stay at 3 GB or less.

Memory usage increases steadily from the moment the SS starts. The workload is roughly 200k reads/s and 2k writes/s. When the SS hits the 16 GB limit (after about 10 minutes of work), it logs:

<Event Severity="40" Time="1634070280.211636" DateTime="2021-10-12T20:24:40Z" Type="OutOfMemory" ID="0000000000000000" Message="Out of memory" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x19ce95e 0x19ce98c 0x19cea09 0x19afcb3 0x19b0041 0x19a144d 0x19a18f0 0x6c2006 0x19ebda7 0x19ed576 0x19ecc40 0x19ecd21 0x19ce95e 0x19ce98c 0x19cea09 0x19afcb3 0x19b0041 0x6e3d8c 0x1938053 0x1938360 0x19387da 0x193a8e8 0x193963b 0x193af8e 0x8018f0 0x1a2b270 0x6784f9 0x7feaff0c309b" Machine="10.64.3.103:4516" LogGroup="default" Roles="SS" />

Is this a bug or a client usage pattern issue?

I tried replacing one of the failing SS, and now a different SS is failing with a similar error, reporting:

<Event Severity="40" Time="1634082350.520501" DateTime="2021-10-12T23:45:50Z" Type="OutOfMemory" ID="0000000000000000" Message="Out of memory" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x19ed48c 0x19ecc40 0x19ecd21 0x19ce95e 0x19ce98c 0x19cea09 0x19afcb3 0x19b0041 0x6e3b6e 0x6e4116 0x1938053 0x1938360 0x19387da 0x193a8e8 0x193963b 0x193ab5f 0x8018f0 0x1a2b270 0x6784f9 0x7f00b758809b" Machine="10.64.3.104:4516" LogGroup="default" Roles="SS" />

What is strange now is that the memory used is far from the limit:

<Event Severity="10" Time="1634082348.104493" DateTime="2021-10-12T23:45:48Z" Type="MemoryMetrics" ID="0000000000000000" TotalMemory16="393216" ApproximateUnusedMemory16="0" ActiveThreads16="1" TotalMemory32="50855936" ApproximateUnusedMemory32="5898240" ActiveThreads32="1" TotalMemory64="137756672" ApproximateUnusedMemory64="25296896" ActiveThreads64="4" TotalMemory96="101687040" ApproximateUnusedMemory96="6027840" ActiveThreads96="1" TotalMemory128="524288" ApproximateUnusedMemory128="131072" ActiveThreads128="1" TotalMemory256="79429632" ApproximateUnusedMemory256="0" ActiveThreads256="1" TotalMemory512="393216" ApproximateUnusedMemory512="0" ActiveThreads512="1" TotalMemory1024="131072" ApproximateUnusedMemory1024="0" ActiveThreads1024="1" TotalMemory2048="131072" ApproximateUnusedMemory2048="0" ActiveThreads2048="1" TotalMemory4096="2148532224" ApproximateUnusedMemory4096="0" ActiveThreads4096="1" TotalMemory8192="4194304" ApproximateUnusedMemory8192="393216" ActiveThreads8192="1" HugeArenaMemory="58147" DCID="[not set]" ZoneID="2" MachineID="04" Machine="10.64.3.104:4516" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1634082348.104493" DateTime="2021-10-12T23:45:48Z" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00003" MbpsSent="96.9022" MbpsReceived="112.815" OutSegs="140224" RetransSegs="38" CPUSeconds="2.45425" TotalMemory="67557892096" CommittedMemory="34244521984" AvailableMemory="33313370112" DCID="[not set]" ZoneID="2" MachineID="04" Machine="10.64.3.104:4516" LogGroup="default" Roles="SS" TrackLatestType="Original" />
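For reference, here is a rough sketch of how the "far from the limit" figure can be checked from the MemoryMetrics event. It assumes the TotalMemoryN attributes are byte totals per allocator size class, which is my reading of the field names rather than something I have confirmed. The helper name `allocator_total_bytes` is made up for this example, and the sum would not include any memory allocated outside those size classes.

```python
import re

# TotalMemoryN attributes copied from the MemoryMetrics event above
# (all other attributes omitted for brevity).
EVENT = (
    'TotalMemory16="393216" TotalMemory32="50855936" TotalMemory64="137756672" '
    'TotalMemory96="101687040" TotalMemory128="524288" TotalMemory256="79429632" '
    'TotalMemory512="393216" TotalMemory1024="131072" TotalMemory2048="131072" '
    'TotalMemory4096="2148532224" TotalMemory8192="4194304"'
)


def allocator_total_bytes(event: str) -> int:
    """Sum every TotalMemory<N>="<value>" attribute found in a trace event line.

    Assumption: each value is a byte count for size class N. The plain
    TotalMemory="..." attribute of MachineMetrics has no numeric suffix,
    so the pattern below deliberately does not match it.
    """
    pairs = re.findall(r'\bTotalMemory(\d+)="(\d+)"', event)
    return sum(int(value) for _size_class, value in pairs)


total = allocator_total_bytes(EVENT)
print(f"{total} bytes = {total / 2**30:.2f} GiB")
# prints "2524028672 bytes = 2.35 GiB"
```

Under that assumption the size classes add up to about 2.35 GiB, most of it in the 4096 class, which is well below the 16 GB limit. The MachineMetrics event also reports roughly 33 GB of AvailableMemory on the host at the same timestamp.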