We have a cluster of 7 nodes with a total of 56 fdbserver processes, storing over 50 billion entries. After a clean restart of an fdbserver process, its memory usage exceeds 17 GB. If we set the memory limit in foundationdb.conf below 17 GB, the fdbserver processes cannot boot successfully.
Thanks for your reply. The storage engine is ssd-2; fdbcli --exec "status" gives us:
$ fdbcli
Using cluster file `fdb.cluster'.
The database is available.
Welcome to the fdbcli. For help, type `help'.
fdb> status
Using cluster file `fdb.cluster'.
Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 7
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 56
  Zones                  - 7
  Machines               - 7
  Memory availability    - 32.0 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 2 machines
  Server time            - 03/22/23 10:17:44

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 49.688 TB
  Disk space used        - 197.873 TB

Operating space:
  Storage server         - 3321.1 GB free on most full server
  Log server             - 3321.9 GB free on most full server

Workload:
  Read rate              - 345451 Hz
  Write rate             - 0 Hz
  Transactions started   - 337321 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 03/22/23 10:17:44
You have 7 machines holding 56 processes, so I assume that is 8 per machine. What is the configuration of your [fdbserver] section? https://apple.github.io/foundationdb/configuration.html#fdbserver-section
In particular, what is memory= set to? If it is defaulting to 8 GiB, then as far as I know you need at most 8 × 8 = 64 GiB per machine.
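For reference, a minimal sketch of what that section looks like in a stock foundationdb.conf; memory = 8GiB is the documented default, and the other lines are typical package defaults rather than anything taken from your setup:

  [fdbserver]
  command = /usr/sbin/fdbserver
  datadir = /var/lib/foundationdb/data/$ID
  logdir  = /var/log/foundationdb
  memory  = 8GiB    # per-process limit; 8 processes x 8 GiB = 64 GiB per machine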
In addition, your earlier message sounded as if the memory usage goes right back up after a reboot, even with no activity on the cluster. Your status output, though, indicates a fair amount of traffic. Does memory really climb that high as soon as the database is up, or only after it has absorbed enough traffic?
Clarifying these might help someone answer the question.
I’m not familiar with the tool you are using, but a “Memory” of 17 GB and a “Memory Used” of 52% would seem to indicate a 17 GB virtual memory footprint, of which about half is actually resident. Is that correct?
Virtual Memory will often be a factor of 2x or 3x larger than Resident memory, but Resident is what matters. In FDB 7.1, the malloc implementation was changed from glibc to jemalloc, which has a much larger virtual : resident ratio.
There are two fdbserver options that limit memory: --memory sets a resident memory limit, and --memory-vsize sets a virtual memory limit.
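As a sketch of how those could be combined (the values are placeholders, not recommendations; check fdbserver --help on your build for the exact flag spelling):

  /usr/sbin/fdbserver ... --memory 18GiB --memory-vsize 40GiB

That would cap resident memory at 18 GiB while leaving headroom for jemalloc's larger virtual footprint.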
Each machine in our cluster is equipped with 512 GB of memory.
The memory usage increases significantly as soon as the database is up, and while it is booting, the CPU usage of each fdbserver process reaches approximately 100%. Once the cluster is back in a healthy state, there is no traffic from outside clients and no internal traffic between fdbserver processes.
The tool reports resident memory usage, and /proc likewise shows the resident memory usage of fdbserver above 17 GB. On one node, executing the following command gives resident and virtual memory usage in KiB (the first column is resident memory, the second is virtual memory):
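(The exact invocation and its output were not preserved in this post; purely as a hypothetical reconstruction, a command that prints those two columns, RSS then VSZ in KiB, would be:

  ps -C fdbserver -o rss,vsz
)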
I see the problem. You have 49 TB of KV data and 56 processes. After triple replication, that is 49 × 3 / 56 ≈ 2.6 TB of KV data that each process is responsible for.
Storage servers have a data structure called the Byte Sample, which stores a deterministic random sample of keys. This data is persisted on disk in the storage engine and is loaded immediately upon storage server startup. Unfortunately, its size is not tracked or reported, but it grows linearly with KV size; based on the memory usage I’ve seen for smaller storage KV sizes, I suspect yours is somewhere around 4–6 GB.
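To illustrate what a deterministic random sample means here, a minimal conceptual sketch in Python (this is not FoundationDB's actual implementation; the real byte sample lives in the storage engine and its sampling rule differs): whether a key is sampled is a pure function of the key itself, so every replica holding that key makes the same decision, and size estimates derived from the sample agree across the cluster.

  import hashlib

  SAMPLING_FACTOR = 250  # mirrors the knob_byte_sampling_factor default

  def is_sampled(key: bytes) -> bool:
      # Deterministic: depends only on the key bytes, so every storage
      # server replica of this key makes the identical choice.
      h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
      return h % SAMPLING_FACTOR == 0

  def estimate_total_bytes(sampled_entry_sizes):
      # Each sampled entry stands in for roughly SAMPLING_FACTOR entries'
      # worth of bytes, yielding a cheap per-shard logical-size estimate.
      return sum(sampled_entry_sizes) * SAMPLING_FACTOR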
The byte sample rate is technically configurable, but changing its knob once a cluster has been created is undefined behavior. FDB relies on the byte sample’s determinism to know how much logical data is in each shard and on each storage server. Weird things will happen if you change this knob on an existing cluster, and the cluster may become unavailable. I don’t think any data loss would occur, but you could easily get into a situation that is hard to get out of.
If you need to reduce memory usage at your data sizes, you could reduce the cache memory setting.
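That is the cache_memory parameter in the [fdbserver] section (--cache-memory on the command line), which defaults to 2GiB. A sketch, with an illustrative value rather than a recommendation:

  [fdbserver]
  cache_memory = 1GiB   # default is 2GiB; a smaller cache trades memory for disk reads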
If you want to reduce the size of the byte sample itself, you would have to create a new cluster and migrate your data to it. You would also have to make sure that no storage server on the new cluster ever starts up without the knob override. The knob is knob_byte_sampling_factor and its default is 250; multiply it by N to shrink the byte sample by a factor of N.
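A sketch of that override, which would have to be in place for every storage process of the new cluster from its very first start (the value shown is 10 × the default, purely as an example):

  [fdbserver]
  knob_byte_sampling_factor = 2500   # 10x the default of 250 -> ~10x smaller byte sample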
Thank you for providing that valuable information. We’ll try it in our future benchmarking.
Also, we are curious whether the Byte Sample is a global data structure that each fdbserver process keeps a full replica of, or a local data structure that contains sampled keys only from the key-value pairs that the process itself manages.