I’m pretty new to FDB and I have some confusion here on the storage engine. Please correct me if I’m wrong and asking dumb questions.
For the memory engine, fdb tries to fit everything into memory, but at the same time it also logs data to disk for durability. So what does fdb do when the data size exceeds memory? Does it function like a cache where the oldest data is evicted? And when I want to access that piece of data, will fdb look it up on disk, causing long latency?
For the ssd engine, does fdb commit all transactions directly to disk? Does memory still function as a cache?
I’m currently benchmarking FDB and have run into a problem. I’m using eight EC2 m5a.large instances, each with 2 vCPUs and 8 GB of memory, and I attached a 4000 GB gp2 disk to each of them at launch (on “/dev/sda1”). I have only one fdbserver running on each instance, with triple redundancy.
I’m running a write benchmark with 256-byte keys and 1000-byte values. At the very beginning the throughput reaches ~3000 ops/sec, but it drops to <1000 ops/sec in less than 20 seconds. I’ve monitored status details from fdbcli as follows:
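For reference, a workload like the one described (256-byte keys, 1000-byte values, measuring ops/sec) can be sketched roughly as below. This is a hypothetical reconstruction, not the benchmark tool actually used; the batch size and key layout are assumptions.

```python
import os
import time

KEY_SIZE = 256
VALUE_SIZE = 1000

def make_kv(i):
    # Zero-pad a counter to a fixed 256-byte key; values are random bytes.
    key = str(i).encode().rjust(KEY_SIZE, b"\x00")
    return key, os.urandom(VALUE_SIZE)

def run_benchmark(db, n_ops=10_000, batch=10):
    # Write n_ops key-value pairs, `batch` per transaction, and return ops/sec.
    start = time.time()
    for base in range(0, n_ops, batch):
        tr = db.create_transaction()
        for i in range(base, base + batch):
            k, v = make_kv(i)
            tr[k] = v
        tr.commit().wait()
    return n_ops / (time.time() - start)

# Usage against a live cluster (assumes the FoundationDB Python binding):
#   import fdb; fdb.api_version(630)
#   print(run_benchmark(fdb.open()))
```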
When the memory storage engine is used, the disk is not used by the storage server. When the total data size (including replicated data) approaches the memory limit, Ratekeeper kicks in and throttles transactions, just as it does for the other storage engines. Memory is not used as a cache for disk.
When the ssd storage engine is used, the storage server “caches” the most recent mutations and key-value pairs in memory and flushes the memory content to the ssd in batches. So memory is used as a cache in this situation.
Clearly, the storage servers are limited by IOPS. Since the cluster has only 2*8 = 16 processes, you may want to increase the number of machines so that the new cluster has more storage servers. If your workload is distributed evenly across the storage servers, this decreases the IOPS demand on each one.
Just to be clear, the disk is not used to answer read requests in the memory storage engine. All writes are durably written to disk and will be recovered into memory from disk if the process restarts.
I asked @ajbeamon about the difference in person and he gave a great explanation. I summarized it here for record:
For the memory storage engine, the data size it can hold is bounded by the memory size (say 512 GB). All reads and writes are served directly from memory. Similar to Redis, it takes snapshots of memory (plus mutation logs) and makes them durable on disk, so that the storage engine can recover its memory content from disk in case of a process crash.
For the ssd storage engine, the data size it can hold is bounded by the disk size (say 2 TB). A certain amount of memory (say 2 GB) is used as a cache, holding the most recently read and written pages. If you read a lot of data, newly read pages are brought into memory and older pages are evicted. The memory also holds 5 seconds of multi-version data to serve read requests.
WARNING: A single process is both a transaction log and a storage server.
For best performance use dedicated disks for the transaction logs by setting process classes.
I know I have to configure process classes at this point, but how should I approach it?
If you use fdbmonitor, then you can configure process classes as described here: Configuration — FoundationDB 7.1. There’s a link in the description of the class parameter describing recommended configurations.
If you don’t use fdbmonitor, you can also just do this directly as an argument to fdbserver by doing something like --class storage.
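To make that concrete, here is a hypothetical foundationdb.conf fragment for the fdbmonitor route; the port numbers are made up, and the recommended class mix for a given cluster size is in the linked Configuration docs:

```ini
# One fdbserver section per process; pin each to a role.
[fdbserver.4500]
class = transaction

[fdbserver.4501]
class = storage
```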
Running storage server and tLog on the same process will hurt performance a lot:
They have different disk access patterns, and they compete for disk I/O.
Can you try to make sure there is one storage server per process and see if the performance improves?
This problem is resolved now. However, I’ve run into another problem: when a server’s memory usage reaches 7.0/7.1 GB RAM (from the fdbcli status details command), the server crashes and becomes unreachable, and the database starts “healing”. After a while the server magically comes back… What can I do about this?
It seems to me that the storage server process runs out of memory. Is this a recurring thing or a one-time thing?
As far as I know, two storage servers may be recruited on the same process, doubling that process’s memory usage. This could be one cause of the out-of-memory crashes.
Can you check the location (ip:port address) of each storage server in fdbcli status json?
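One way to do that check is to count storage roles per process in the status json output. A small sketch, assuming the usual status json layout (cluster.processes.*.address and a roles list on each process):

```python
def storage_roles_by_process(status):
    # Map process address -> count of "storage" roles recruited on it.
    # Any address mapping to more than 1 is running doubled-up storage servers.
    counts = {}
    for proc in status.get("cluster", {}).get("processes", {}).values():
        n = sum(1 for r in proc.get("roles", []) if r.get("role") == "storage")
        if n:
            counts[proc["address"]] = n
    return counts

# Usage with a live cluster:
#   import json, subprocess
#   raw = subprocess.check_output(["fdbcli", "--exec", "status json"])
#   print(storage_roles_by_process(json.loads(raw)))
```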