Some Clarification on Storage Engine and Disk/IO

(Zoe) #1

I’m pretty new to FDB and I have some confusion here on the storage engine. Please correct me if I’m wrong and asking dumb questions.

For memory, fdb tries to fit everything into memory, but in the meantime it also logs data to disk for backup. So when the data size exceeds the memory, what will fdb do? Does it function like a cache where the oldest data is evicted? And when I want to access that piece of data, will fdb look it up on disk, causing long latency?

For ssd, does fdb commit all transactions directly to disk? Does it still have the memory functioning as a cache?

I’m currently benchmarking FDB and ran into a problem. I’m using 8 EC2 m5a.large instances, each with 2 vCPUs and 8 GB of memory. I attached a 4000 GB gp2 disk to each of them at launch (on “/dev/sda1”). I have only one fdbserver running on each instance, with triple redundancy.

I’m running a write benchmark with a key size of 256 bytes and a value size of 1000 bytes. At the very beginning, the throughput reaches ~3000 ops/sec, but it drops to <1000 ops/sec in less than 20 seconds. I’ve monitored the status details from fdbcli, as follows:

Process performance details:
  100.90.8.233:4500      ( 33% cpu; 20% machine; 0.040 Gbps; 92% disk IO; 5.8 GB / 7.1 GB RAM  )
  100.90.9.50:4500       ( 15% cpu; 10% machine; 0.016 Gbps; 92% disk IO; 4.3 GB / 7.1 GB RAM  )
  100.90.11.115:4500     ( 15% cpu; 10% machine; 0.008 Gbps; 93% disk IO; 4.2 GB / 7.2 GB RAM  )
  100.90.33.45:4500      ( 33% cpu; 21% machine; 0.046 Gbps; 94% disk IO; 6.2 GB / 7.1 GB RAM  )
  100.90.35.228:4500     ( 24% cpu; 16% machine; 0.030 Gbps; 93% disk IO; 4.4 GB / 7.1 GB RAM  )
  100.90.46.60:4500      ( 45% cpu; 28% machine; 0.065 Gbps; 93% disk IO; 5.8 GB / 7.1 GB RAM  )
  100.90.47.93:4500      ( 32% cpu; 19% machine; 0.043 Gbps; 87% disk IO; 4.2 GB / 7.0 GB RAM  )
  100.90.54.82:4500      ( 13% cpu;  9% machine; 0.009 Gbps; 90% disk IO; 4.2 GB / 7.2 GB RAM  )

I’ve noticed that the disk IO remains high at all times, so is it the bottleneck of my database? What do you suggest doing in my case?

Thank you in advance.

(Meng Xu) #2

When the memory storage engine is used, the disk will not be used for the storage server. When the entire data size (including replicated data) approaches the memory limit, the ratekeeper will kick in and throttle transactions, just as it does for other storage engines. Memory is not used as a cache for disk.

When the ssd storage engine is used, the storage server will “cache” the most recent mutations and KVs in memory and flush them to the SSD in batches. So memory is used as a cache in this situation.

(Meng Xu) #3

Clearly, the storage servers are limited by IOPS. Since the cluster has only 2*8 = 16 processes, you may want to increase the number of machines so that the new cluster has more storage servers. If your workload is distributed evenly across these storage servers, this decreases the IOPS demand on each storage server.
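To make the "more storage servers means less demand on each" argument concrete, here is a rough back-of-the-envelope sketch using the numbers from the original post. The function name and the even-distribution assumption are mine, not anything from FDB. It counts only logical replicated writes and ignores transaction-log writes, write amplification inside the storage engine, and data movement, so real disk traffic will be higher.

```python
def per_server_write_load(client_ops_per_sec, key_bytes, value_bytes,
                          replication_factor, storage_servers):
    """Estimate the logical write load landing on each storage server.

    Assumes writes are spread evenly across storage servers. Ignores
    transaction-log writes, storage-engine write amplification, and
    background data movement, so this is a lower bound on disk traffic.
    """
    replicated_ops = client_ops_per_sec * replication_factor
    ops_per_server = replicated_ops / storage_servers
    bytes_per_server = ops_per_server * (key_bytes + value_bytes)
    return ops_per_server, bytes_per_server


# Numbers from the original post: ~3000 ops/sec, 256-byte keys,
# 1000-byte values, triple redundancy. The server counts are hypothetical.
for servers in (8, 16, 32):
    ops, nbytes = per_server_write_load(3000, 256, 1000, 3, servers)
    print(f"{servers:>2} storage servers: {ops:7.1f} writes/s, "
          f"{nbytes / 1e6:5.2f} MB/s each")
```

Doubling the number of storage servers halves the per-server figure, which is the effect described above, provided the keys really are spread evenly.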

(A.J. Beamon) #4

Just to be clear, the disk is not used to answer read requests in the memory storage engine. All writes are durably written to disk and will be recovered into memory from disk if the process restarts.

(Meng Xu) #5

I asked @ajbeamon about the difference in person and he gave a great explanation. I’ve summarized it here for the record:

For the memory storage engine, the data size it can hold is limited by the memory size (say 512GB). All reads and writes are served directly from memory. Similar to Redis, it takes snapshots (and logs mutations) of the memory contents and makes them durable on disk, so that the storage engine can recover its memory contents from disk after a process crash.

For the ssd storage engine, the data size it can hold is limited by the disk size (say 2TB). A certain amount of memory (say 2GB) is used as a cache, which holds the most recently read and written data in the PageCache. If you read a lot of data, the new data read from disk will be swapped into memory and the old data in memory will be swapped out. The memory also holds 5 seconds of multi-version data to serve read requests.
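As a toy illustration of the "new data swapped in, old data swapped out" behaviour, here is a minimal sketch of a fixed-size read cache in front of a slower store. This is not FDB's actual page cache: the class name is made up, and least-recently-used eviction is an assumption chosen only to make the idea concrete.

```python
from collections import OrderedDict


class ToyPageCache:
    """Fixed-capacity read cache in front of a slower 'disk' dict (LRU)."""

    def __init__(self, disk, capacity):
        self.disk = disk              # stands in for the on-disk store
        self.capacity = capacity      # maximum number of cached entries
        self.pages = OrderedDict()    # least recently used entry first
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.pages:
            self.pages.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.pages[key]
        self.misses += 1                    # slow path: go to "disk"
        value = self.disk[key]
        self.pages[key] = value
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return value


disk = {f"key{i}": f"value{i}" for i in range(5)}
cache = ToyPageCache(disk, capacity=2)
for k in ("key0", "key1", "key0", "key2", "key1"):
    cache.read(k)
print(cache.hits, cache.misses, list(cache.pages))
```

Reads that hit the cache are cheap; once the working set is larger than the cache, most reads take the slow path, which is the latency difference between the two engines described above.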

(Zoe) #6

What’s the optimal disk I/O rate for a 100% write test then? I’m not sure how many more machines I will need to add to the cluster right now.

(Zoe) #7

I also encountered the warning

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

I know I have to configure process classes at this point, but how should I go about it?

(A.J. Beamon) #8

If you use fdbmonitor, then you can configure process classes as described here: https://apple.github.io/foundationdb/configuration.html#fdbserver-section. There’s a link in the description of the class parameter describing recommended configurations.

If you don’t use fdbmonitor, you can also just do this directly as an argument to fdbserver by doing something like --class storage.
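For the fdbmonitor route, the class goes into a per-process `[fdbserver.<port>]` section of foundationdb.conf as a `class = ...` line. Here is a small sketch that renders such sections. The helper name, the ports, and the class-to-port layout are hypothetical and only for illustration; check the linked configuration page for the layout recommended for your cluster.

```python
import configparser
import io


def render_fdbserver_sections(port_to_class):
    """Render [fdbserver.<port>] sections with a class setting per port."""
    parser = configparser.ConfigParser()
    for port, process_class in port_to_class.items():
        parser[f"fdbserver.{port}"] = {"class": process_class}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Hypothetical layout: three processes on one machine, one class each.
print(render_fdbserver_sections({
    4500: "storage",
    4501: "transaction",
    4502: "stateless",
}))
```

The printed text can be merged into the existing conf file by hand. It contains only the class lines and leaves out every other setting a real section may need.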

(Meng Xu) #9

Running a storage server and a tLog on the same process will hurt performance a lot:
They have different disk access patterns, and they compete for disk IO.

Can you make sure there is only 1 storage server per process and see whether performance improves?

(Zoe) #10

I have manually set each process to one of storage, stateless, or transaction, but the warning is still there.

(Meng Xu) #11

If the number of roles (e.g., storage servers, tLogs) is larger than the number of processes, some roles will be packed onto the same process.

Can you check how many roles and processes (stateless and stateful), especially those with class_type=storage, are used in the cluster?
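One way to do that check is to save the output of `status json` from fdbcli and count classes and roles with a short script. The sketch below assumes the JSON has a `cluster.processes` map whose entries carry `address`, `class_type`, and a `roles` list with a `role` field, and that the role names are `log` and `storage`. Please verify those field names against your FDB version. The sample data is made up.

```python
def count_by_class(status):
    """Count processes per class_type in a parsed `status json` document."""
    counts = {}
    for process in status.get("cluster", {}).get("processes", {}).values():
        cls = process.get("class_type", "unset")
        counts[cls] = counts.get(cls, 0) + 1
    return counts


def processes_sharing_log_and_storage(status):
    """Return addresses of processes holding both a log and a storage role."""
    shared = []
    for process in status.get("cluster", {}).get("processes", {}).values():
        roles = {r.get("role") for r in process.get("roles", [])}
        if {"log", "storage"} <= roles:
            shared.append(process.get("address"))
    return sorted(shared)


# Made-up sample in the assumed shape; in practice, load the real
# document with json.load() from a file saved from fdbcli.
sample = {"cluster": {"processes": {
    "a": {"address": "10.0.0.1:4500", "class_type": "storage",
          "roles": [{"role": "storage"}, {"role": "log"}]},
    "b": {"address": "10.0.0.2:4500", "class_type": "transaction",
          "roles": [{"role": "log"}]},
    "c": {"address": "10.0.0.3:4500", "class_type": "stateless",
          "roles": [{"role": "proxy"}]},
}}}

print(count_by_class(sample))
print(processes_sharing_log_and_storage(sample))
```

Any address printed by the second function is a process that would trigger the warning quoted earlier in the thread.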