FoundationDB

FDB Server deployment resources?


(gaurav) #1

Hi there,

I read on hackernews that FDB server is a single threaded process (https://news.ycombinator.com/item?id=8729420). Is that correct?

If it is indeed a single threaded server, then what is a suggested deployment model for FDB server on multi-core machines if we want to utilize all cores? Should we run multiple FDB servers on every single machine in the cluster? But in that case, a single machine or disk going bad will bring down multiple FDB instances going down…

Is there any help/guidance on how should required memory be computed for the server instace? Is it more like - there is minimum X GB RAM that is required, and after that the more you have, the better will be performance (up-to a limit)? Or is there a more systematic way to reason about how much memory to allocate to a server instance?


thanks,
gaurav


(Mike McMahon) #2

Howdy Guarav,

We deploy FDB with multiple processes, each one consuming a core and working together on a particular machine. If say we have 400GB of RAM on the system we will divide the top half of the system (200GB) evenly into as many FDB Processes as we can depending on the needs of the storage procs and transaction procs present. Typically we run in a 2:1 fashion with 1 dedicated transaction proc for every two storage procs.

Our usecase on 3.x is as such

(totalMem / 2 / 14) = totStorageProcs # for memory database 
transProcs = (totStorageProcs/2)      # for memory database
ssdStorageProcs = totalStorageProcs   # ssd procs are ran in a configuration that is equal to the number of memory storage procs

14 is of course the amount of memory we allocate to the storage procs with the ssd procs using the default memory configuration unless more is needed.

This results in large scale deployments at petabyte scale :slight_smile:

the configuration of proxies/logs/resolvers is a bit of tweaking to find the right amount to keep the latency as low as possible.


(matthew zeier) #3

And allows us to run memory clusters that see > 800,000 writes/second and SSD clusters that do > 30,000 reads or writes /second.

Memory cluster:

SSD cluster:


(matthew zeier) #4

Some of the magic in running clusters of this size (in AWS) is how we use instance store SSD/NVMe as EBS caching.

We’ve had moving to ZFS & ZFS ARC on our wishlist for some time but need to work through how fdb opens files since ZFS doesn’t support O_DIRECT (specifically we removed O_DIRECT and replaced it with O_SYNC and then I chickened out on deploying that change to prod).


(matthew zeier) #5

Same cluster but we had did have a period last week were we bursted like 2x. Same hardware, no real changes.

And for context, this was with 9x i3.16xlarge with 16T disk (8x 2TB EBS in an LVM stripe).

Memory cluster (2x write):


(gaurav) #6

Thanks for the suggestions! These will be very useful once I move to deployment step.


(Richard Low) #7

You should also consider disk resources. Since transaction logs have to fsync writes before they are acknowledged you probably want to run the logs on dedicated disks to have consistent write performance.

For storage, it’s also reasonable to run one storage server process per disk. This lets you use more cores and, in fact, depending on your core to disk ratio (and probably disk performance) you may want to run multiple processes per disk to perform better in CPU bound workloads e.g. high throughput cacheable reads.

As for your comment on failures bringing down multiple processes - the main concern is probably the amount of data that needs to be replicated on a failure. If a whole machine fails, however many processes you run per machine won’t affect it. But a nice property of having processes not spanning disks is a disk failure only takes down the process(es) that use it so only that data needs to be replicated. This improves your reliability and reduces network traffic.

To add to the discussion on memory allocation, since the cache is part of the process memory (FDB bypasses OS filesystem caches) you should consider how much cache you want in the memory sizing calculation.


(Mike McMahon) #8

on the point of disks; for our SSD based databases we use LVM raids in a stripe w/ NVME caches (write-through) fronting each underlying volume (not the raid itself). We have found this provides the most performance and bang for your buck.

Our in memory databases use throughput optimized devices since we rarely read from them.


(Oliver Mannion) #9

What do you use to do your write-through caching? flashcache/enhanceIO?


(Clement Pang) #10

enhanceio with write-thru/read-only. that’s mainly because aws doesn’t differentiate between reads and writes to EBS volumes (so less reads means more writes).