So, I have been stress testing rev 1 of deploying FDB to AWS. This subject is covered piecemeal across a load of forums, and I'm surprised there is no one-stop documentation or set of templates for spinning up a reliable cluster.
Rev 1:
3 x c5d.large servers (small, but this is just for testing).
Use the ephemeral NVMe for storage, double redundancy mode, backups to S3 (see the fdbcli sketch after this list).
Let FDB determine process types.
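In fdbcli terms that boils down to roughly the following. The bucket, credentials, and the exact blobstore URL format are placeholders here and depend on your FDB version, so treat this as a sketch rather than a recipe:

```
# Create the database with the ssd storage engine and double redundancy
# ("configure new ..." only on first creation; later changes drop the "new").
fdbcli --exec "configure new double ssd"

# Kick off a backup to S3 via FDB's blobstore URL (placeholder credentials,
# endpoint, and bucket; check the backup docs for your FDB version).
fdbbackup start -d "blobstore://KEY_ID:SECRET@s3.us-east-1.amazonaws.com/rev1?bucket=my-fdb-backups"
```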
I wrote CloudFormation scripts to do the following (a rough sketch of the per-server bootstrap follows the list):
1. launch the first server
2. mount the NVMe drive
3. install FDB, change the config to point logs and data at the NVMe drive, and run 2 processes (ports 4500 and 4501)
4. push the cluster file to S3
5. launch the remaining servers
6. repeat steps 2-3 on each
7. pull the cluster file from S3
8. configure ssd, double
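For concreteness, the per-server part of those scripts boils down to something like the sketch below. The device name, mount point, and bucket are assumptions from my setup, and the package install step is elided:

```
#!/usr/bin/env bash
set -euo pipefail

# Assumed device/paths for the instance-store NVMe and the shared S3 bucket.
DEV=/dev/nvme1n1
MNT=/mnt/fdb
BUCKET=s3://my-fdb-bucket    # placeholder bucket name

# Steps 2-3: format and mount the ephemeral drive, then point FDB at it.
mkfs.ext4 -F "$DEV"
mkdir -p "$MNT"
mount -o noatime "$DEV" "$MNT"

# (FDB client/server packages installed here; omitted.)
mkdir -p "$MNT/data" "$MNT/logs"
chown -R foundationdb:foundationdb "$MNT"
sed -i "s|^datadir = .*|datadir = $MNT/data/\$ID|" /etc/foundationdb/foundationdb.conf
sed -i "s|^logdir = .*|logdir = $MNT/logs|"        /etc/foundationdb/foundationdb.conf

# A second [fdbserver.<port>] section gives the second process on 4501.
echo '[fdbserver.4501]' >> /etc/foundationdb/foundationdb.conf
service foundationdb restart

# Step 4 on the first server (step 7 on the rest is the reverse copy).
aws s3 cp /etc/foundationdb/fdb.cluster "$BUCKET/fdb.cluster"
```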
This hummed along fine for some time. Then one server crashed. I rebooted it and the cluster returned, but it could not confirm data health. Next I got an OOM fault from the same server on startup, and the cluster limped along while I tried to find ways to get it healthy again. Then the server bricked. There was (I think) data corruption on the ephemeral drive, and AWS dumped the disk and rebooted the instance with a new one. This destroyed all the data on it and bricked the server, because the /etc/fstab entry for the NVMe drive pointed to a UUID that no longer existed.
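In hindsight, at least the fstab part of that brick is avoidable: mounting the instance-store volume with nofail, or preparing it from a boot script instead of pinning a UUID, lets the instance come back up when AWS swaps the disk. A sketch of what I'm moving to (device name assumed):

```
# /etc/fstab alternative: "nofail" keeps boot from dropping into emergency
# mode when the referenced device/UUID is gone after a disk swap.
#   /dev/nvme1n1  /mnt/fdb  ext4  defaults,noatime,nofail  0  2

# Or skip fstab entirely and (re)prepare the drive at boot, so a fresh
# replacement disk just gets re-formatted instead of failing a UUID match.
DEV=/dev/nvme1n1
if ! blkid "$DEV" >/dev/null 2>&1; then
  mkfs.ext4 -F "$DEV"   # brand-new replacement disk, no filesystem yet
fi
mkdir -p /mnt/fdb
mount -o noatime "$DEV" /mnt/fdb
```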
Is anyone else running FDB on ephemeral NVMe drives and has run into similar issues? Or is EBS the only real option for storage?
I know of a user running on ephemeral disks who saw an IO controller failure (or a similar fault) that resulted in a hung cluster, because the disk had a tlog on it. Other than that instance (which was fixed by killing the process), I have not heard of complaints similar to yours. That was also on the c5d family, I think, though on larger instances.
Your configuration of only 4GB of memory and 2 processes is not ideal; the recommendation is a minimum of 4GB per process. Instead of running 2 processes per host on that instance size, try running just one. This is probably why you OOM'd if there was an active workload stressing the cluster; if it were idle, that would be another story…
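On a 4GB host that basically means keeping foundationdb.conf down to a single [fdbserver.&lt;port&gt;] section. A minimal sketch, mirroring the packaged defaults except for the NVMe paths from your post (those paths are assumptions):

```
# Minimal single-process foundationdb.conf for a 4GB host (sketch only).
cat > /etc/foundationdb/foundationdb.conf <<'EOF'
[fdbmonitor]
user = foundationdb
group = foundationdb

[general]
cluster-file = /etc/foundationdb/fdb.cluster
restart-delay = 60

[fdbserver]
command = /usr/sbin/fdbserver
public-address = auto:$ID
listen-address = public
datadir = /mnt/fdb/data/$ID
logdir = /mnt/fdb/logs

[fdbserver.4500]
# one fdbserver section -> one process on this host
EOF
service foundationdb restart
```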
100% agreed. The config is simply for a few stages of testing. I'm just ironing out the automation bugs and testing backup and restore under a minimal workload, so I thought I'd save some pennies with the small instances. The OOM was surprising, but it appears some processes are memory intensive (the transaction servers, it looks like). The intent WAS to move to the following configuration:
3 x i3.xlarge - storage class
3 x i3.large - transaction class
3 x c5.xlarge - stateless class
But given the input that 8GB per process is the actual requirement, I'll have to rethink this. (The role pinning itself looks like a one-line knob per host; sketch below.)
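For reference, the role pinning in that layout would have been nothing more than the class option in foundationdb.conf, set once per host type. Something like this, purely illustrative:

```
# Pin all processes on a host to one class by adding "class = ..." to the
# global [fdbserver] section (per-port sections can override it). The sed
# one-liner is just illustrative; a real template would bake it into the conf.
sed -i '/^\[fdbserver\]$/a class = storage' /etc/foundationdb/foundationdb.conf   # i3.xlarge tier
# class = transaction on the i3.large tier, class = stateless on the c5.xlarge tier
service foundationdb restart
```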
What is making this difficult to diagnose is that the instance bricked: no SSH, no logs, etc. Here's hoping the original failure that led to the disk mount error was simply the OOM.
For some background on this project: I selected FDB for the flexibility, scalability, and speed I can gain on some unique data models used for ML and BI. A lot of the big data models are write-once read-many, and using the ephemeral NVMe drives gives us a cost and speed boost that is worth it if it can be made to work. Since the development team is small, our API also uses FDB for its resources. After a series of events, I'm now having to take on the deployment of FDB in the cloud, so I need to become an FDB DBA (unless someone has made a managed FDB product in the cloud??). Right now I'm learning all I can by poring over these forums, finding bits of information, and piecing them together. Any interest in starting a "best practices in cloud deployment" guide? I'd help all I can.
For roughly the same budget I would deploy 5 i3.xlarge. The problem with specializing machines for roles in a small cluster like that is that you've actually increased the number of machines whose failure will cause a recovery, without gaining any extra machine redundancy for your data. I would also advise a triple replicated configuration instead of double, unless this data is being imported from some external source you can easily get back.
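The redundancy change itself is a single fdbcli command once the machines are there, roughly:

```
# Switch from double to triple replication; FDB re-replicates in the
# background. Watch "status details" until the data is fully replicated.
fdbcli --exec "configure triple ssd"
fdbcli --exec "status details"
```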
That configuration with (3 x 2) cores of transaction processes is probably more than you’d need to service (3 x 4) cores of storage processes.
If you must specialize machines for roles, I would put transaction and stateless together, on as few of the largest servers as you can.
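That is, on the few non-storage boxes, give one process the transaction class and one the stateless class, and leave everything else as pure storage. A sketch (ports arbitrary, and the conf surgery is illustrative rather than a polished recipe):

```
# Sketch: one host running both a transaction (tlog) process and a stateless
# process. Drop the stock empty [fdbserver.4500] section and add classed ones.
sed -i '/^\[fdbserver\.4500\]$/d' /etc/foundationdb/foundationdb.conf
cat >> /etc/foundationdb/foundationdb.conf <<'EOF'
[fdbserver.4500]
class = transaction

[fdbserver.4501]
class = stateless
EOF
service foundationdb restart
```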
Fair point. The idea of configuring so many transaction processes was to be able to scale as our customer base converts to this solution: more data? Just add storage servers. In my mind a tiered scaling approach would keep changes manageable up to the point where scaling the cluster's core infrastructure would be a great problem to have. It is possible for us to scale from single-digit TBs to hundreds of TBs of managed data within a year or two.
This scheme does seem overly complicated now; I can always configure machine roles later if needed. Thanks for the input! I've outlined where I'm getting my info, so if you have pointers to additional resources on best practices, I'd appreciate them.