So, I have been stress-testing rev 1 of deploying FDB to AWS. The subject is covered piecemeal across a load of forums; I'm surprised there is no single-stop documentation or set of templates for spinning up a reliable cluster.
3 x c5d.large servers (small, but enough for testing).
NVMe ephemeral drives for storage, double redundancy mode for safety. Backups to S3.
Let FDB determine process types.
I wrote CloudFormation scripts to:
1. launch a first server
2. mount the NVMe drive
3. install FDB, point its logs and data at the NVMe drive, and run two processes (ports 4500 and 4501)
4. push the cluster file to S3
5. launch the remaining servers
6. repeat steps 2-3 on each
7. pull the cluster file from S3
8. configure the ssd storage engine with double redundancy
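For step 3, the foundationdb.conf change might look roughly like this (a sketch: the mount point /mnt/nvme is a placeholder; the ports are the 4500/4501 above, and leaving `class` unset is what lets FDB determine process types):

```ini
## /etc/foundationdb/foundationdb.conf (excerpt)
[fdbserver]
## Settings shared by all fdbserver processes on this host; point
## data and logs at the ephemeral NVMe mount (path is an assumption).
datadir = /mnt/nvme/foundationdb/data/$ID
logdir = /mnt/nvme/foundationdb/logs

## Two processes per server; no "class =" line, so FDB assigns roles.
[fdbserver.4500]
[fdbserver.4501]
```

Step 8 is then a one-liner from fdbcli on the first server: `configure new ssd double`.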
This hummed along fine for a while. Then one server crashed. I rebooted it and the cluster came back, but I could not confirm data health. Then the same server hit an OOM fault on startup. The cluster limped along while I tried to restore health. Finally the server bricked entirely: the ephemeral drive suffered data corruption (I think), so AWS discarded the disk and rebooted the instance with a fresh one. That destroyed all data on the drive and left the instance unbootable, because the /etc/fstab entry for the NVMe drive pointed at a UUID that no longer existed.
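The fstab failure mode at least seems avoidable. A sketch of a hardened mount, assuming the label `fdb-data`, mount point `/mnt/nvme`, and device `/dev/nvme1n1` (all placeholders): mount by filesystem label rather than UUID, add `nofail` so boot proceeds even when the label is missing, and have the boot script re-format a blank replacement disk with the same label.

```shell
#!/usr/bin/env bash
# Hypothetical hardening for the NVMe mount step. Mounting by LABEL
# instead of UUID means a re-formatted replacement disk stays valid,
# and "nofail" keeps a missing/blank disk from blocking boot.
set -euo pipefail

# At format time (first boot, or after AWS swaps in a fresh disk),
# the boot script would run (destructive, shown as a comment only):
#   mkfs.ext4 -L fdb-data /dev/nvme1n1

fstab_line="LABEL=fdb-data /mnt/nvme ext4 defaults,noatime,nofail 0 2"

# Append idempotently. Writing to a temp copy here so the sketch is
# safely runnable; real user-data would target /etc/fstab.
fstab=$(mktemp)
grep -qF "$fstab_line" "$fstab" || echo "$fstab_line" >> "$fstab"
cat "$fstab"
```

With `nofail` in place, the worst case after a disk swap is an empty mount point, which the boot script can detect, re-format, and re-mount, instead of a bricked instance.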
Is anyone else running FDB on ephemeral NVMe drives, and have you run into similar issues? Is EBS the only sane option for storage?