I’m looking to deploy FoundationDB on AWS and could really use some guidance. I’ve been digging through the docs and forums, but I’d love to hear from anyone who has hands-on experience with this.
Specifically, I’d like to learn about the following:
Instance Types: What AWS instance types have you found to be the most effective for running FoundationDB? Is there a sweet spot between performance and cost?
Cluster Setup: Any tips on the best way to set up the cluster? How many nodes should I be looking at for a moderately sized application?
Networking: Are there any specific networking considerations I should be aware of when deploying on AWS? Any gotchas with VPCs, subnets, or security groups?
Storage Options: What storage solutions are you using on AWS? Should I stick with EBS, or is there a better option out there for FoundationDB? While looking into storage options I came across these resources: "AWS Deployment Options - #7 by yennie", "AWS devops interview questions", and "foudationdb deploy aws". According to them, I should use NVMe ephemeral storage, but I’m not familiar with it.
Backup and Recovery: How are you handling backups and recovery in your AWS deployment? Any tools or scripts you’d recommend?
Monitoring: What are you using for monitoring and alerting? Any particular services or setups that integrate well with FoundationDB on AWS?
Thanks in advance for any advice you can offer. Looking forward to hearing from you all!
At Adobe, we run FoundationDB clusters in both Azure and AWS. See my responses to your questions below.
Storage options and instance types:
We only use ephemeral storage, since the read/write performance of managed disks doesn’t meet our application requirements. One downside of this choice is that storage and compute are coupled, i.e., they cannot be scaled independently.
i3 and i3en are the two instance types we evaluated on AWS. Both these instance types come with high-performance ephemeral local storage which can drive high throughput workloads. i3en instances have roughly 2.6 times more SSD storage than i3 instances. Depending on your ratio of storage vs throughput required, one might be a better option than the other.
Cluster setup:
The number of nodes will depend on a couple of factors. You will need to decide which replication/fault tolerance configuration you want to run, i.e., double, triple, three_data_hall or multi_region. I covered this topic in my FDB meetup talk; you might find it useful. You can also join this meetup group to be notified of future meetup events.
For example, if you want to store 2TB of data (say 1B x 2KB records) with replication factor 3, go with a 20 x i3 node cluster running in triple configuration. This will result in ~40-50% disk utilization (with 3x data replication plus the Redwood storage engine's amplification, the on-disk footprint ends up closer to 7.5x the logical data size). Assume 5 nodes in the cluster will be non-storage (proxies, logs, stateless) and 15 will be storage-only. You can adjust the number of storage nodes depending on your storage and performance requirements.
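The arithmetic behind that example can be sketched roughly as follows. This is only a back-of-the-envelope helper, not an official sizing tool: the function names are made up for illustration, the 2.0 TB of usable local disk per storage node is a hypothetical figure (check the actual i3 size you pick), and 2KB is treated as 2000 bytes.

```python
import math


def logical_tb(records, record_bytes):
    """Logical (pre-replication) data size in decimal terabytes."""
    return records * record_bytes / 1e12


def disk_utilization(data_tb, amplification, disk_tb_per_node, storage_nodes):
    """Fraction of total storage-node disk consumed after amplification."""
    return data_tb * amplification / (disk_tb_per_node * storage_nodes)


def storage_nodes_needed(data_tb, amplification, disk_tb_per_node, target_utilization):
    """Storage nodes required to stay at or below the target utilization."""
    on_disk_tb = data_tb * amplification
    return math.ceil(on_disk_tb / (disk_tb_per_node * target_utilization))


# Numbers from the example above; DISK_TB_PER_NODE is an assumption.
DATA_TB = logical_tb(1_000_000_000, 2000)   # 1B records x 2KB
AMPLIFICATION = 7.5                         # 3x replication plus storage overhead
DISK_TB_PER_NODE = 2.0                      # hypothetical usable NVMe per node
NON_STORAGE_NODES = 5                       # proxies, logs, stateless

nodes = storage_nodes_needed(DATA_TB, AMPLIFICATION, DISK_TB_PER_NODE, 0.5)
total_nodes = nodes + NON_STORAGE_NODES
```

With those assumed inputs the helper lands on 15 storage nodes and 20 nodes in total. Plug in your own record counts and the real disk size of your instance type before trusting the result.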
For production deployments, use the three_data_hall or multi_region configurations to have the cluster span multiple AZs and/or regions for resiliency. Note that inter-AZ and inter-region network bandwidth is also a cost consideration.
Also look into using compute-optimized instances (instead of i3s) for stateless nodes to save on cost.
Backup and Recovery:
FDB’s backup/restore tool works well with the S3 backend. In Azure we ran into issues using the Azure blob backend, so we use s3proxy as a workaround.
We wrote a service that uses FDB tools to submit and monitor backup/restore jobs for various FDB clusters for operational convenience.
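To give a rough idea of what such a wrapper might look like (this is not our actual service), here is a minimal sketch that builds `fdbbackup start` and `fdbbackup status` command lines and optionally runs them. The helper names, the default tag and the placeholder backup URL are assumptions for illustration; check the flags and the `blobstore://` URL format against the backup documentation for your FDB version.

```python
import subprocess


def backup_start_cmd(cluster_file, backup_url, tag="default"):
    """Command line to submit a backup of the cluster to backup_url."""
    return ["fdbbackup", "start", "-C", cluster_file, "-d", backup_url, "-t", tag]


def backup_status_cmd(cluster_file, tag="default"):
    """Command line to query the state of the backup with the given tag."""
    return ["fdbbackup", "status", "-C", cluster_file, "-t", tag]


def run(cmd):
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical values: replace the path, host, backup name and bucket with your own.
CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"
BACKUP_URL = "blobstore://<key>:<secret>@<s3-host>/<backup-name>?bucket=<bucket>"

start_cmd = backup_start_cmd(CLUSTER_FILE, BACKUP_URL, tag="nightly")
status_cmd = backup_status_cmd(CLUSTER_FILE, tag="nightly")
# run(start_cmd) would submit the job; run(status_cmd) can then be polled.
```

Keeping command construction separate from execution makes it easy to log or unit test what will be run before pointing it at a real cluster.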
Networking:
FDB requires that its clients be able to connect to all (or most) of the nodes of the FDB cluster. In AWS, we use VPC peering; a transit gateway would be another option.
Clients would also need the FDB cluster file to connect to the FDB cluster. You might have to build a service or registry for distributing the cluster file to clients.
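For reference, the cluster file holds a single connection string of the form `description:ID@ip:port,ip:port,...` listing the coordinators. Below is a small sketch of how a distribution service might render and parse it, plus a basic TCP reachability check. The function names and addresses are hypothetical, the check assumes plain (non-TLS) IPv4 or hostname addresses, and it only probes coordinators, whereas clients also need to reach the other nodes.

```python
import socket


def format_cluster_file(description, cluster_id, coordinators):
    """Render cluster file contents from a list of 'host:port' coordinators."""
    return f"{description}:{cluster_id}@{','.join(coordinators)}\n"


def parse_cluster_file(text):
    """Return (description, cluster_id, coordinators) from cluster file text."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if len(lines) != 1:
        raise ValueError("expected exactly one connection string")
    prefix, _, addrs = lines[0].partition("@")
    description, _, cluster_id = prefix.partition(":")
    if not (description and cluster_id and addrs):
        raise ValueError("malformed connection string")
    return description, cluster_id, addrs.split(",")


def is_reachable(address, timeout=2.0):
    """True if a TCP connection to 'host:port' succeeds within the timeout."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical cluster: made-up description, ID and private IPs.
contents = format_cluster_file("mydb", "abc123", ["10.0.1.10:4500", "10.0.2.10:4500"])
description, cluster_id, coordinators = parse_cluster_file(contents)
```

Running `is_reachable` for each coordinator from a client subnet is a quick way to catch security group or routing mistakes before debugging the client itself.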
Kubernetes: I don’t know if this is an option for you (and we don’t use it for FDB in production), but there is a Kubernetes operator. We completed a PoC recently, and I can confirm that for read/write workloads and backup/restore, we see performance comparable to a VM-based deployment.