We operate FDB in AWS on EC2 instances, with EBS for all storage. While we can scale FDB out many times by adding more EC2 instances, there are cases where we need to migrate all the data from one FDB cluster to another. We call this a “wiggle”. We wiggle for reasons like 1) the instances have reached their EBS throughput limit and we need to upsize them, or 2) the SQLite btree has become too fragmented. The wiggle works by setting up a new cluster whose processes start out excluded, and then including the new servers’ processes while excluding the old servers’ processes, until all of the old processes are excluded.
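For concreteness, here is a minimal sketch of that exclude/include loop as fdbcli commands. The addresses are hypothetical, and in practice we batch the moves and wait for data distribution to settle between steps:

```
# 1) Pre-exclude the new servers' addresses before their processes join:
fdbcli --exec 'exclude 10.0.2.1 10.0.2.2 10.0.2.3'

# 2) Per batch: let a new server in, then push an old one out.
#    A plain `exclude <addr>` blocks until its data has been moved off.
fdbcli --exec 'include 10.0.2.1'
fdbcli --exec 'exclude 10.0.1.1'

# 3) Monitor progress; `exclude` with no arguments lists excluded servers.
fdbcli --exec 'status details'
fdbcli --exec 'exclude'
```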
We currently have a cluster with 288 storage processes plus other roles, and roughly 1.5K clients connected to it (about half short-lived monitoring clients, half persistent clients). This is the largest cluster we’ve built thus far. We started the wiggle over to a new cluster by adding a similar number of worker processes, all excluded. As soon as we did, the cluster went down. We went through several recoveries and cluster coordinator re-elections, and we had to shut down the new cluster’s processes in order to get the old cluster to recover successfully.
We suspect that the cluster controller (CC) was overloaded and unable to heartbeat with the coordinators in a timely manner, so the coordinators nominated a new CC. This kept happening, multiple times. We also saw the CC being trigger-happy: it would find a “better” master and kill the current one, logging trace events of type BetterMasterExists, and then of course the controller itself would get replaced.
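For anyone hitting something similar, this is roughly how we confirmed the churn from the trace logs. The path assumes the default XML trace output and may differ in your deployment:

```
# Count how often the CC killed the master for a supposedly better one:
grep -c 'Type="BetterMasterExists"' /var/log/foundationdb/trace.*.xml

# Walk the recovery state machine to see how far each recovery got:
grep 'Type="MasterRecoveryState"' /var/log/foundationdb/trace.*.xml | tail -n 20
```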
- Have you seen similar issues with large clusters?
- There is clearly a limit to the number of workers a single CC can handle. Adjusting timeout knobs can probably push this limit higher (see the sketch after this list).
- Wiggles from one large cluster to another need different and more cumbersome steps.
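On the knob point above, this is the kind of change we have in mind, expressed as a foundationdb.conf fragment. The knob names and values are assumptions on our part and need to be checked against Knobs.cpp for your FDB version; we have not validated them ourselves:

```
[fdbserver]
# Assumed failure-monitor timeout knobs; verify the exact names and
# defaults for your FDB version before relying on these.
knob_failure_min_delay = 8.0
knob_failure_max_delay = 10.0
```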
Could you please share your experiences with large clusters? And if you do migrate to another, larger cluster, how do you achieve that?