Feedback on New Deployment Topology

ryanworl · January 11, 2022, 5:04pm

I’m interested in shrinking our minimum cluster size for new clusters and adding more fault tolerance. Today we deploy in a topology like this:

three_data_hall mode
5x pods running coordinator class
9x pods running unset class
1x pod running stateless class

All evenly spread across AZs in Kubernetes (not using the operator), using StatefulSets for the coordinator and unset classes and a regular k8s Deployment for the stateless class.

We’d like to switch to this topology:

three_data_hall mode
9x pods, all pods run unset class

We then change our automation that replaces dead coordinators to enforce that when adding more unset pods, coordinators can only run on pod ordinals 0-8. So all ordinals 9+ can run anything but a coordinator.
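A minimal sketch of that ordinal rule, assuming pod names end in the usual StatefulSet `-<ordinal>` suffix. The helper names and the `fdb-N` pod names are hypothetical, and this is not the actual automation. It only filters candidates by ordinal; it does not model keeping coordinators spread across AZs.

```python
import re

# Ordinals 0-8 may host a coordinator, per the rule described above.
MAX_COORDINATOR_ORDINAL = 8


def pod_ordinal(pod_name: str) -> int:
    """Return the trailing StatefulSet ordinal of a pod name such as 'fdb-3'."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no ordinal suffix in pod name: {pod_name!r}")
    return int(match.group(1))


def eligible_replacements(live_pods, current_coordinators):
    """Live pods that may replace a dead coordinator, lowest ordinal first."""
    current = set(current_coordinators)
    candidates = [
        pod
        for pod in live_pods
        if pod not in current and pod_ordinal(pod) <= MAX_COORDINATOR_ORDINAL
    ]
    return sorted(candidates, key=pod_ordinal)
```

For example, with twelve live pods `fdb-0` through `fdb-11` and coordinators currently on `fdb-0` through `fdb-7`, the only eligible replacement is `fdb-8`; ordinals 9 and above are never chosen.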

Is this a reasonable strategy? The benefits seem pretty clear to me: we get 9 coordinators at a lower cost than with the previous deployment strategy. From what I can see in the code, there shouldn’t be any issues with running a coordinator in the same process as a TLog, because the event loop priorities make coordinator requests take priority over almost everything else. Additionally, coordinators and TLogs are not really active at the same time.

Are there any downsides to this?

markus.pilman · January 11, 2022, 5:48pm

As you seem to know (and we should document this better), you should run with 9 coordinators if you use three_data_hall.
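One way to see why 9 is a comfortable number here is a rough quorum calculation. This sketch assumes coordinators are spread evenly across three data halls and that a strict majority of coordinators must stay reachable. The function names are made up for illustration, and the failure scenario (one hall plus one more coordinator) is an assumption about what the mode is meant to tolerate.

```python
def majority(total: int) -> int:
    """Smallest number of coordinators that forms a strict majority."""
    return total // 2 + 1


def survives(per_hall, lost_hall: int, extra_lost: int = 0) -> bool:
    """True if a majority remains after losing one hall plus extra coordinators."""
    total = sum(per_hall)
    remaining = total - per_hall[lost_hall] - extra_lost
    return remaining >= majority(total)


# 9 coordinators, 3 per hall: lose a whole hall and one more, 5 of 9 remain.
nine_ok = survives([3, 3, 3], lost_hall=0, extra_lost=1)

# 5 coordinators split 2/2/1: lose a hall of 2 and one more, only 2 of 5 remain.
five_ok = survives([2, 2, 1], lost_hall=0, extra_lost=1)
```

Under these assumptions `nine_ok` is true and `five_ok` is false: five coordinators survive the loss of a hall, but not one further failure on top of it.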

Running everything with unset is something that will simplify your topology and FDB will try to place the roles in a reasonable way. There are some drawbacks to this strategy though:

Every process will need a disk. You won’t know which processes will be recruited for storage/tlog and which will be stateless. If you have disks everywhere anyways, this might not be a problem. But if you use something like EBS, this could be very costly.
Every process will need the same amount of memory. This means, for example, that you can’t have high memory instances for storages and cheaper instances for stateless and tlog roles.
There’s a good chance that FDB will co-locate tlogs and storages in the same process (I am not 100% sure whether this is true).

ryanworl · January 12, 2022, 1:38pm

Yes, I am aware 9 is the intended number of coordinators for three_data_hall.

The net reduction of nodes here will save enough to make this a non-issue, but that could definitely be an important consideration for other users.
Our deployment model is “many small clusters”, so we’d probably just add more clusters instead of pursuing that kind of optimization.
This already happens in our current deployment, which is annoying in that sometimes resource utilization is not balanced, but it hasn’t been an issue otherwise.

Thanks for your help!

alexmiller · January 13, 2022, 3:50am

Any chance you’d be able to detail the motivations behind or requirements that drove this?

ryanworl · January 13, 2022, 1:59pm

Our workload is fairly easy to partition across multiple clusters and we generally prefer the strict isolation this provides. This isn’t really related to FDB at all because we deploy lots of things this way.
