@alexmiller @ajbeamon, going through this matrix again, a few questions came up that I had not thought about earlier: How does FDB decide how many SS to recruit in the cluster? Will it try to recruit one SS on each process class where it is allowed to do so (that is, everything except those with NeverFit)?
Specifically, will an SS be recruited on Log/Transaction classes too (since the fitness for the SS role on these classes is WorstFit)?
This is a great chart! It has made me sit and ponder why some of these are never instead of worst. I’m not actually sure why we ban cluster controllers on Transaction…
In 6.1 there will be two more roles/classes, for Data Distribution and Ratekeeper, as they’ve now been split off into separate roles from the master. (Thanks @jzhou!)
DataDistribution will recruit as many storage servers as it can from processes whose fitness for Storage is Unset or better. WorstFit class processes will only be used when there are no healthy storage teams left and we need to take drastic action to restore fault tolerance back to ReplicationFactor copies of data.
(The exact line you’re looking for is this one. Note that the process class check is missing in the criticalRecruitment case underneath it.)
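To make the rule above concrete, here is a minimal sketch in Python (not the actual FDB C++ code). The `Fitness` ordering, the function names, and the `healthy_teams` parameter are hypothetical, chosen only to mirror the terms used in this thread. It models the described behavior: Unset or better normally, with WorstFit admitted only once no healthy storage teams remain. It does not attempt to reproduce how the real critical-recruitment path treats process class.

```python
from enum import IntEnum


class Fitness(IntEnum):
    # Lower value means a better fit. This ordering is an assumption
    # made for illustration, based on the names used in the thread.
    BEST = 0
    GOOD = 1
    UNSET = 2
    OKAY = 3
    WORST = 4
    NEVER = 5


def eligible_for_storage(fitness, critical_recruitment):
    """Return True if a process with this Storage fitness may host an SS."""
    if fitness == Fitness.NEVER:
        return False
    if critical_recruitment:
        # Drastic measure: accept anything up to and including WORST.
        return fitness <= Fitness.WORST
    # Normal operation: only UNSET or better.
    return fitness <= Fitness.UNSET


def pick_storage_candidates(processes, healthy_teams):
    """processes maps a process name to its Storage fitness."""
    critical = healthy_teams == 0
    return [name for name, fit in processes.items()
            if eligible_for_storage(fit, critical)]
```

With a hypothetical cluster of one Storage-class process, one unset process, and one Log-class process, the Log-class process only becomes a candidate when `healthy_teams` is 0.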
Thanks, @alexmiller! Let me recheck the matrix against the code. I see that the code has been updated since I last checked (and I may have made some errors too). Specifically, I now see that process class TransactionClass is OkayFit for the ClusterController role.
[Edit: Updated the matrix in the top post. This time I generated it programmatically, so hopefully there are no errors now.]
Resolution being an OkayFit for LogRouter is actually confusingly correct. Log routers will only ever be recruited in remote datacenters, and resolvers will only ever be recruited in the primary datacenter, so processes dedicated to hold resolvers in remote datacenters might as well be used for log routers if we need more processes.
I don’t understand the lack of symmetry between master and cluster controller though. Why do we say Proxy is an OkayFit for cluster controller, but not for master? Why do we prefer to co-locate the master with a resolver on a Resolution class process, rather than co-locate it with the cluster controller? With data distributor split out from the master, the impact of hosting a master should be relatively low. @Evan, thoughts?
@gaurav, thanks again for building the table. The physical layout and coloring makes this much easier to see places where decisions might be weird. I wonder if we could maintain a similar table easily in the code or docs…
It would be very useful to have this in the docs in some form (esp. since things change between releases). At some point it will become out of date, so we need a way to keep it updated.
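One way this could be kept current is to render the table from a single mapping that is regenerated on each release. The sketch below is a hypothetical illustration in Python, not anything that exists in the FDB repo: `render_matrix` and the `known` mapping are made-up names, and the only entries filled in are the ones stated in this thread. Anything not listed is shown as `?` rather than guessed.

```python
def render_matrix(fitness, classes, roles):
    """Render a markdown table of class-by-role fitness values.

    fitness maps (process_class, role) to a fitness name.
    Pairs missing from the mapping are rendered as "?".
    """
    header = "| Class | " + " | ".join(roles) + " |"
    divider = "|" + " --- |" * (len(roles) + 1)
    rows = [header, divider]
    for cls in classes:
        cells = [fitness.get((cls, role), "?") for role in roles]
        rows.append("| " + cls + " | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Only the values mentioned in this thread; everything else stays unknown.
known = {
    ("Transaction", "ClusterController"): "OkayFit",
    ("Proxy", "ClusterController"): "OkayFit",
    ("Resolution", "LogRouter"): "OkayFit",
    ("Log", "Storage"): "WorstFit",
    ("Transaction", "Storage"): "WorstFit",
}

print(render_matrix(known,
                    ["Transaction", "Resolution"],
                    ["ClusterController", "LogRouter"]))
```

If the mapping were extracted from the source at build time, the rendered table could be checked into the docs and diffed between releases.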