We are running multiple FDB clusters on version 6.3.x, and we are planning to upgrade to 7.1.23. We are using the official Kubernetes operator.
As I understand it, the proxy role has been split in 7.x into grv_proxy and commit_proxy, and I am wondering whether there are guidelines on how to set these three role counts (proxies, grv_proxies, commit_proxies) in the FDB CRD so that the config is valid both before and after an upgrade.
Furthermore, I am also curious how we should scale the cluster for decent performance after the upgrade. We are running lots of proxies, and they have high CPU usage, but we don't know how this CPU usage will be split between the two new roles after the upgrade, so it is hard to know up front how to scale them.
We are running in a cloud environment, so the most prudent thing to do would perhaps be to:

1. Record the number of proxies on 6.3 as X.
2. Scale up to 2*X proxies on 6.3.
3. Configure the cluster to run X grv_proxies and X commit_proxies after the upgrade.
4. Upgrade the cluster.
5. Scale down grv_proxies and commit_proxies based on actual CPU usage.
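As a sketch, the steps above could be pre-staged in the cluster spec before the upgrade. The field names under databaseConfiguration here are my assumption of how the operator exposes the fdbcli configuration keys; verify them against your operator version's CRD reference. Using X = 8 as an example:

```yaml
# Hypothetical FoundationDBCluster spec fragment (field names assumed).
# While on 6.3, only `proxies` should take effect; the grv/commit
# counts are pre-staged so the config stays valid across the upgrade.
spec:
  version: 6.3.25
  databaseConfiguration:
    proxies: 16          # 2 * X, scaled up ahead of the upgrade
    grv_proxies: 8       # X; expected to be ignored on 6.3
    commit_proxies: 8    # X; expected to take effect on 7.1
```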
I think your plan works. In our experience you can keep the same total number of commit and GRV proxies as before, and their CPU usage should be lower than on 6.3. The reduction is more evident on the GRV proxies. As a data point, our largest cluster uses 4 GRV proxies.
7.x automatically calculates the number of proxies of each kind, splitting the legacy proxies count between grv and commit proxies at some ratio, with a cap of 4 grv proxies.
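To illustrate the shape of that split: the sketch below assumes, purely for illustration, a roughly 3:1 commit:grv ratio with GRV proxies capped at 4 — the actual ratio lives in the FDB source, so check DatabaseConfiguration there for the real formula.

```python
def split_proxies(proxies: int) -> tuple[int, int]:
    """Sketch of how 7.x might split a legacy `proxies` count.

    ASSUMPTION: a ~3:1 commit:grv ratio with GRV proxies capped at 4.
    The real formula is in the FDB source; this is only illustrative.
    """
    grv = max(1, min(4, proxies // 4))   # at least 1, at most 4 GRV proxies
    commit = proxies - grv               # the remainder become commit proxies
    return grv, commit

# e.g. a 6.3 cluster running proxies=34, as in the thread:
grv, commit = split_proxies(34)
print(grv, commit)  # 4 30
```

Under this assumed split, a heavily loaded cluster would end up with far fewer GRV proxies than commit proxies, which is why pinning explicit counts before the upgrade is attractive.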
To give some more context, our largest cluster is currently running 37 stateless pods, 34 proxies, and the stateless pods are using 70-90% CPU. This means we are probably vulnerable to performance problems if the default grv:commit ratio is even a bit off.
By looking at the code, am I right in assuming:

- On 6.3, grv_proxies and commit_proxies are ignored; only proxies is used.
- On 7.1, if defined, grv_proxies and commit_proxies are used and proxies is ignored. If not defined, proxies is used and split according to the default ratio.

Thus, by defining all three numbers and making sure they add up, the config will work fine both before and after a migration. Is that the right take?
I'm not sure whether adding 2x the proxies will help here or make the situation worse, given that in 6.3 all proxies have to communicate with each other. Is there another way for you to create a similarly sized cluster and run some benchmarks with the same or a similar workload?