How to set proxies, grv_proxies and commit_proxies during 6.3 -> 7.1 upgrade

We are running multiple FDB clusters on version 6.3.x and are planning to upgrade to 7.1.23. We are using the official Kubernetes operator.

As I understand it, the proxy role has been split in 7.x into grv_proxy and commit_proxy, and I am wondering whether there are guidelines on how to set these three role counts in the FDB CRD so that the configuration is valid both before and after an upgrade.

I am also curious how we should scale the cluster for decent performance after the upgrade. We are running a lot of proxies with high CPU usage, but we don’t know how that CPU usage will be split between the two new roles, so it is hard to know up front how to scale them.

We are running in a cloud environment, so the most prudent thing to do would perhaps be to:

  1. Record the number of proxies on 6.3 as X.
  2. Scale up to 2*X proxies on 6.3.
  3. Set up the cluster to run X grv_proxies and X commit_proxies after the upgrade (sketched below).
  4. Upgrade the cluster.
  5. Scale down grv_proxies and commit_proxies based on actual CPU usage.
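
To make steps 2 and 3 concrete, here is a rough sketch of the FoundationDBCluster manifests I have in mind. This is only an illustration: the API version, cluster name, patch versions, and counts (X = 8) are placeholders, and I’m assuming the operator exposes the role counts under databaseConfiguration with the snake_case keys grv_proxies and commit_proxies.

```yaml
# Step 2: still on 6.3, scale proxies up to 2*X (X = 8 here, purely illustrative).
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster
spec:
  version: 6.3.24            # placeholder for whatever 6.3.x patch we run
  databaseConfiguration:
    proxies: 16              # 2*X; 6.3 only reads this field
---
# Steps 3-4: pre-declare the 7.1 split and bump the version to trigger the upgrade.
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster
spec:
  version: 7.1.23
  databaseConfiguration:
    proxies: 16              # kept equal to grv_proxies + commit_proxies
    grv_proxies: 8           # X
    commit_proxies: 8        # X
```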

I think your plan works. Our experience is that you can keep the same total number of commit and GRV proxies as before, and their CPU usage should be lower than on 6.3. The reduction is more evident on the GRV proxies. To give you a data point, our largest cluster uses 4 GRV proxies.

7.x tries to automatically calculate the number of GRV and commit proxies from the proxies count, using some ratio between the two and a cap of 4 GRV proxies.
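
In CRD terms, that means you could upgrade with only proxies set and let 7.1 derive the split itself. A minimal fragment, assuming the operator keeps the snake_case databaseConfiguration keys and with a made-up count:

```yaml
databaseConfiguration:
  proxies: 16   # with grv_proxies/commit_proxies unset, 7.1 splits this itself,
                # reportedly capping the GRV share at 4
```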

I agree with @jzhou, and your post reminded me about this issue: Document role count interaction between proxies and grv/commit proxies · Issue #1180 · FoundationDB/fdb-kubernetes-operator · GitHub :slight_smile: Here is the interesting part of the code: fdb-kubernetes-operator/foundationdb_database_configuration.go at main · FoundationDB/fdb-kubernetes-operator · GitHub. You probably don’t have to scale the proxies to 2*X; you can just leave the number of proxies as it is and take the default grv:commit ratio. You can then try to optimize the setup by specifying exact numbers for the different roles. If the number of proxies is the same as the total of grv + commit proxies, the operator only needs to change the database configuration.
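
For example, once you later want to pin the split explicitly, a fragment along these lines (counts purely illustrative, keys assumed) would let the operator skip any pod changes, since the totals match:

```yaml
databaseConfiguration:
  proxies: 10            # leave the total as it is today
  grv_proxies: 4
  commit_proxies: 6      # 4 + 6 == 10, so only the database configuration changes
```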

Thank you both for responding.

To give some more context, our largest cluster is currently running 37 stateless pods and 34 proxies, and the stateless pods are using 70-90% CPU. This means we are probably vulnerable to performance problems if the default grv:commit ratio is even a bit off.

Looking at the code, am I right in assuming:

  • On 6.3, grv_proxies and commit_proxies are ignored; only proxies is used.
  • On 7.1, if defined, grv_proxies and commit_proxies are used and proxies is ignored; if not defined, proxies is used and split according to the default ratio.

Thus, by defining all three numbers and making sure they add up, the config will work fine both before and after the migration. Is that the right take?
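
Concretely, what I have in mind for our big cluster is something like the following sketch, using today’s 34 proxies and the 4-GRV data point mentioned above (and again assuming the snake_case keys):

```yaml
databaseConfiguration:
  proxies: 34            # what 6.3 reads
  grv_proxies: 4         # what 7.1 would read...
  commit_proxies: 30     # ...together with this; 4 + 30 == 34
```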

> To give some more context, our largest cluster is currently running 37 stateless pods and 34 proxies, and the stateless pods are using 70-90% CPU. This means we are probably vulnerable to performance problems if the default grv:commit ratio is even a bit off.

I’m not sure whether adding 2x the proxies will help here or make the situation worse, given that in 6.3 all proxies have to communicate with each other. Is there another way for you to create a similar-sized cluster and run some benchmarks with the same or a similar workload?

> Looking at the code, am I right in assuming:
>
>   • On 6.3, grv_proxies and commit_proxies are ignored; only proxies is used.
>   • On 7.1, if defined, grv_proxies and commit_proxies are used and proxies is ignored; if not defined, proxies is used and split according to the default ratio.
>
> Thus, by defining all three numbers and making sure they add up, the config will work fine both before and after the migration. Is that the right take?

Correct.