Upgrade never completes when storageServersPerPod > 1

Hey Folks.

(Bug report in lieu of GitHub issues)

I have storageServersPerPod set to 2 which gives me 12 storage processes spread across 6 pods.

When I attempt to upgrade from 7.1.21 → 7.1.27, the operator (1.14.0, not 1.4.0 as I originally wrote) gets stuck and reports
"expected 25 processes, got 31 processes ready to restart" as the reason for delaying the upgrade; it never gets beyond this.

Some brief maths later, I found that 25 was the number of pods I had running and 31 was the number of processes, the additional 6 coming from the doubled-up storage pods.

The code causing the upgrade to wait indefinitely is the check that compares the number of ready addresses against counts.Total().

It looks like it expects the number of ready addresses to line up exactly with the expected process count, but in my case there are more ready processes than it expects!
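To make the arithmetic concrete, here is a minimal, self-contained sketch of how a check like that gets stuck. The ProcessCounts shape and the log/stateless split (9 + 10) are my assumptions for illustration; only the 6 storage pods, the 25-pod total, and the 31 ready processes come from my cluster.

```go
package main

import "fmt"

// ProcessCounts is a simplified stand-in for the operator's desired process
// counts; the real struct has many more fields.
type ProcessCounts struct {
	Storage   int
	Log       int
	Stateless int
}

// Total sums the desired counts. Storage here is effectively a pod count,
// so the sum undercounts processes when storageServersPerPod > 1.
func (c ProcessCounts) Total() int {
	return c.Storage + c.Log + c.Stateless
}

func main() {
	counts := ProcessCounts{Storage: 6, Log: 9, Stateless: 10} // 25 pods in total
	storageServersPerPod := 2

	// Every storage pod reports storageServersPerPod processes, so the cluster
	// sees 6 extra addresses: 25 + 6 = 31 ready processes.
	readyProcesses := counts.Total() + (storageServersPerPod-1)*counts.Storage

	// The upgrade gate: the counts never match, so it waits forever.
	if readyProcesses != counts.Total() {
		fmt.Printf("expected %d processes, got %d processes ready to restart\n",
			counts.Total(), readyProcesses)
	}
}
```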

As a workaround, I reduced storageServersPerPod to 1, which brought the address count down to exactly 25, and the upgrade immediately progressed. Afterwards I scaled back up to 2 servers per pod, so this was feasible for me.

Thanks for reporting that issue! Operator version 1.4.0 should work without any issues, as the check is not present there. For operator version 1.14.0 and newer you are right: the counts.Total() method doesn't take storageServersPerPod into account. I opened a GitHub issue for that and we should be able to fix it soon.
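For illustration only (this is not the change made in the operator), the expected count has to account for the extra servers on each storage pod; a hypothetical helper showing the adjustment:

```go
package main

import "fmt"

// expectedProcesses is a hypothetical helper (not the operator's code): it
// turns a pod-based total into a process total by adding the extra servers
// that each storage pod runs when storageServersPerPod > 1.
func expectedProcesses(podTotal, storagePods, storageServersPerPod int) int {
	return podTotal + (storageServersPerPod-1)*storagePods
}

func main() {
	// 25 pods, 6 of them storage pods with 2 servers each -> 31 processes,
	// matching the 31 ready processes the operator actually sees.
	fmt.Println(expectedProcesses(25, 6, 2)) // prints 31
}
```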

Created this issue: Upgrade blocked when storageServersPerPod > 1 · Issue #1535 · FoundationDB/fdb-kubernetes-operator · GitHub

JFYI: We have a fix for this bug: Fix upgrade logic for multiple storage server per Pod by johscheuer · Pull Request #1538 · FoundationDB/fdb-kubernetes-operator · GitHub. Thanks again for reporting!