`PeerUnavailableForLongTime` after rolling to a new set of instances

We’re deploying FDB 7.1.23 on AWS, as a cluster of EC2 instances (set up via multiple autoscaling groups), with static ENIs (IPs) allocated to the instances.

When we want to make a change to the cluster structure, like increasing capacity, altering IOPS on the storage/log volumes, or changing the instance size, we do so by deploying a completely new set of instances for that class, joining them to the cluster, then excluding all the old instances of that class via fdbcli. We wait until all the data has been migrated to the new instances and the old ones are no longer active within the cluster, then we shut down the ASG/all those excluded instances. After that, just to keep things ‘tidy’, we include the previously-excluded instances so that our exclude list is empty.
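Roughly, the workflow looks like the sketch below (the addresses and port are placeholders, and we actually drive fdbcli from a wrapper script rather than typing this by hand):

```sh
# New ASG instances for the class being rolled have already joined the cluster.

# Exclude every old instance of that class; by default fdbcli's exclude blocks
# until it is safe, i.e. data has been moved off those processes.
fdbcli --exec "exclude 10.0.1.11:4500 10.0.1.12:4500 10.0.1.13:4500"

# Sanity check that the old processes no longer hold any roles or data.
fdbcli --exec "status details"

# Terminate the old ASG / power off the excluded instances (outside fdbcli).

# 'Tidy up' by clearing the exclusion list again.
fdbcli --exec "include all"
```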

This works absolutely fine: the cluster remains up throughout, and any performance hit from the data migration does not affect our application.

The problem is that, once the work is complete, we generally find `PeerUnavailableForLongTime` critical errors being spammed into the logs for the IP of one of the previously-excluded hosts, which has now been powered off. So… yes, the peer is unavailable, but it shouldn’t be an expected peer/part of the cluster any more.
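A rough way we double-check that (the IP below is a placeholder for one of the powered-off hosts) is just to grep the cluster’s own view of itself and confirm the exclusion list is empty again:

```sh
# Count any remaining references to the dead host in the cluster's status output.
fdbcli --exec "status json" | grep -c "10.0.1.12"

# With no arguments, exclude lists the currently excluded servers; this should be empty.
fdbcli --exec "exclude"
```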

If we saw this for every previous peer, I would think it was an indication we were doing something fundamentally wrong/broken. But we don’t: when rolling some 20-30 instances we generally get it for 1-2 of the old peers, and it’s only a similarly small number of the new/active peers doing the complaining.

This is annoyingly hard to reproduce. It seems to happen fairly reliably if we spin up a large enough cluster and roll a sizeable set of the instances, but we can’t identify a set of conditions that will always cause a specific host to emit the error.

Is it a bug that FDB reports `PeerUnavailableForLongTime` for hosts that it shouldn’t think are still part of the cluster? Is there something else we should be doing to get it to properly forget about those instances? Should we be leaving them excluded for long enough that any local knowledge of them clears out? The actual throughput we see on most of our clusters is pretty low, so rolling the log nodes only takes a few minutes, because there’s very little data to transfer.