Observing higher latency probes after disabling perpetual_storage_wiggle

While looking back at some old telemetry, we noticed an increase in the Worst Latency Probe metric that coincided exactly with the moment we disabled perpetual_storage_wiggle.

On Jan 16 at 10am we issued `configure perpetual_storage_wiggle=0` on our cluster, which had previously been running with perpetual_storage_wiggle=1. We made no other changes.

The average is notably higher for write (blue) and get read version (green), and the variance appears higher as well. Since then, latency has stayed steady at the new elevated level, with the same elevated variance.
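For context, here is a minimal sketch of how these readings can be pulled from `status json` (assuming fdbcli is on PATH and reachable from the client; the exact key names under cluster.latency_probe can vary by version, so verify them against your own output):

```python
import json
import subprocess

def latency_probe():
    raw = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    ).stdout
    # fdbcli may print informational lines before the JSON, so start at the first brace.
    status = json.loads(raw[raw.index("{"):])
    return status["cluster"].get("latency_probe", {})

# Typical keys include commit_seconds, read_seconds, and
# transaction_start_seconds (the GRV probe); print whatever is published.
for name, seconds in sorted(latency_probe().items()):
    print(f"{name}: {seconds * 1000:.2f} ms")
```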

Otherwise the database is working fine. But our app would benefit from those few extra msec if we can get them back.

Any tips? Do you recommend we return to having perpetual_storage_wiggle enabled?

Version info:

FoundationDB CLI 7.2 (v7.2.0)
source version 5eae3be195ee5c1302878459d3d1d34282b1ee60
protocol fdb00b072000000

This is interesting. I don’t see a direct connection between perpetual storage wiggle and write/GRV latencies. Perpetual storage wiggle makes storage servers more balanced and reduces “holes” in SQLite files. Write and GRV latencies measure transaction system performance, which is separate from the storage servers.

Perpetual storage wiggle introduces extra data movement, which issues a small number of transactions and should have minimal to no impact on write/GRV latencies. So I can’t explain why disabling the feature would make latencies worse.
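If it helps to quantify that background movement, here is a small sketch that reads cluster.data.moving_data out of `status json`; the field names shown (in_flight_bytes, in_queue_bytes, highest_priority) are assumptions based on the status schema I’m familiar with, so check them against your version’s output:

```python
import json
import subprocess

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, text=True, check=True).stdout
data = json.loads(raw[raw.index("{"):])["cluster"]["data"]

# moving_data reflects data distribution activity (including the wiggle's churn).
moving = data.get("moving_data", {})
print("in-flight bytes:", moving.get("in_flight_bytes"))
print("in-queue bytes:", moving.get("in_queue_bytes"))
print("highest priority:", moving.get("highest_priority"))
```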

We also see that our cluster’s generation increased at the same time we disabled perpetual_storage_wiggle. Is that expected?

That’s expected. perpetual_storage_wiggle is a database configuration, so modifying it triggers a transaction system recovery, thus increasing the cluster generation.
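A quick way to confirm that (assuming fdbcli is on PATH) is to read cluster.generation and cluster.recovery_state from `status json` before and after the change; the key names here are as I recall them from the 7.x schema, so verify against your output:

```python
import json
import subprocess

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, text=True, check=True).stdout
cluster = json.loads(raw[raw.index("{"):])["cluster"]

# The generation increments on each transaction system recovery.
print("generation:", cluster.get("generation"))
print("recovery state:", cluster.get("recovery_state", {}).get("name"))
print("description:", cluster.get("recovery_state", {}).get("description"))
```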

So then, could this scenario be occurring?

  1. A transaction system recovery starts.
  2. New GRV proxies, commit proxies, and tLogs are recruited as part of Phase 3: RECRUITING, described in this document.
  3. The new roles end up in new locations, and the random geographic placement of those roles across different “availability zones” within a cloud provider produces higher latency? (See the locality-checking sketch after this list.)
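To check where those roles actually landed, something like the following sketch lists each process’s transaction-system roles and locality from `status json`. The role names (grv_proxy, commit_proxy, log) and locality keys (zoneid, dcid) are assumptions based on the usual 7.x schema and on how your processes were started, so verify them against your own output:

```python
import json
import subprocess

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, text=True, check=True).stdout
processes = json.loads(raw[raw.index("{"):])["cluster"]["processes"]

# Transaction-system roles whose placement changes after a recovery.
interesting = {"grv_proxy", "commit_proxy", "log"}
for proc_id, proc in processes.items():
    roles = {r.get("role") for r in proc.get("roles", [])}
    hits = roles & interesting
    if hits:
        loc = proc.get("locality", {})
        print(proc.get("address"), sorted(hits),
              "zone:", loc.get("zoneid"), "dc:", loc.get("dcid"))
```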

During recovery (usually less than 5 seconds), GRV latency can increase. However, once the recovery reaches the accepting-commits state, GRV latency should be stable again.
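If you want to confirm that from the client side, a small sketch using the official Python bindings can time GRV round trips directly; api_version 720 is assumed to match the 7.2 cluster above, and the numbers are only illustrative:

```python
import time

import fdb

fdb.api_version(720)   # assumed to match the 7.2 cluster described above
db = fdb.open()        # uses the default cluster file

for _ in range(10):
    tr = db.create_transaction()
    start = time.monotonic()
    tr.get_read_version().wait()   # one GRV round trip
    print(f"GRV latency: {(time.monotonic() - start) * 1000:.2f} ms")
    time.sleep(1)
```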