FDB cluster rebalance endless loop

CodingSuen · August 1, 2022, 3:05am

fdb test env on ec2
storage process: 12
transaction process: 5
stateless process: 10
Redundancy mode: triple
storage engine ssd
version: 7.1.0
We use mako to do benchmark, mostly write tests.
Then we find something wired about rebalancing, fdb keep rebalancing even if we stop the test.
As I understandb, if there are 120G data, for 12 storage processes, each storage process should store 10G data, data volume should be more or less the same.
But for our production env, only two storage process do the “rebalance”, “rebalance” means the first storage process moves almost all datas to the second storage process, then the second storage process moves datas back to the the first storage process, like a endless loop…

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Desired Logs           - 5
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 27
  Zones                  - 16
  Machines               - 16
  Memory availability    - 7.3 GB per process on machine with least available
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 2 machines
  Server time            - 08/01/22 10:34:29

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 7.174 GB
  Sum of key-value sizes - 29.963 GB
  Disk space used        - 109.617 GB

Storage wiggle:
  Wiggle server addresses- 10.5.120.55:4500
  Wiggle server count    - 1

Operating space:
  Storage server         - 454.1 GB free on most full server
  Log server             - 82.7 GB free on most full server

Workload:
  Read rate              - 104 Hz
  Write rate             - 9 Hz
  Transactions started   - 23 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.5.120.44:4500       (  0% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.44:4501       (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.45:4500       (  2% cpu;  1% machine; 0.002 Gbps;  0% disk IO; 0.2 GB / 7.3 GB RAM  )
  10.5.120.45:4501       (  1% cpu;  1% machine; 0.002 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.46:4500       (  2% cpu;  1% machine; 0.022 Gbps;  7% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.46:4501       (  1% cpu;  1% machine; 0.022 Gbps;  4% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.47:4500       (  1% cpu;  3% machine; 0.001 Gbps;  6% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.47:4501       ( 11% cpu;  3% machine; 0.001 Gbps;  8% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.48:4500       (  1% cpu;  1% machine; 0.000 Gbps;  2% disk IO; 2.0 GB / 8.0 GB RAM  )
  10.5.120.49:4500       (  1% cpu;  1% machine; 0.000 Gbps;  9% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.49:4501       (  5% cpu;  1% machine; 0.000 Gbps;  8% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.50:4500       (  1% cpu;  1% machine; 0.000 Gbps;  2% disk IO; 2.0 GB / 8.0 GB RAM  )
  10.5.120.51:4500       (  1% cpu;  2% machine; 0.041 Gbps; 21% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.51:4501       (  4% cpu;  2% machine; 0.041 Gbps; 20% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.52:4500       (  3% cpu;  2% machine; 0.001 Gbps; 19% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.52:4501       (  2% cpu;  2% machine; 0.001 Gbps; 20% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.53:4500       (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.53:4501       (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.54:4500       (  2% cpu;  1% machine; 0.003 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.54:4501       (  1% cpu;  1% machine; 0.003 Gbps;  0% disk IO; 0.1 GB / 7.3 GB RAM  )
  10.5.120.55:4500       (  5% cpu;  3% machine; 0.023 Gbps; 37% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.55:4501       (  7% cpu;  3% machine; 0.023 Gbps; 34% disk IO;13.5 GB / 14.0 GB RAM  )
  10.5.120.56:4500       (  1% cpu;  0% machine; 0.000 Gbps;  1% disk IO; 2.0 GB / 8.0 GB RAM  )
  10.5.120.57:4500       (  1% cpu;  1% machine; 0.000 Gbps;  2% disk IO; 2.0 GB / 8.0 GB RAM  )
  10.5.120.58:4500       (  1% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.7 GB / 7.4 GB RAM  )
  10.5.120.58:4501       (  1% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 7.4 GB RAM  )
  10.5.120.59:4500       (  1% cpu;  1% machine; 0.000 Gbps;  2% disk IO; 2.0 GB / 8.0 GB RAM  )

Coordination servers:
  10.5.120.46:4500  (reachable)
  10.5.120.47:4500  (reachable)
  10.5.120.49:4500  (reachable)
  10.5.120.51:4500  (reachable)
  10.5.120.55:4500  (reachable)

Client time: 08/01/22 10:34:29

alexmiller · August 4, 2022, 4:44am

This looks like perpetual storage wiggle is enabled. Please see Perpetual Storage Wiggle — FoundationDB 7.1, and try running fdbcli> configure perpetual_storage_wiggle=0 to disable this behavior.

CodingSuen · August 4, 2022, 6:24am

fdb> configure ssd
ERROR: Storage engine type cannot be changed because storage_migration_type=disabled.
Type 'configure perpetual_storage_wiggle=1 storage_migration_type=gradual' to enable gradual migration with the perpetual wiggle, or 'configure storage_migration_type=aggressive' for aggressive migration.

We want to use ssd engine, as the return info:
1.configure perpetual_storage_wiggle=1 storage_migration_type=gradual
2.configure storage_migration_type=aggressive
We choosed the first way, because from offcial document, “gradual” is the recommended method for all production clusters.
https://apple.github.io/foundationdb/command-line-interface.html#storage-migration-type
So is it a normal case?

CodingSuen · August 5, 2022, 3:04am

Never mind, I figured out also thanks a lot.

Topic		Replies	Views
How to speed up balancing? Using FoundationDB performance	11	1538	August 21, 2019
Migrating from a large cluster to another Using FoundationDB	14	2417	November 6, 2018
Data Distribution Stopped - How to Restart? Using FoundationDB	13	1877	November 12, 2019
Why doesn't my cluster performance scale when I double the number of machines? Using FoundationDB performance	20	3317	August 17, 2018
Storage server running out of space Using FoundationDB	16	4050	October 2, 2019

FDB cluster rebalance endless loop

Related topics