How to speed up balancing?

The situation is:

We started with 5 nodes and wrote around 5T of data into the cluster (5T of disk space usage). Then we added two more nodes, which triggered rebalancing. But we noticed that the rebalancing is very slow even when there is no load at all. Is there a way to speed this process up?

BTW, we find that during this rebalancing process the cluster has degraded throughput. This is also why we want to speed it up.

Could you provide the output of this?

fdbcli --exec "status details"

This will show all the processes in the cluster and potentially expose where the bottleneck is.

Hi, we are currently seeing some strange behavior. After the rebalancing finished, we re-tested our 7-node cluster and got roughly half the OPS of our 5-node cluster. The average latency is also higher.

The result on the 5-node cluster is:
INSERT - OPS: 6693.0, Avg(us): 7736
UPDATE - OPS: 742.9, Avg(us): 9682
READ - OPS: 7433.6, Avg(us): 5164

The result on the 7-node cluster is:
INSERT - OPS: 3299.8, Avg(us): 14903
UPDATE - OPS: 367.0, Avg(us): 15218
READ - OPS: 3664.8, Avg(us): 12031

We are wondering what the possible reason could be. As for the status details:
( 42% cpu; 34% machine; 0.000 Gbps; 73% disk IO; 2.0 GB / 7.9 GB RAM )
( 40% cpu; 34% machine; 0.000 Gbps; 73% disk IO; 1.4 GB / 7.9 GB RAM )
( 32% cpu; 34% machine; 0.000 Gbps; 66% disk IO; 2.5 GB / 7.9 GB RAM )
( 23% cpu; 34% machine; 0.000 Gbps; 67% disk IO; 3.3 GB / 7.9 GB RAM )
( 25% cpu; 34% machine; 0.000 Gbps; 67% disk IO; 3.3 GB / 7.9 GB RAM )
( 24% cpu; 34% machine; 0.000 Gbps; 67% disk IO; 3.3 GB / 7.9 GB RAM )
( 29% cpu; 34% machine; 0.000 Gbps; 69% disk IO; 3.2 GB / 7.9 GB RAM )
( 28% cpu; 34% machine; 0.000 Gbps; 69% disk IO; 3.3 GB / 7.9 GB RAM )
( 40% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.1 GB / 7.9 GB RAM )
( 37% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 2.6 GB / 7.9 GB RAM )
( 31% cpu; 29% machine; 0.000 Gbps; 73% disk IO; 2.5 GB / 7.9 GB RAM )
( 32% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.3 GB / 7.9 GB RAM )
( 22% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.4 GB / 7.9 GB RAM )
( 25% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.3 GB / 7.9 GB RAM )
( 22% cpu; 29% machine; 0.000 Gbps; 66% disk IO; 3.3 GB / 7.9 GB RAM )
( 31% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.4 GB / 7.9 GB RAM )
( 42% cpu; 32% machine; 0.000 Gbps; 66% disk IO; 3.4 GB / 8.0 GB RAM )
( 41% cpu; 32% machine; 0.000 Gbps; 65% disk IO; 3.7 GB / 8.0 GB RAM )
( 30% cpu; 32% machine; 0.000 Gbps; 65% disk IO; 0.3 GB / 8.0 GB RAM )
( 21% cpu; 32% machine; 0.000 Gbps; 65% disk IO; 0.4 GB / 8.0 GB RAM )
( 26% cpu; 32% machine; 0.000 Gbps; 67% disk IO; 4.9 GB / 8.0 GB RAM )
( 35% cpu; 32% machine; 0.000 Gbps; 66% disk IO; 5.0 GB / 8.0 GB RAM )
( 27% cpu; 32% machine; 0.000 Gbps; 67% disk IO; 4.7 GB / 8.0 GB RAM )
( 28% cpu; 32% machine; 0.000 Gbps; 65% disk IO; 5.5 GB / 8.0 GB RAM )
( 42% cpu; 32% machine; 0.000 Gbps; 64% disk IO; 3.9 GB / 8.0 GB RAM )
( 41% cpu; 32% machine; 0.000 Gbps; 64% disk IO; 3.9 GB / 8.0 GB RAM )
( 31% cpu; 32% machine; 0.000 Gbps; 61% disk IO; 0.3 GB / 8.0 GB RAM )
( 31% cpu; 32% machine; 0.000 Gbps; 62% disk IO; 0.2 GB / 8.0 GB RAM )
( 30% cpu; 32% machine; 0.000 Gbps; 62% disk IO; 4.9 GB / 8.0 GB RAM )
( 27% cpu; 32% machine; 0.000 Gbps; 61% disk IO; 4.9 GB / 8.0 GB RAM )
( 24% cpu; 32% machine; 0.000 Gbps; 62% disk IO; 5.0 GB / 8.0 GB RAM )
( 36% cpu; 32% machine; 0.000 Gbps; 61% disk IO; 4.7 GB / 8.0 GB RAM )
( 34% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.5 GB / 7.9 GB RAM )
( 44% cpu; 29% machine; 0.000 Gbps; 65% disk IO; 3.4 GB / 7.9 GB RAM )
( 31% cpu; 29% machine; 0.000 Gbps; 67% disk IO; 0.3 GB / 7.9 GB RAM )
( 12% cpu; 29% machine; 0.000 Gbps; 67% disk IO; 0.2 GB / 7.9 GB RAM )
( 32% cpu; 29% machine; 0.000 Gbps; 68% disk IO; 4.3 GB / 7.9 GB RAM )
( 28% cpu; 29% machine; 0.000 Gbps; 59% disk IO; 4.0 GB / 7.9 GB RAM )
( 28% cpu; 29% machine; 0.000 Gbps; 59% disk IO; 4.3 GB / 7.9 GB RAM )
( 26% cpu; 29% machine; 0.000 Gbps; 60% disk IO; 4.4 GB / 7.9 GB RAM )
( 41% cpu; 32% machine; 0.000 Gbps; 73% disk IO; 3.1 GB / 7.9 GB RAM )
( 38% cpu; 32% machine; 0.000 Gbps; 73% disk IO; 3.7 GB / 7.9 GB RAM )
( 12% cpu; 32% machine; 0.000 Gbps; 72% disk IO; 0.6 GB / 7.9 GB RAM )
( 30% cpu; 32% machine; 0.000 Gbps; 74% disk IO; 0.2 GB / 7.9 GB RAM )
( 41% cpu; 32% machine; 0.000 Gbps; 70% disk IO; 4.4 GB / 7.9 GB RAM )
( 41% cpu; 32% machine; 0.000 Gbps; 71% disk IO; 4.3 GB / 7.9 GB RAM )
( 25% cpu; 32% machine; 0.000 Gbps; 70% disk IO; 4.5 GB / 7.9 GB RAM )
( 36% cpu; 32% machine; 0.000 Gbps; 70% disk IO; 4.7 GB / 7.9 GB RAM )
( 36% cpu; 28% machine; 0.000 Gbps; 62% disk IO; 2.7 GB / 8.0 GB RAM )
( 40% cpu; 28% machine; 0.000 Gbps; 62% disk IO; 2.4 GB / 8.0 GB RAM )
( 29% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 0.3 GB / 8.0 GB RAM )
( 6% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 0.3 GB / 8.0 GB RAM )
( 33% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 3.9 GB / 8.0 GB RAM )
( 25% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 4.1 GB / 8.0 GB RAM )
( 31% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 4.0 GB / 8.0 GB RAM )
( 24% cpu; 28% machine; 0.000 Gbps; 60% disk IO; 4.3 GB / 8.0 GB RAM )

As far as I can tell, there is no bottleneck.

This generally shouldn’t be the case, but I think folks had noted that, especially on platforms with a more limited IOPS budget (such as EBS), rebalancing was tuned too aggressively. There was a set of knob changes added for 6.1, intended to make rebalancing not impact throughput. What version of FDB are you running, and in what environment (i.e. cloud? bare metal?)?
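If you’re not sure which version you’re on, a quick check against the binary works (and status json should include version information as well):

fdbserver --version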

Can you pastebin the output of fdbcli --exec "status json" for me?

Hi, here is the result: https://drive.google.com/open?id=17GKDwQP1YcS4goUjoUVOb8KDfhfAjyxQ

As it is too big, I have to paste a link.

Can you answer this to provide an idea of what sort of disk you have backing your FDB?

Thanks!

Configuration

You’ve configured your database as logs=14 proxies=8 resolvers=2. When you expanded from 5 to 7 nodes, this would have automatically added 2 more logs and proxies that are already fighting for resources, so that probably explains the decrease in performance.

I’d suggest you instead do logs=3 proxies=3 resolvers=1. Ideally, one would run transaction log processes against a separate disk from storage servers, but with only 7 disks to work with, you’re in the uncomfortable situation where you have to have your logs and storage servers fight for IOPS. Reserving 3 nodes for logs in a 7 node cluster would be wasting too much space.
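For example, a minimal sketch of making that change via fdbcli (run against your existing cluster file; adjust the counts if you settle on a different layout):

fdbcli --exec "configure logs=3 proxies=3 resolvers=1"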

As for the resolver count, there’s almost never a reason to run with more than one resolver unless you’re doing an incredibly write heavy workload of many small keys.

If you happen to be running with a block storage service that’s already doing replication underneath FDB, then it’s also worth considering moving from triple to double.
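If you do decide to drop a replication level, that is also a one-line fdbcli change (shown only as a sketch; make sure the block storage underneath really is replicated before giving up a copy):

fdbcli --exec "configure double"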

Process Layout

You appear to be running process 0 and 1 as transaction, 2 as stateless, and 3-7 as storage. It appears that all instances on a host are pointed to the same disk? You’ll probably only see higher latency with no real gain from running multiple transaction logs per disk, as they’ll just fight for fsync()s. I’d suggest flipping this to process 0 as log, processes 1 and 2 as stateless, and 3-7 as storage.
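As a sketch, the suggested layout in foundationdb.conf would look roughly like this, assuming the default 4500-4507 port numbering per host (the ports are an assumption; only the class assignments matter, and class = transaction is the tlog class):

[fdbserver.4500]
# transaction log
class = transaction

[fdbserver.4501]
class = stateless

[fdbserver.4502]
class = stateless

[fdbserver.4503]
class = storage

[fdbserver.4504]
class = storage

[fdbserver.4505]
class = storage

[fdbserver.4506]
class = storage

[fdbserver.4507]
class = storage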

I’m also a bit surprised that you need 5 storage servers to saturate your disk. I think we’ve typically seen 2 or maybe 3 be all that’s needed to drive an SSD at 100%, and that’s without a transaction log running on the same disk. I’d suggest trying to reduce the number of storage servers you run per disk and seeing if the situation improves.

Hi, thanks for your advice, it’s really useful and helps me understand this better.

The environment is actually a cluster of Azure l8s_v2 VMs. Here is the related link: https://azure.microsoft.com/en-us/blog/announcing-the-general-availability-of-lsv2-series-azure-virtual-machines/

Basically we chose this machine because it has an NVMe disk with very high throughput, and naturally we put our data directory on that disk.

I will try your advice, but I’m wondering one thing. Since the documentation says it’s better to launch one fdbserver per core, I think I should launch 8 processes on every machine, so for 5 nodes I would have 40 processes. But with your suggestion, it looks like I only need 3 (logs) + 3 (proxies) + 1 (resolver) + 3*5 (storage) + 1 (cluster_controller) + 1 (ratekeeper) + 1 (data_distributor) + 1 (master) = 26 processes. Should I just waste the remaining cores?

Ah, NVMe. 5 storage servers per disk might be fine. FDB artificially limits itself to an IO parallelism of 31, IIRC, to avoid flooding SATA-attached SSDs. NVMe should theoretically be able to handle more storage servers per disk, but given that your cluster is already reporting a rather high ~70% IO utilization (and keep in mind that there are intentional pauses from fsync()s in there), it’s still worth testing a reduced number and seeing if you get better latency.

Essentially, yes. I mean, test that it actually improves performance first, but it’s not uncommon to end up stranding cores on oddly sized cloud machines. FDB is a database, so it should be more network and disk IO limited than CPU limited.

I would take the “one fdbserver per core” guideline as more of an upper bound than a lower bound. Running more than one fdbserver per core would be strictly worse, and running 48 fdbservers on a 48-core, one-disk machine would make no sense.

You could, rather harmlessly, run one fdbserver per core and leave all the extra ones as stateless, though. Once “Create an FDB caching role” is done, FDB could theoretically put them to good use, too.

Thanks for your explanation! I will try it.

Hi, I’ve configured two regions and started synchronizing them, but the synchronization is too slow. Disk usage is only about 1%-2%, and I’ve checked that the network bandwidth between the machines is at least 100 Mb/s.
Is there any way to speed up the synchronization?

The initial synchronization is done through data distribution, so you’d basically want to revert the knob changes in the PR I previously linked by overriding them on the command line, and potentially push even further in that direction. You’d then probably want to remove those overrides afterwards so that data movement in your cluster doesn’t impact your workloads.
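As a rough sketch of the mechanism (not of the specific values to use, which are whatever the linked PR changed): knob overrides can be passed to fdbserver as --knob_<name>=<value> on the command line, or set in the [fdbserver] section of foundationdb.conf. For example, with a couple of data-distribution knobs and deliberately unspecified placeholder values:

[fdbserver]
# Placeholder values only; restore whichever knobs the linked PR lowered.
knob_fetch_keys_parallelism_bytes = <pre-6.1 value>
knob_relocation_parallelism_per_source_server = <pre-6.1 value>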

Thanks for your reply. As I am only testing, I don’t mind if it impacts my workloads.
Based on your reply, I guess there is currently no official way to improve this. :frowning: