Over the weekend, one of our clusters started seeing a lot more load, and some of the storage servers are using ~75% CPU on average while others are mostly idle.
I already mentioned this briefly in another post: How accurate is readOpsPerSec when readSample is enabled?
This post, however, is focused on the apparent lack of rebalancing.
Using the hotrange command, I can see that we have two storage servers with very different read profiles:
fdb> hotrange 10.0.49.70:4505 bytes "" "\xff" 1
[
{
"begin" : "",
"bytes" : 647490485,
"end" : "\u00FF\u0002/<snip>",
"readBytesPerSec" : 7500,
"readOpsPerSec" : 6124100
}
]
fdb> hotrange 10.0.235.187:4503 bytes "" "\xff" 1
[
{
"begin" : "",
"bytes" : 946779588,
"end" : "\u00FF\u0002/<snip>",
"readBytesPerSec" : 453333.33333333331,
"readOpsPerSec" : 4803600
}
]
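For anyone who wants to compare the two outputs side by side, here is a minimal Python sketch. It assumes the hotrange JSON has been copied into strings by hand (the begin/end keys are left out because they are snipped above), and the helper name `summarize` is made up for illustration, not part of any FoundationDB tooling.

```python
import json

# Values copied from the two hotrange outputs above; begin/end keys omitted.
HOTRANGE_OUTPUT = {
    "10.0.49.70:4505": '[{"bytes": 647490485, "readBytesPerSec": 7500, "readOpsPerSec": 6124100}]',
    "10.0.235.187:4503": '[{"bytes": 946779588, "readBytesPerSec": 453333.33333333331, "readOpsPerSec": 4803600}]',
}

def summarize(raw):
    """Sum each metric over every range returned for one storage server."""
    ranges = json.loads(raw)
    return {
        key: sum(r[key] for r in ranges)
        for key in ("bytes", "readBytesPerSec", "readOpsPerSec")
    }

summaries = {addr: summarize(raw) for addr, raw in HOTRANGE_OUTPUT.items()}
for addr, metrics in summaries.items():
    print(addr, metrics)

# Ratio of reported read bandwidth between the two servers.
ratio = (summaries["10.0.235.187:4503"]["readBytesPerSec"]
         / summaries["10.0.49.70:4505"]["readBytesPerSec"])
print(f"readBytesPerSec ratio: {ratio:.1f}x")
```

With the numbers above, the reported readBytesPerSec differs by roughly 60x between the two servers, while readOpsPerSec is in the same ballpark on both.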
I would have expected fdb to rebalance the read-hot shards onto the storage servers that are mostly idle, but that is not happening.
One thing I suspect is that, because all data distribution modes are enabled, rebalance_disk is taking precedence over rebalance_read.
But so far I don’t have data to back this hypothesis.
Any advice or clues?