We recently upgraded from FoundationDB 5.2.x (custom build by @panghy) to the public release 6.2.7
One of our clusters has been stuck moving data and repartitioning for days now. The disks are showing erratic usage, and I have been digging through the trace logs, but I have not been able to figure out why it is near-constantly moving ~100 GB of data.
You can see here that the flat line on the left is 5.2.x, where even under load we are not moving much data (tens to hundreds of megabytes); once the upgrade completed, we are near-constantly moving hundreds of gigabytes.
Disk usage drastically changed around the time of the upgrade as well.
This is an internal test cluster, so we have pulled all load off of it to observe how it behaves. While the disk usage has calmed down quite a bit, we are still seeing a lot of data movement with no end in sight.
I have poked through the trace logs across the machines but haven’t found any smoking guns yet. What else should I be looking for here?
One additional strange note: if I re-run the command configure double ssd, the configuration is always reported as changed and the cluster goes back into reinitialization.
I think this is probably a result of a new feature called “machine teams”, which tries to form teams of hosts from which it then picks the teams of storage server processes. When you upgrade, it goes through a migration process, and that’s probably what you’re experiencing.
There’s a trace event called TeamCollectionInfo which I think should give a little insight into what it’s doing.
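If it helps, here’s a rough sketch of how I’d pull those events out for a quick look. It assumes the default XML trace format and the default /var/log/foundationdb log directory, so adjust the glob for your deployment:

```python
#!/usr/bin/env python3
# Rough sketch: pull recent TeamCollectionInfo events out of the FoundationDB
# trace logs so you can eyeball fields like TeamsToBuild over time.
# Assumes the default XML trace format and log directory (adjust LOG_GLOB).
import glob
import re

LOG_GLOB = "/var/log/foundationdb/trace.*.xml"  # assumption: default log location

def event_time(line):
    m = re.search(r'Time="([0-9.]+)"', line)
    return float(m.group(1)) if m else 0.0

events = []
for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as f:
        for line in f:
            if 'Type="TeamCollectionInfo"' in line:
                events.append(line.strip())

# Print the most recent few events, oldest first.
for line in sorted(events, key=event_time)[-5:]:
    print(line)
```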
This is from just one machine; I’ll go comb through the others. It looks like we’re at the total number of teams desired and TeamsToBuild is 0, though maybe something is stuck somewhere else. This has been ongoing for about two days now, though.
This event should only be present on the data distributor, so you shouldn’t see it elsewhere. My guess is that with a machine team count of 163 and desired machine team count of 115, it’s currently in the process of removing machine teams. I think you might see evidence of that with this event: MachineTeamRemover, or you could see if the current machine team count is changing over time.
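As a rough way to check both of those things at once, something like this sketch (same assumptions as before: XML traces, default log path on the data distributor, and that CurrentMachineTeams / DesiredMachineTeams show up as attributes on the events, which is what your output suggests) would show whether the machine team count is actually trending down and whether MachineTeamRemover is still firing:

```python
#!/usr/bin/env python3
# Sketch: track CurrentMachineTeams vs DesiredMachineTeams over time and count
# MachineTeamRemover events, to see whether the machine team count is still
# trending toward the desired number. Assumes XML trace logs in the default
# /var/log/foundationdb directory on the data distributor host.
import glob
import re

LOG_GLOB = "/var/log/foundationdb/trace.*.xml"  # assumption: default log location

samples = []        # (time, CurrentMachineTeams, DesiredMachineTeams)
remover_events = 0

for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as f:
        for line in f:
            if 'Type="MachineTeamRemover"' in line:
                remover_events += 1
            if 'CurrentMachineTeams="' in line:
                t = re.search(r'Time="([0-9.]+)"', line)
                cur = re.search(r'CurrentMachineTeams="(\d+)"', line)
                des = re.search(r'DesiredMachineTeams="(\d+)"', line)
                if t and cur and des:
                    samples.append((float(t.group(1)),
                                    int(cur.group(1)),
                                    int(des.group(1))))

samples.sort()
for t, current, desired in samples[-10:]:
    print(f"t={t:.0f}  CurrentMachineTeams={current}  DesiredMachineTeams={desired}")
print(f"MachineTeamRemover events seen: {remover_events}")
```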
There is also some rebalancing work happening in your case, which is probably being created by the machine teams activity. As the machine teams are created and destroyed, the balance gets a little off and you’ll have some background work to clean this up.
I think both of these things are expected to take some time, and there’s also unfortunately an issue where the rebalancing part of that can sometimes get stuck. We’ve recently identified one reason for that, which should be resolved in an upcoming patch release. At this point, though, it looks like your rebalancing is chugging along like it should be if there is rebalancing work to do.
I don’t see any more MachineTeamRemover events, and we’re still seeing a large amount of rebalancing. I’m going to continue to monitor into the weekend and we’ll see where we end up on Monday; right now it’s still pretty much status quo.
Rebalancing doesn’t happen super fast, so it can take a while if there’s a sizable imbalance. You can check StorageMetrics.BytesStored in your storage servers’ logs to see what the relevant imbalance looks like. You’ll probably continue to see this movement until they converge (or possibly you run into the issue where rebalancing gets stuck).
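A quick way to get a feel for that spread, sketched under the assumption that you’ve copied the trace logs from each host into one local directory (the ./traces/ path below is just a placeholder), is to take the latest BytesStored per trace file and compare:

```python
#!/usr/bin/env python3
# Sketch: take the most recent StorageMetrics BytesStored value from each
# storage server trace log and show the spread, as a rough proxy for how far
# apart the servers are. Assumes the XML trace logs have been gathered from
# each host into ./traces/ (placeholder path); one trace file per process.
import glob
import re

latest = {}  # trace file -> most recent BytesStored value seen in it

for path in glob.glob("./traces/trace.*.xml"):  # placeholder: wherever you gather logs
    with open(path, errors="replace") as f:
        for line in f:
            if 'Type="StorageMetrics"' in line:
                # Counter attributes may include rate info after the first
                # number, so only the leading integer is taken.
                m = re.search(r'BytesStored="(\d+)', line)
                if m:
                    latest[path] = int(m.group(1))  # later lines overwrite earlier ones

if latest:
    values = sorted(latest.values())
    gib = 1 << 30
    print(f"storage processes sampled: {len(values)}")
    print(f"min={values[0] / gib:.1f} GiB  max={values[-1] / gib:.1f} GiB  "
          f"spread={(values[-1] - values[0]) / gib:.1f} GiB")
```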
I think a lot of that is explainable by the machine team restructuring, which would lead to various imbalances as it tries to move things around. If CurrentMachineTeams == DesiredMachineTeams and it has stopped trying to remove teams, I would expect it to be moving mostly toward convergence now, unless some other effect (such as a workload) is unbalancing them.
Our workload hasn’t changed, and looking at the chart this has been ongoing since Tuesday. I’m trying to nail down exactly how this may impact other large clusters (as this is one of our larger test clusters).
We’ll need to improve our metrics-gathering pipeline to get deeper visibility into how this process works, so we can be 100% sure that “if we see it in the future, it means this.” With that said, how long is this process expected to take to kick off and settle in?
During typical workload we see about 250 MB of data to rebalance / move around, not ~150 GB that slowly drops down to a few GB and then spikes back up.
But if what you’re saying all lines up, then we’re now seeing CurrentMachineTeams="159" DesiredMachineTeams="115" in the trace logs, which could mean we’ll be seeing this for a while as it continues to work.
I think it can take several days (a week or so?), depending on the size of your cluster. It’s probably a reasonably good estimate to extrapolate from the rate your machine teams have been decreasing to when they’ll reach 115.
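For example, here is a back-of-the-envelope version of that extrapolation, using the 163 and 159 counts from this thread; the elapsed_days value is a placeholder you’d fill in from your own charts:

```python
# Back-of-the-envelope extrapolation: if machine teams dropped from 163 to 159
# over some observed window, estimate how long until they reach the desired 115.
start_teams = 163
current_teams = 159
desired_teams = 115
elapsed_days = 3.0  # placeholder: days between the 163 and 159 samples

rate_per_day = (start_teams - current_teams) / elapsed_days
remaining = current_teams - desired_teams
print(f"~{rate_per_day:.1f} machine teams removed/day, "
      f"roughly {remaining / rate_per_day:.0f} more days to reach {desired_teams}")
```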
The max shard size knob didn’t get ignored in 6.2, right? We were still on 100M (from the pre-5.x days) before the upgrade, I would assume. If it jumped to 500M, there would be a lot of disk activity.
Interestingly enough, we had thought of adjusting that knob, since we were actually at 250M on that cluster, and we decided to disable it completely (we had been moving data non-stop for days at that point, and the disk activity looked symptomatic of shard splitting). Spoiler: the knob change didn’t alleviate the situation, hah.
Ah, so I was really just asking what’s going on; this is actually a non-impacting issue (meaning it’s not affecting GRV latencies, so it’s certainly not exhausting IO), which is different from what changing the max shard size would do. Still worth investigating, of course, and it seems like it’s done doing it.