We recently upgraded a cluster from 5.2.x to 6.2.x (6.2.7 then 6.2.15) and since we use the memory engine extensively (as well as the SSD engine), we noticed the following happening:
- mem tier: FDBCLI latency increases 2-3x across all mem shards
- TCP TIME_WAIT increases across all instances
- mem tier: storage CPU increases 2x across all nodes
- mem tier: transaction CPU decreases by almost 2x across all mem shards (this is likely due to memory-2)
- ssd tier: storage CPU increases 2x
- ssd tier: transaction CPU decreases by almost 3x
- ssd and mem: master/CC CPU usage decreases by roughly half (~0.5x, likely from the JSON serialization improvements)
The most alarming aspect of the upgrade is that CPU time on the storage servers increased significantly. It shows up as hot spots, where a subset of processes is pegged at 5s while the others are fine. The hot spots also shift between different storage processes over time, so there are always some SSes deemed unreachable because of their CPU load.
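In case it's useful, here's a rough sketch (not our exact tooling) of how the hot processes can be picked out of `status json`; the field names below are what 6.2's status output reports for us, but verify them against your own cluster:

```python
# Rough sketch: list storage processes sorted by CPU, from `status json`.
# Field names (cpu.usage_cores, address, roles) are what we see in 6.2's
# status output; verify against your own cluster before relying on them.
import json
import subprocess

status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))

rows = []
for p in status["cluster"]["processes"].values():
    roles = sorted(r["role"] for r in p.get("roles", []))
    if "storage" not in roles:
        continue
    cpu = p.get("cpu", {}).get("usage_cores", 0.0)
    rows.append((cpu, p.get("address", "?"), roles))

# Highest CPU first; a fully pegged single-threaded fdbserver shows up near 1.0.
for cpu, address, roles in sorted(rows, reverse=True):
    print(f"{cpu:5.2f}  {address:22s}  {roles}")
```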
From what I can tell, memory-1 and memory-2 share the same underlying storage, so my hunch is that something changed in data distribution, perhaps splitting or merging shards too aggressively now.
For a visual look at what is going on pre/post upgrade in terms of SS CPU in a test cluster (note that we have dedicated processes for master/CC/logs/proxies/resolvers so these are just SS role processes):
There were some data distribution changes in recent patch releases, so if you have a lot of movement that’s not machine team work and wasn’t there before, that would be interesting. You could check the MovingData event, which lists movements queued by priority and can tell you what type of movements you’re doing now.
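For example, something along these lines can summarize the MovingData events from the trace files of the process running data distribution. This is just a sketch; the attribute names (InFlight, InQueue, Priority*) are assumptions based on the trace files I've looked at and vary between versions, so adjust them to whatever actually appears in your logs:

```python
# Rough sketch: pull MovingData events out of the trace logs and show what is
# in flight / queued, broken down by the per-priority counters.
# Attribute names below are assumptions and may differ in your FDB version.
import glob
import xml.etree.ElementTree as ET

for path in sorted(glob.glob("/var/log/foundationdb/trace.*.xml")):
    with open(path) as f:
        for line in f:
            if "MovingData" not in line:
                continue
            try:
                ev = ET.fromstring(line)
            except ET.ParseError:
                continue  # skip non-event lines (file header/footer)
            if ev.get("Type") != "MovingData":
                continue
            priorities = {k: v for k, v in ev.attrib.items() if k.startswith("Priority")}
            print(ev.get("Time"), "InFlight=" + str(ev.get("InFlight")),
                  "InQueue=" + str(ev.get("InQueue")), priorities)
```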
This happened after we upgraded from 5.2 to 6.2.7.
We didn't see changes in data movement for the memory storage engine; the following chart measures the total number of in-flight bytes and in-queue bytes:
Hm, one discovery is that the SS nodes that are hot are now also assigned the proxy role. There must be a change in how roles are assigned (I need to check the code; we did not change the number of transaction-class processes in the cluster).
Some details about this test cluster: we have 14/14/29 (P/R/L) configured and 58 processes with class=transaction, so everything should fit and every machine has at least 2 of these processes. In the past the P/R/L roles would be kept on those processes, unless proxy/resolver recruitment now biases away from class=transaction processes and prefers storage-class processes.
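To double-check where the roles actually landed, here's a small sketch against `status json` (role and class names are as 6.2 reports them for us; adjust as needed):

```python
# Rough sketch: count where proxy/resolver/log roles were recruited, grouped
# by the process class, using `status json`. Field names (class_type, roles)
# match what we see in 6.2's status output.
import json
import subprocess
from collections import Counter

status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))

placement = Counter()
for p in status["cluster"]["processes"].values():
    cls = p.get("class_type", "unset")
    for r in p.get("roles", []):
        if r["role"] in ("proxy", "resolver", "log"):
            placement[(r["role"], cls)] += 1

# e.g. ('proxy', 'storage'): 3 would mean three proxies landed on storage-class processes
for (role, cls), count in sorted(placement.items()):
    print(f"{role:9s} on class={cls:12s}: {count}")
```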
I looked into it, and it appears this did change between 5.2 and 6.0. As far as I can tell, though, it’s not documented even in the release notes, which is a bit of an oversight. The other natural place for something like this would be the upgrade notes: https://apple.github.io/foundationdb/administration.html#upgrading-from-5-2-x.
I’ve created an issue to add this change to the documentation: