We recently upgraded a cluster from 5.2.x to 6.2.x (first 6.2.7, then 6.2.15). Since we use the memory engine extensively (as well as the SSD engine), we noticed the following after the upgrade:
- mem tier: fdbcli latency increases 2-3x across all mem shards (a rough client-side probe for this is sketched after the list)
- TCP TIME_WAIT counts increase across all instances
- mem tier: Storage CPU increases 2x across all nodes
- mem tier: Transaction CPU drops by roughly half across all mem shards (likely due to memory-2)
- ssd tier: Storage CPU increases 2x
- ssd tier: Transaction CPU drops to roughly a third
- ssd and mem tiers: Master/CC CPU usage drops by roughly half (likely from the JSON serialization improvements)
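For reference, the latency observation can be reproduced from any client with something like the following. This is a minimal sketch, assuming the Python `fdb` bindings and a default cluster file; the probe key is arbitrary and hypothetical, not part of our actual measurement setup:

```python
import time
import fdb

fdb.api_version(620)  # matches the 6.2.x cluster
db = fdb.open()       # default cluster file

# Time 100 single-key reads and report client-observed percentiles.
samples = []
for _ in range(100):
    start = time.monotonic()
    db[b'\x00latency-probe']  # value (or None) is discarded; we only time the round trip
    samples.append(time.monotonic() - start)

samples.sort()
print("p50: %.1f ms  p99: %.1f ms" % (samples[49] * 1000, samples[98] * 1000))
```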
The most alarming aspect of the upgrade is that CPU time on the storage servers increased significantly. It shows a hot-spot pattern: a subset of processes is pegged at 5s while the others are fine. The hot spot, however, shifts between storage processes over time, so at any given moment some SSes are deemed unreachable because of their CPU load.
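To spot which storage processes are hot at a given moment, something along these lines works against `status json` (the field names are as we understand them in 6.2, and the 0.9-core threshold is arbitrary):

```python
import json
import subprocess

out = subprocess.check_output(["fdbcli", "--exec", "status json"])
status = json.loads(out)

# Walk all processes, keep the ones with a storage role and high CPU.
for proc_id, proc in status["cluster"]["processes"].items():
    roles = {r["role"] for r in proc.get("roles", [])}
    cores = proc.get("cpu", {}).get("usage_cores", 0.0)
    if "storage" in roles and cores > 0.9:
        print(proc["address"], "%.2f cores" % cores)
```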
From what I can tell, memory-1 and memory-2 share the same underlying storage engine, so my hunch is that something changed in data distribution and it is now splitting or merging shards too aggressively.
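If that theory is right, the cluster-wide shard count should be churning. A minimal watcher, again assuming the `status json` field names (`partitions_count` and `moving_data` under `cluster.data`):

```python
import json
import subprocess
import time

# Poll status json every 10s; a steadily oscillating shard count or
# persistent in-flight relocation bytes would point at split/merge churn.
while True:
    status = json.loads(
        subprocess.check_output(["fdbcli", "--exec", "status json"]))
    data = status["cluster"]["data"]
    print(time.strftime("%H:%M:%S"),
          "shards:", data.get("partitions_count"),
          "in-flight bytes:", data.get("moving_data", {}).get("in_flight_bytes"))
    time.sleep(10)
```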