With apologies for the now-familiar introduction, we’re evaluating FoundationDB for some new use cases and are doing an initial load test. We’re running the FDB Kubernetes operator in a three_data_hall configuration on AWS EKS with i3en.xlarge nodes. We’re observing a strange phenomenon where the placement of our resolver processes within the cluster dramatically affects their CPU usage.
When we change our database configuration (e.g. adding new storage processes), the various stateless roles tend to get redistributed among our processes with stateless process classes. When that happens, our resolvers can move to new Kubernetes nodes, and that can lead to enormous swings in resolver CPU utilization even under a consistent write load. Here’s one example:
In the case pictured here, we added some new commit proxies (not pictured), and that triggered a role shuffle. The resolvers moved, and their average CPU utilization jumped from ~50% to ~85% despite no change in overall load. The new level persisted until the next shuffle.
For context, here’s our database configuration:
database_configuration = {
  storage        = 75
  logs           = 6
  commit_proxies = 6
  grv_proxies    = 4
  resolvers      = 2
}
The resolver CPU utilization swings are a big deal for us because the resolvers are really the fundamental limiting factor for our use case. If our cluster-wide write capacity can get cut in ~half by some weird roll of the dice, that’s a biiiiiig problem.
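To put a rough number on "cut in ~half", here is a back-of-the-envelope sketch. It assumes resolver CPU scales linearly with write load and that a resolver saturates at 100% utilization; both are simplifying assumptions on my part, not something we have measured. The function name is made up for illustration.

```python
def capacity_multiple(utilization: float) -> float:
    """How many times the current load a resolver could absorb before
    hitting 100% CPU, ASSUMING CPU grows linearly with write load."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


low_util = 0.50   # resolver CPU before the role shuffle (from the graph)
high_util = 0.85  # resolver CPU after the role shuffle (from the graph)

before = capacity_multiple(low_util)   # headroom at ~50% CPU
after = capacity_multiple(high_util)   # headroom at ~85% CPU

print(f"capacity before shuffle: {before:.2f}x current load")
print(f"capacity after shuffle:  {after:.2f}x current load")
print(f"remaining fraction:      {after / before:.2f}")
```

Under those assumptions the same cluster goes from roughly 2x headroom to roughly 1.18x, i.e. it keeps only about 59% of its previous maximum write throughput.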
I haven’t yet found any discussion of how resolver placement within a cluster might affect CPU utilization, and am wondering if there’s an obvious cause for this behavior. Without much knowledge of a resolver’s internals, it’s not obvious to me why placement would affect CPU utilization in particular. So far, we’ve observed the low-utilization state when both resolvers were in one particular availability zone (let’s call it az-1-a) and the high-utilization state when both were in a different availability zone (az-1-f).
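For anyone who wants to check the same correlation on their own cluster, here is a minimal sketch of how resolver CPU could be grouped by locality from the output of `status json`. The field names (`cluster.processes`, `roles[].role`, `cpu.usage_cores`, `locality.<key>`) are how I understand the status document to be laid out, so please treat them as assumptions and verify them against your own output. The function name and the default `data_hall` key are just choices for this illustration.

```python
from collections import defaultdict


def resolver_cpu_by_locality(status: dict, locality_key: str = "data_hall") -> dict:
    """Group per-process CPU usage (in cores) of resolver processes
    by one locality attribute of the process.

    ASSUMPTION: `status` is the parsed output of `status json` and uses
    the field names described above. Missing fields are skipped or
    reported as "unknown" rather than raising.
    """
    groups = defaultdict(list)
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        # A single process can hold several roles at once.
        roles = {r.get("role") for r in proc.get("roles", [])}
        if "resolver" not in roles:
            continue
        where = proc.get("locality", {}).get(locality_key, "unknown")
        cores = proc.get("cpu", {}).get("usage_cores")
        if cores is not None:
            groups[where].append(cores)
    return dict(groups)
```

One way to use it is to save the status with `fdbcli --exec 'status json' > status.json` before and after a role shuffle, load each file with `json.load`, and compare the two results. Passing `locality_key="zoneid"` or `"machineid"` instead would show whether the effect follows a specific node rather than a whole availability zone.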
My first wild guess is that this could be a network latency thing; do resolvers benefit from proximity to some other process? My other wild guess is that there’s something different about the hardware in one AWS availability zone versus another, and I’ll investigate that in parallel. In the meantime, is this a familiar/expected phenomenon?
