Significant changes in CPU load on resolver processes depending on placement in a cluster

With apologies for the now-familiar introduction, we’re evaluating FoundationDB for some new use cases and are doing an initial load test. We’re running the FDB Kubernetes operator in a three_data_hall configuration on AWS EKS with i3en.xlarge nodes. We’re observing a strange phenomenon where the placement of our resolver processes within the cluster dramatically affects their CPU usage.

When we change our database configuration (e.g. adding new storage processes), the various stateless roles tend to get redistributed among our processes with stateless process classes. When that happens, our resolvers can move to new Kubernetes nodes, and that can lead to enormous swings in resolver CPU utilization even under a consistent write load. Here’s one example:

In the case pictured here, we added some new commit proxies (not pictured), and that triggered a role shuffle. The resolvers moved, and their average CPU utilization jumped from ~50% to ~85% despite no changes in overall load. That change was durable and lasted until the next shuffle.

For context, here’s our database configuration:

database_configuration = {
  storage        = 75
  logs           = 6
  commit_proxies = 6
  grv_proxies    = 4
  resolvers      = 2
}

The resolver CPU utilization swings are a big deal for us because the resolvers are really the fundamental limiting factor for our use case. If our cluster-wide write capacity can get cut in ~half by some weird roll of the dice, that’s a biiiiiig problem.

I haven’t yet found any discussion about how resolver placement within a cluster might affect CPU utilization, and am wondering if there’s an obvious cause for this behavior. Without much knowledge of a resolver’s internals, it’s not obvious to me how placement would affect CPU utilization in particular. So far, we’ve observed that the low-utilization configuration happened when the resolvers were both in one particular availability zone (let’s call that az-1-a), and the high-utilization configuration happened when both were in a different availability zone (az-1-f).

My first wild guess is that this could be a network latency thing; do resolvers benefit from proximity to some other process? My other wild guess is that there’s something different about the hardware in one AWS availability zone versus another, and I’ll investigate that in parallel. In the meantime, is this a familiar/expected phenomenon?

Following up on my own question: it looks like this is a classic “noisy neighbor” problem. In the cases where resolver processes have elevated CPU load, they share a Kubernetes node (emphasis on node, not pod or process!) with some other heavily loaded storage or log process. It sounds like experimenting with CPU affinity or static CPU allocation might help, or simply making sure resolvers (and proxies?) land on their own uncontested nodes. I’ll follow up if this turns out to be productive (although this round of load testing is coming to a close, so follow-up results may take a while!).

In case it’s helpful to anybody else in the future, I cooked up a very simple Python script to help diagnose issues like these:

import json
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit("Usage: describe-cluster-topology.py STATUS_JSON_FILE")

    # Load the output of fdbcli's "status json" command.
    with open(sys.argv[1]) as status_json_file:
        status_json = json.load(status_json_file)

    cluster = status_json.get("cluster")

    if not cluster:
        raise ValueError("Status did not contain a 'cluster' section")

    machines = cluster.get("machines")

    if not machines:
        raise ValueError("Cluster did not contain a 'machines' section")

    # Group machine IDs by their data hall / availability zone locality.
    machines_by_az = {}

    for machine_id, machine in machines.items():
        az = machine["locality"]["data_hall"]
        machines_by_az.setdefault(az, []).append(machine_id)

    processes = cluster.get("processes")

    if not processes:
        raise ValueError("Cluster did not contain a 'processes' section")

    # For each process, record its address per machine along with its roles and CPU usage.
    processes_by_machine = {}
    roles_by_process = {}
    cpu_by_process = {}

    for process in processes.values():
        address = process["address"]
        machine_id = process["machine_id"]

        processes_by_machine.setdefault(machine_id, []).append(address)
        cpu_by_process[address] = process["cpu"]["usage_cores"]

        for role in process["roles"]:
            roles_by_process.setdefault(address, []).append(role["role"])

    # Print the topology: AZ -> machine -> process (with CPU usage) -> roles.
    for az in machines_by_az:
        print(f"- {az}")

        for machine in machines_by_az[az]:
            print(f"  - {machine}")

            # Some machines may have no processes
            if machine in processes_by_machine:
                for process in processes_by_machine[machine]:
                    cpu_utilization = cpu_by_process[process]

                    print(f"    - {process} (CPU: {cpu_utilization})")

                    # Some processes may have no roles
                    if process in roles_by_process:
                        for role in roles_by_process[process]:
                            print(f"      - {role}")