Significant changes in CPU load on resolver processes depending on placement in a cluster

It looks like this is a classic “noisy neighbor” problem. Whenever resolver processes show elevated CPU load, they share a Kubernetes node (emphasis on node, not pod or process!) with some other heavily loaded storage or log process. Experimenting with CPU affinity or static CPU allocation might help, as might simply making sure resolvers (and proxies?) go on their own uncontested nodes. I’ll follow up if this turns out to be productive (although the current round of load testing is coming to a close, so follow-up results may come later!).

In case it’s helpful to anybody else in the future, I cooked up a very simple Python script to help diagnose issues like these:

import json
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit("Usage: describe-cluster-topology.py STATUS_JSON_FILE")

    with open(sys.argv[1]) as status_json_file:
        status_json = json.load(status_json_file)

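    # Use .get() so a missing section reaches the explicit error below
    # instead of failing with a bare KeyError.
    cluster = status_json.get("cluster")

    if not cluster:
        raise ValueError("Status did not contain a 'cluster' section")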

    # Machines are keyed by machine ID; .get() keeps the check below meaningful
    machines = cluster.get("machines")

    if not machines:
        raise ValueError("Cluster did not contain a 'machines' section")

    machines_by_az = {}

    for machine in machines:
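        # Not every cluster sets a data_hall locality; fall back to a placeholder
        az = machines[machine].get("locality", {}).get("data_hall", "(no data_hall)")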

        if az not in machines_by_az:
            machines_by_az[az] = []

        machines_by_az[az].append(machine)

    # Processes are keyed by process ID; .get() keeps the check below meaningful
    processes = cluster.get("processes")

    if not processes:
        raise ValueError("Cluster did not contain a 'processes' section")

    processes_by_machine = {}
    roles_by_process = {}
    cpu_by_process = {}

    for process in processes:
        address = processes[process]["address"]
        machine_id = processes[process]["machine_id"]

        if machine_id not in processes_by_machine:
            processes_by_machine[machine_id] = []

        processes_by_machine[machine_id].append(address)

        cpu_by_process[address] = processes[process]["cpu"]["usage_cores"]

        for role in processes[process]["roles"]:
            if address not in roles_by_process:
                roles_by_process[address] = []

            roles_by_process[address].append(role["role"])

    for az in machines_by_az:
        print(f"- {az}")

        for machine in machines_by_az[az]:
            print(f"  - {machine}")

            # Some machines may have no processes
            if machine in processes_by_machine:
                for process in processes_by_machine[machine]:
                    cpu_utilization = cpu_by_process[process]

                    print(f"    - {process} (CPU: {cpu_utilization})")

                    # Some processes may have no roles
                    if process in roles_by_process:
                        for role in roles_by_process[process]:
                            print(f"      - {role}")
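
If reading the full tree by eye gets tedious, the same status fields can be used to flag only the suspicious machines. The following is a rough sketch rather than part of the script above: the function name `find_contested_resolvers`, the `HEAVY_ROLES` set, and the `cpu_threshold` default of 0.5 cores are all made up for illustration, and it assumes (as the script's grouping does) that `machine_id` identifies the shared node and that roles are reported as "resolver", "storage", and "log".

```python
# Hypothetical helper: names and the threshold are illustrative assumptions.
# Role names are assumed to match the "role" strings in the status JSON.
HEAVY_ROLES = {"storage", "log"}


def find_contested_resolvers(status_json, cpu_threshold=0.5):
    """Return {machine_id: {"resolvers": [...], "neighbors": [...]}}.

    A machine is reported when it hosts at least one resolver and at least
    one storage or log process using cpu_threshold cores or more.
    Each list entry is an (address, usage_cores) tuple.
    """
    processes = status_json.get("cluster", {}).get("processes", {})
    by_machine = {}

    # Group (address, cpu, roles) tuples by the machine they run on
    for process in processes.values():
        roles = {role["role"] for role in process.get("roles", [])}
        cpu = process.get("cpu", {}).get("usage_cores", 0.0)
        entry = (process["address"], cpu, roles)
        by_machine.setdefault(process["machine_id"], []).append(entry)

    contested = {}

    for machine_id, entries in by_machine.items():
        resolvers = [(addr, cpu) for addr, cpu, roles in entries
                     if "resolver" in roles]
        # Busy storage/log processes that are not themselves resolvers
        neighbors = [(addr, cpu) for addr, cpu, roles in entries
                     if roles & HEAVY_ROLES
                     and "resolver" not in roles
                     and cpu >= cpu_threshold]

        if resolvers and neighbors:
            contested[machine_id] = {"resolvers": resolvers,
                                     "neighbors": neighbors}

    return contested
```

To try it, load the status file with `json.load` as in the script above, pass the resulting dictionary to `find_contested_resolvers`, and print the result. The threshold probably needs tuning for any given workload.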