Significant changes in CPU load on resolver processes depending on placement in a cluster

jon · March 7, 2025, 4:19pm

This looks like a classic “noisy neighbor” problem. Whenever resolver processes show elevated CPU load, they share a Kubernetes node (emphasis on node, not pod or process!) with some other heavily loaded storage or log process. Experimenting with CPU affinity or static CPU allocation might help, as might simply making sure resolvers (and proxies?) land on their own uncontested nodes. I’ll follow up if this turns out to be productive (although the current round of load testing is coming to a close, so follow-up results may come later!).

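In case it’s helpful to anybody else in the future, I cooked up a very simple Python script to help diagnose issues like these. It takes a file containing fdbcli’s status json output and prints every machine, grouped by data hall, along with the processes on it, their CPU usage, and their roles: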

import json
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit("Usage: describe-cluster-topology.py STATUS_JSON_FILE")

    with open(sys.argv[1]) as status_json_file:
        status_json = json.load(status_json_file)

    cluster = status_json.get("cluster")

    if not cluster:
        raise ValueError("Status did not contain a 'cluster' section")

    machines = cluster.get("machines")

    if not machines:
        raise ValueError("Cluster did not contain a 'machines' section")

    machines_by_az = {}

    for machine in machines:
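        # The data hall stands in for the availability zone; fall back if it is unset
        az = machines[machine].get("locality", {}).get("data_hall", "unknown")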

        if az not in machines_by_az:
            machines_by_az[az] = []

        machines_by_az[az].append(machine)

    processes = cluster.get("processes")

    if not processes:
        raise ValueError("Cluster did not contain a 'processes' section")

    processes_by_machine = {}
    roles_by_process = {}
    cpu_by_process = {}

    for process in processes:
        address = processes[process]["address"]
        machine_id = processes[process]["machine_id"]

        if machine_id not in processes_by_machine:
            processes_by_machine[machine_id] = []

        processes_by_machine[machine_id].append(address)

        cpu_by_process[address] = processes[process]["cpu"]["usage_cores"]

        for role in processes[process]["roles"]:
            if address not in roles_by_process:
                roles_by_process[address] = []

            roles_by_process[address].append(role["role"])

    for az in machines_by_az:
        print(f"- {az}")

        for machine in machines_by_az[az]:
            print(f"  - {machine}")

            # Some machines may have no processes
            if machine in processes_by_machine:
                for process in processes_by_machine[machine]:
                    cpu_utilization = cpu_by_process[process]

                    print(f"    - {process} (CPU: {cpu_utilization})")

                    # Some processes may have no roles
                    if process in roles_by_process:
                        for role in roles_by_process[process]:
                            print(f"      - {role}")
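
If reading the full tree by eye gets tedious, the same status JSON can be filtered down to just the suspicious machines. This is a rough sketch rather than anything battle-tested: the function name find_shared_machines is made up for illustration, and the default role names ("resolver" as the sensitive role, "storage" and "log" as the heavy ones) are assumptions you may need to adjust for your FoundationDB version. It reads the same fields as the script above (machine_id, address, roles, cpu.usage_cores).

```python
from collections import defaultdict


def find_shared_machines(status_json,
                         sensitive_roles=("resolver",),
                         heavy_roles=("storage", "log")):
    """Return {machine_id: [(address, roles, cpu_cores), ...]} for every
    machine that hosts at least one sensitive role and one heavy role.

    The role names are assumptions; pass different tuples if needed.
    """
    processes = status_json.get("cluster", {}).get("processes", {})

    # Group every process by the machine (Kubernetes node) it runs on
    by_machine = defaultdict(list)
    for process in processes.values():
        roles = sorted(role["role"] for role in process.get("roles", []))
        cpu = process.get("cpu", {}).get("usage_cores")
        by_machine[process.get("machine_id")].append(
            (process.get("address"), roles, cpu)
        )

    # Keep only machines where both kinds of role are present
    shared = {}
    for machine_id, entries in by_machine.items():
        roles_on_machine = {role for _, roles, _ in entries for role in roles}
        if (roles_on_machine & set(sensitive_roles)
                and roles_on_machine & set(heavy_roles)):
            shared[machine_id] = entries

    return shared
```

Calling find_shared_machines(status_json) on the parsed status gives a dictionary you can print or sort by CPU usage. Note that it flags a machine even when a single process holds both kinds of role, which is probably still worth knowing about.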
