Coordinator performance

Hi folks!
I’m running an FDB cluster on 30 servers with 96 CPUs/workers each. I noticed that the coordinator worker is pegged at 100% CPU much of the time, and after profiling I see that ClusterGetStatusActor is consuming the majority of CPU cycles.

The profile shows multiple concurrent calls to latestEventOnWorkers() for different event types:

MachineMetrics

ProcessMetrics

NetworkMetrics

TraceFileOpenError

ProgramStart

etc.

I noticed that this can be tuned with server knobs like status_min_time_between_requests, but I can't find any recommended values (or I might be reading the documentation wrong).

So the question is: what would be the recommendation?

No great recommendation (from me, anyway). AI found this, where we document O(500) cores/processes: Known Limitations — FoundationDB ON documentation

That said, if you can provide a more detailed profile, maybe there is something we can do here.
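
For a sense of scale, here is a rough back-of-the-envelope sketch in Python using only the numbers from this thread. It assumes (not confirmed here) that each latestEventOnWorkers() call contacts every worker once per status request. The function name status_fanout is made up for illustration, and nothing in it talks to FoundationDB.

```python
# Numbers quoted in this thread; this script does not query a cluster.
SERVERS = 30
WORKERS_PER_SERVER = 96
DOCUMENTED_SCALE = 500  # "O(500) cores/processes", an order of magnitude, not a hard limit

# Event types named in the profile. The list ended with "etc.",
# so the real count is at least this many.
EVENT_TYPES = [
    "MachineMetrics",
    "ProcessMetrics",
    "NetworkMetrics",
    "TraceFileOpenError",
    "ProgramStart",
]


def status_fanout(servers, workers_per_server, event_types):
    """Hypothetical model: one call per event type, each reaching every worker once."""
    processes = servers * workers_per_server
    fetches = processes * len(event_types)
    return processes, fetches


processes, fetches_per_status = status_fanout(SERVERS, WORKERS_PER_SERVER, EVENT_TYPES)
ratio = processes / DOCUMENTED_SCALE

print(f"processes in cluster: {processes}")
print(f"event fetches per status request (lower bound): {fetches_per_status}")
print(f"size relative to documented scale: {ratio:.2f}x")
```

Under that assumption, the cluster described above is several times larger than the documented scale, and every status request multiplies the worker count by the number of event types collected.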
