FoundationDB

getMachineLoad and calculation of CPUSeconds for MachineMetrics


(Jesse Bennett) #1

I’m trying to address an issue right now with the MachineMetrics coming out negative in FreeBSD. I found a couple of problems with the code I have so far (format is different, need to poll all cores) but just need to know if I’m looking at the right thing. From what I can tell testing in Docker and going through the code the CPUSeconds for MachineMetrics are calculated by this:

getMachineLoad(clockIdleTime, clockTotalTime);
returnStats.machineCPUSeconds = clockTotalTime - (*statState)->lastClockTotalTime != 0 ? ( 1 - ((clockIdleTime - (*statState)->lastClockIdleTime) / ((double)(clockTotalTime - (*statState)->lastClockTotalTime)))) * returnStats.elapsed : 0; 

Pulling this up in Docker for VSCode I can see on my VM 4 cores are polled and values are totaled:

I’m getting a negative 3 digit number as time progresses but those numbers as uint64_t do increment over time slightly since they are a metric of time. The test code which I use (which is now just on 1 core still for simplicity’s sake) to see this in a console application is as follows:

util = (100ULL * cur[CP_USER] / (sum ? (double)sum : 1.0L));
printf("cpu user utilization: %7.3f\n", util);
util = (100ULL * cur[CP_NICE] / (sum ? (double)sum : 1.0L));
printf("cpu nice utilization: %7.3f\n", util);
util = (100ULL * cur[CP_SYS] / (sum ? (double)sum : 1.0L));
printf("cpu sys utilization: %7.3f\n", util);
util = (100ULL * cur[CP_IDLE] / (sum ? (double)sum : 1.0L));
printf("cpu idle utilization: %7.3f\n", util);

printf("User in uint64_t: %ju\n", (uint64_t)(cur[CP_USER] * 100ULL));
printf("Nice in uint64_t: %ju\n", (uint64_t)(cur[CP_NICE] * 100ULL));
printf("System in uint64_t: %ju\n", (uint64_t)(cur[CP_SYS] * 100ULL));
printf("Idle in uint64_t: %ju\n", (uint64_t)(cur[CP_IDLE] * 100ULL));

The modifier at the end is just to make it equivalent to /proc/stat and there are some other methods to make them the same but I notice that the numbers are incrementing and “system” is too large- the output is:

cpu user utilization: 0.161
cpu nice utilization: 0.000
cpu sys utilization: 0.591
cpu idle utilization: 98.974
User in uint64_t: 28900
Nice in uint64_t: 0
System in uint64_t: 106100
Idle in uint64_t: 17782400

cpu user utilization: 0.161
cpu nice utilization: 0.000
cpu sys utilization: 0.591
cpu idle utilization: 98.974
User in uint64_t: 28900
Nice in uint64_t: 0
System in uint64_t: 106100
Idle in uint64_t: 17782400

cpu user utilization: 0.161
cpu nice utilization: 0.000
cpu sys utilization: 0.591
cpu idle utilization: 98.973
User in uint64_t: 29000
Nice in uint64_t: 0
System in uint64_t: 106200
Idle in uint64_t: 17796600

Alright, so you get the idea and working on fixing up the number conversion so it’s better but the ratios of the CPU are accurate and have definitely tested to see if that’s correct.

My question is- should CPUSeconds usually come out as 0 and how? The way I read the code is that it compares the time values and if they are different then output the calculated length of time between ProcessMetrics calls. Wouldn’t that always normally output a value? Also, is this the correct place to be testing things? I’ve tried modifying a few things here and there and seem to get the same negative 3 digit number.


(Jesse Bennett) #2

I believe I may have a possible cause of the issue:

So it looks like ticks and jiffies aren’t being converted properly. Also, different kernel versions have different 0 paddings applied to them:

The Docker image gives me this info for uname:

root@48ceb58f7a7c:/home/applefdbvol/bin# uname -a
Linux 48ceb58f7a7c 4.9.87-linuxkit-aufs #1 SMP Wed Mar 14 15:12:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


(Jesse Bennett) #3

Ok, I solved it the cause of the problem was that Idle just wasn’t being added into the total but using ‘ticks’ just like Darwin does is fine with this equation after all. For this section of the log it’s reading:

Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00392" MbpsSent="0.100129" MbpsReceived="0.133521" OutSegs="579" RetransSegs="0" CPUSeconds="0.110759" TotalMemory="8117676" CommittedMemory="1649936" AvailableMemory="6467740" ZoneID="57dcf28053018043d1374767f01da056" MachineID="57dcf28053018043d1374767f01da056" Machine="192.168.64.2:4689" LogGroup="default" TrackLatestType="Original"

Idle just got left out of the total so it was going into a large negative territory. There are some other things in the logs related to the recent huge allocations but that is for another forum topic.