Up until now I have been running the tests against a local 4-process fdb cluster, and I wanted to check what difference going over the network introduces.
Unfortunately, I had to change multiple factors at the same time, so it is difficult to pinpoint what is going on, but the results were exactly the opposite of what I was expecting:
The client and local cluster run on this machine:
- i7-6700K @ 4.0 GHz, 32 GB RAM
- Windows 10, .NET 4.7.2
- FDB 5.1.7, .NET binding
- Samsung SM951 (MZHPV512HDGL)
For comparison, I have also set up 2 VMs running 2 processes each, on Ubuntu 18.04, hosted by Hyper-V, with 1 Gbps Ethernet in between.
The test simulates a fixed throughput (24,000 messages per minute) and measures the end-to-end latency of each message dispatch. It runs for 2 minutes. The request rate is fixed so I’m not running the cluster above its limits (I think).
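For context, the test is roughly an open-loop probe shaped like the sketch below (simplified, not the actual harness; `LatencyProbe`, `RunAsync` and `dispatchAsync` are illustrative names). Scheduling against an ideal timeline keeps the offered rate fixed even when an individual dispatch is slow:

```csharp
// Simplified sketch of an open-loop, fixed-rate latency probe (illustrative only).
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

static class LatencyProbe
{
    // dispatchAsync stands in for the real "message dispatch" under test.
    public static async Task<double[]> RunAsync(
        Func<Task> dispatchAsync, int ratePerMinute, TimeSpan duration)
    {
        var latenciesMs = new ConcurrentBag<double>();
        // 24,000 msg/min => one request every 2.5ms on the ideal timeline.
        var interval = TimeSpan.FromTicks(TimeSpan.FromMinutes(1).Ticks / ratePerMinute);
        var pending = new List<Task>();
        var clock = Stopwatch.StartNew();
        long i = 0;

        while (clock.Elapsed < duration)
        {
            // Schedule against the ideal timeline so a slow dispatch does not delay the next one.
            var next = TimeSpan.FromTicks(interval.Ticks * i++);
            var wait = next - clock.Elapsed;
            if (wait > TimeSpan.Zero) await Task.Delay(wait);

            pending.Add(Task.Run(async () =>
            {
                long start = Stopwatch.GetTimestamp();   // QPC-based, high resolution
                await dispatchAsync();                   // the operation being measured
                long end = Stopwatch.GetTimestamp();
                latenciesMs.Add((end - start) * 1000.0 / Stopwatch.Frequency);
            }));
        }

        await Task.WhenAll(pending);
        return latenciesMs.ToArray();
    }
}
```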
Latency over time
The test running entirely on localhost (Win -> Win):
The same test but using the remote cluster (Win -> Linux):
Looking at the latency distributions for both runs, it is clear that:
- The Linux server (and/or the network?) is more consistent with its timings than the Windows server
- The Windows server implementation seems to be all over the place with its timings.
Latency distributions
LOCALHOST (Win->Win)
- Min/Max: 27.7ms .. 566.2ms
- Average: 77.9ms (+/-29.2ms)
- Median : 73.3ms (+/-10.4ms)
- Distrib: (48.2ms) - 61.4ms =[ 73.3ms ]= 88.4ms - (118.2ms)
___________________________________________________________________________________________________________________________________
|____[ Min , Max )____|___Count____|__Percent____________________________________________________|___Cumulative________|__Weight____|
| 25.0ms - 30.0ms | 14 | 0.037% ` | 0.037% `--------- | 385.0ms |
| 30.0ms - 35.0ms | 62 | 0.162% . | 0.199% `--------- | 2.02s |
| 35.0ms - 40.0ms | 207 | 0.541% ; | 0.740% [--------- | 7.76s |
| 40.0ms - 45.0ms | 594 | 1.553% $ | 2.292% [--------- | 25.2s |
| 45.0ms - 50.0ms | 1,602 | 4.187% ##. | 6.480% )--------- | 76s |
| 50.0ms - 60.0ms | 5,889 | 15.392% #######X | 21.872% --[------- | 324s |
| 60.0ms - 70.0ms | 8,403 | 21.963% ########### | 43.835% ----(----- | 546s |
| 70.0ms - 80.0ms | 7,050 | 18.427% #########: | 62.262% ------[--- | 529s |
| 80.0ms - 90.0ms | 5,796 | 15.149% #######x | 77.412% -------)-- | 493s |
| 90.0ms - 100.0ms | 3,568 | 9.326% ####X | 86.738% --------)- | 339s |
| 100.0ms - 120.0ms | 3,479 | 9.093% ####= | 95.831% ---------) | 383s |
| 120.0ms - 140.0ms | 905 | 2.365% #: | 98.197% ---------] | 118s |
| 140.0ms - 160.0ms | 257 | 0.672% ; | 98.868% ---------] | 38.6s |
| 160.0ms - 180.0ms | 123 | 0.321% : | 99.190% ---------] | 20.9s |
| 180.0ms - 200.0ms | 64 | 0.167% . | 99.357% ---------] | 12.2s |
| 200.0ms - 250.0ms | 49 | 0.128% . | 99.485% ---------] | 11.0s |
| 250.0ms - 300.0ms | 91 | 0.238% . | 99.723% ---------> | 25.0s |
| 300.0ms - 350.0ms | 51 | 0.133% . | 99.856% ---------> | 16.6s |
| 350.0ms - 400.0ms | 26 | 0.068% ` | 99.924% ---------> | 9.75s |
| 400.0ms - 450.0ms | 6 | 0.016% ` | 99.940% ---------> | 2.55s |
| 450.0ms - 500.0ms | 7 | 0.018% ` | 99.958% ---------> | 3.33s |
| 500.0ms - 600.0ms | 16 | 0.042% ` | 100.000% ---------@ | 8.80s |
|_____________________|____________|_____________________________________________________________|_____________________|____________|
REMOTE (Win->Linux)
- Min/Max: 9.0ms .. 345.8ms
- Average: 49.0ms (+/-16.6ms)
- Median : 47.7ms (+/-7.3ms)
- Distrib: (31.3ms) - 38.2ms =[ 47.7ms ]= 56.6ms - (71.0ms)
___________________________________________________________________________________________________________________________________
|____[ Min , Max )____|___Count____|__Percent____________________________________________________|___Cumulative________|__Weight____|
| 8.0ms - 9.0ms | 1 | 0.003% ` | 0.003% `--------- | 8.5ms |
| 9.0ms - 10.0ms | 17 | 0.044% ` | 0.046% `--------- | 161.5ms |
| 10.0ms - 12.0ms | 57 | 0.147% . | 0.193% `--------- | 627.0ms |
| 12.0ms - 14.0ms | 70 | 0.180% . | 0.373% `--------- | 910.0ms |
| 14.0ms - 16.0ms | 43 | 0.111% . | 0.484% `--------- | 645.0ms |
| 16.0ms - 18.0ms | 52 | 0.134% . | 0.618% [--------- | 884.0ms |
| 18.0ms - 20.0ms | 47 | 0.121% . | 0.739% [--------- | 893.0ms |
| 20.0ms - 25.0ms | 210 | 0.541% ; | 1.279% [--------- | 4.73s |
| 25.0ms - 30.0ms | 358 | 0.922% = | 2.201% [--------- | 9.85s |
| 30.0ms - 35.0ms | 4,331 | 11.149% #####x | 13.350% -(-------- | 141s |
| 35.0ms - 40.0ms | 7,020 | 18.071% ######### | 31.421% ---[------ | 263s |
| 40.0ms - 45.0ms | 1,846 | 4.752% ##+ | 36.173% ---)------ | 78s |
| 45.0ms - 50.0ms | 10,091 | 25.976% ############# | 62.149% ------[--- | 479s |
| 50.0ms - 60.0ms | 7,561 | 19.464% #########X | 81.612% --------[- | 416s |
| 60.0ms - 70.0ms | 5,123 | 13.188% ######x | 94.800% ---------| | 333s |
| 70.0ms - 80.0ms | 785 | 2.021% # | 96.821% ---------) | 58.9s |
| 80.0ms - 90.0ms | 605 | 1.557% $ | 98.378% ---------] | 51.4s |
| 90.0ms - 100.0ms | 284 | 0.731% + | 99.109% ---------] | 27.0s |
| 100.0ms - 120.0ms | 177 | 0.456% : | 99.565% ---------> | 19.5s |
| 120.0ms - 140.0ms | 75 | 0.193% . | 99.758% ---------> | 9.75s |
| 140.0ms - 160.0ms | 40 | 0.103% . | 99.861% ---------> | 6.00s |
| 160.0ms - 180.0ms | 3 | 0.008% ` | 99.869% ---------> | 510.0ms |
| 250.0ms - 300.0ms | 35 | 0.090% ` | 99.959% ---------> | 9.63s |
| 300.0ms - 350.0ms | 16 | 0.041% ` | 100.000% ---------@ | 5.20s |
|_____________________|____________|_____________________________________________________________|_____________________|____________|
But then, looking at the peaks of the Linux test, at ~30ms, ~45ms and ~60ms, it looks like they always fall on a multiple of 15ms.
My spider sense is tingling, because 15ms is the default timeslice of the Windows scheduler (and the system time resolution). I’ve seen 15.x ms and 31ms come up so many times in naive benchmarks that use the system clock on Windows that I reject these values by default.
My tests already use QPC as the time source, so the issue does not come from the benchmark framework itself, but it could come from some other component. There are other “media” applications running, so the OS timer should already be at 1000 Hz (1ms resolution for timers), but the system clock will still have a ~15ms resolution.
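To make the clock question concrete, here is a small self-contained diagnostic sketch (not part of the test framework): it prints the Stopwatch/QPC resolution, the step at which `DateTime.UtcNow` actually advances, and the current OS timer resolution via ntdll’s `NtQueryTimerResolution` (an undocumented but commonly used export; values are in 100ns units):

```csharp
// Rough diagnostic sketch: compare the QPC-based Stopwatch resolution with the step
// of the system clock, and query the current OS timer resolution. Illustrative only.
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class ClockCheck
{
    // Undocumented ntdll export; resolutions are reported in 100ns units.
    [DllImport("ntdll.dll")]
    static extern int NtQueryTimerResolution(out uint min, out uint max, out uint current);

    static void Main()
    {
        Console.WriteLine($"Stopwatch.IsHighResolution = {Stopwatch.IsHighResolution}");
        Console.WriteLine($"Stopwatch tick = {1e9 / Stopwatch.Frequency:F1} ns");

        // Spin for ~500ms and record the largest jump of DateTime.UtcNow between two reads.
        long last = DateTime.UtcNow.Ticks;
        double maxStepMs = 0;
        var sw = Stopwatch.StartNew();
        while (sw.ElapsedMilliseconds < 500)
        {
            long now = DateTime.UtcNow.Ticks;
            if (now != last)
            {
                maxStepMs = Math.Max(maxStepMs, (now - last) / (double)TimeSpan.TicksPerMillisecond);
                last = now;
            }
        }
        // Typically ~15.6ms on a default Windows box, ~1ms if something called timeBeginPeriod(1).
        Console.WriteLine($"Largest DateTime.UtcNow step = {maxStepMs:F1} ms");

        if (NtQueryTimerResolution(out var min, out var max, out var cur) == 0)
        {
            Console.WriteLine($"OS timer resolution: current = {cur / 10000.0:F2} ms " +
                              $"(min = {min / 10000.0:F2} ms, max = {max / 10000.0:F2} ms)");
        }
    }
}
```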
So:
- either this is a coincidence and 15ms just happens to be a natural cycle length for my test (the alignment check sketched below could help rule that out),
- or the Windows client implementation uses the system clock somewhere as a source for timers?
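If the raw latency samples are available as an array of milliseconds, a rough way to separate the two hypotheses would be to histogram each sample’s distance to the nearest multiple of the 15.625ms tick (the `TickAlignment` name and signature below are made up for illustration):

```csharp
// Hypothetical post-processing step: given raw latency samples in milliseconds,
// histogram each sample's distance to the nearest multiple of the default Windows
// timer tick (15.625ms). A strong spike near 0 would point to quantization by the
// system clock; a flat spread would suggest the ~15ms spacing is coincidental.
using System;
using System.Linq;

static class TickAlignment
{
    public static void Report(double[] latenciesMs, double tickMs = 15.625, int bins = 16)
    {
        var histogram = new int[bins];
        double halfTick = tickMs / 2;
        foreach (var ms in latenciesMs)
        {
            double remainder = ms % tickMs;
            double offset = Math.Min(remainder, tickMs - remainder); // 0 .. halfTick
            int bin = Math.Min((int)(offset / halfTick * bins), bins - 1);
            histogram[bin]++;
        }

        int peak = Math.Max(1, histogram.Max());
        for (int i = 0; i < bins; i++)
        {
            double lowerMs = i * halfTick / bins;
            Console.WriteLine($"{lowerMs,5:F2}+ ms from tick | {new string('#', histogram[i] * 50 / peak)}");
        }
    }
}
```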
Unfortunately the test app is not running on .NET Core yet, so I can’t replicate the tests with a Linux client to disprove this hypothesis…
Does anyone have any idea what could cause these N x 15ms latency modes in the Linux test?