Why doesn't my cluster performance scale when I double the number of machines?

I configured the cluster with resolvers=4 and re-ran the same fdbserver test (3 min, 50,000 tps).

fdbtop:

ip               port    cpu%  mem%  iops     net    class        roles
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.28.174    4500    60    4     -        92     test
                  4501    60    3     -        93     test
                  4502    59    3     -        93     test
                  4503    59    3     -        93     test
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.32.157    4500    69    7     20194    21     storage      storage
                  4501    74    6     20211    19     storage      storage
                  4502    0     3     -        0      stateless
                  4503    1     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.32.74     4500    66    13    1542     210    log          log
                  4501    0     4     -        0      stateless
                  4502    0     3     -        0      stateless
                  4503    16    4     -        10     stateless    cluster_controller
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.33.171    4500    67    15    16682    19     storage      storage
                  4501    70    18    16682    19     storage      storage
                  4502    0     3     -        0      stateless
                  4503    1     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.33.172    4500    76    21    17100    22     storage      storage
                  4501    72    18    17124    22     storage      storage
                  4502    1     3     -        0      stateless
                  4503    1     2     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.34.155    4500    50    19    15794    5      storage      storage
                  4501    49    18    15793    5      storage      storage
                  4502    0     3     -        0      stateless
                  4503    1     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.35.133    4500    89    10    23528    10     storage      storage
                  4501    70    7     23528    7      storage      storage
                  4502    46    3     -        152    proxy        proxy
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.36.35     4500    88    9     2529     283    log          log
                  4501    0     4     -        0      stateless
                  4502    0     3     -        0      stateless
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.37.131    4500    59    20    16143    10     storage      storage
                  4501    52    18    16078    12     storage      storage
                  4502    2     5     -        0      stateless
                  4503    1     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.37.98     4500    91    9     16941    20     storage      storage
                  4501    80    6     16932    15     storage      storage
                  4502    80    3     -        265    proxy        proxy
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.38.195    4500    76    8     19611    21     storage      storage
                  4501    71    6     19624    20     storage      storage
                  4502    0     3     -        0      stateless
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.38.34     4500    94    9     4279     283    log          log
                  4501    55    5     -        53     stateless    resolver
                  4502    0     3     -        0      stateless
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.39.157    4500    78    8     17692    13     storage      storage
                  4501    77    6     17701    13     storage      storage
                  4502    21    3     -        4      stateless    master
                  4503    41    3     -        29     stateless    resolver
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.39.184    4500    82    8     19657    21     storage      storage
                  4501    77    6     19785    21     storage      storage
                  4502    0     2     -        0      stateless
                  4503    0     2     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.39.85     4500    37    10    670      71     log          log
                  4501    1     4     -        0      stateless
                  4502    1     2     -        0      stateless
                  4503    1     2     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.40.18     4500    58    8     18546    11     storage      storage
                  4501    73    6     18554    12     storage      storage
                  4502    0     3     -        0      stateless
                  4503    28    3     -        26     stateless    resolver
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.42.96     4500    43    19    16580    4      storage      storage
                  4501    48    20    16589    4      storage      storage
                  4502    77    7     -        271    proxy        proxy
                  4503    1     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.44.149    4500    55    18    17122    10     storage      storage
                  4501    64    20    17122    10     storage      storage
                  4502    69    7     -        227    proxy        proxy
                  4503    1     2     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.46.120    4500    70    16    16762    17     storage      storage
                  4501    82    15    16785    21     storage      storage
                  4502    46    3     -        40     stateless    resolver
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.47.158    4500    74    7     17540    15     storage      storage
                  4501    64    6     17535    16     storage      storage
                  4502    0     3     -        0      stateless
                  4503    0     3     -        0      stateless
---------------  ------  ----  ----  -------  -----  -----------  --------------------
 172.31.47.4      4500    61    24    14587    17     storage      storage
                  4501    65    19    14551    17     storage      storage
                  4502    1     3     -        0      stateless
                  4503    2     12    -        0      stateless

Results:

setting up test (Benchmark)...
running test...
Benchmark complete
checking tests...
fetching metrics...
Metric (0, 0): Measured Duration, 135.000000, 135
Metric (0, 1): Transactions/sec, 12494.688889, 1.25e+04
Metric (0, 2): Operations/sec, 74968.133333, 7.5e+04
Metric (0, 3): A Transactions, 1686783.000000, 1686783
Metric (0, 4): B Transactions, 0.000000, 0
Metric (0, 5): Retries, 70932.000000, 70932
Metric (0, 6): Mean load time (seconds), 0.000000, 0
Metric (0, 7): Read rows, 1686783.000000, 1.69e+06
Metric (0, 8): Write rows, 8433915.000000, 8.43e+06
Metric (0, 9): Mean Latency (ms), 24.433103, 24.4
Metric (0, 10): Median Latency (ms, averaged), 22.470236, 22.5
Metric (0, 11): 90% Latency (ms, averaged), 30.676603, 30.7
Metric (0, 12): 98% Latency (ms, averaged), 53.017616, 53
Metric (0, 13): Max Latency (ms, averaged), 228.917837, 229
Metric (0, 14): Mean Row Read Latency (ms), 5.856853, 5.86
Metric (0, 15): Median Row Read Latency (ms, averaged), 5.380630, 5.38
Metric (0, 16): Max Row Read Latency (ms, averaged), 187.579632, 188
Metric (0, 17): Mean Total Read Latency (ms), 5.803291, 5.8
Metric (0, 18): Median Total Read Latency (ms, averaged), 5.342722, 5.34
Metric (0, 19): Max Total Latency (ms, averaged), 187.579632, 188
Metric (0, 20): Mean GRV Latency (ms), 7.503022, 7.5
Metric (0, 21): Median GRV Latency (ms, averaged), 7.063389, 7.06
Metric (0, 22): Max GRV Latency (ms, averaged), 35.660267, 35.7
Metric (0, 23): Mean Commit Latency (ms), 9.893580, 9.89
Metric (0, 24): Median Commit Latency (ms, averaged), 9.062767, 9.06
Metric (0, 25): Max Commit Latency (ms, averaged), 55.891752, 55.9
Metric (0, 26): Read rows/sec, 12494.688889, 1.25e+04
Metric (0, 27): Write rows/sec, 62473.444444, 6.25e+04
Metric (0, 28): Bytes read/sec, 1399405.155556, 1.4e+06
Metric (0, 29): Bytes written/sec, 6997025.777778, 7e+06
Metric (1, 0): Measured Duration, 135.000000, 135
Metric (1, 1): Transactions/sec, 12491.244444, 1.25e+04
Metric (1, 2): Operations/sec, 74947.466667, 7.49e+04
Metric (1, 3): A Transactions, 1686318.000000, 1686318
Metric (1, 4): B Transactions, 0.000000, 0
Metric (1, 5): Retries, 73017.000000, 73017
Metric (1, 6): Mean load time (seconds), 0.000000, 0
Metric (1, 7): Read rows, 1686318.000000, 1.69e+06
Metric (1, 8): Write rows, 8431590.000000, 8.43e+06
Metric (1, 9): Mean Latency (ms), 25.181469, 25.2
Metric (1, 10): Median Latency (ms, averaged), 23.118734, 23.1
Metric (1, 11): 90% Latency (ms, averaged), 31.497478, 31.5
Metric (1, 12): 98% Latency (ms, averaged), 54.416656, 54.4
Metric (1, 13): Max Latency (ms, averaged), 315.679073, 316
Metric (1, 14): Mean Row Read Latency (ms), 6.087186, 6.09
Metric (1, 15): Median Row Read Latency (ms, averaged), 5.635738, 5.64
Metric (1, 16): Max Row Read Latency (ms, averaged), 101.420164, 101
Metric (1, 17): Mean Total Read Latency (ms), 6.057966, 6.06
Metric (1, 18): Median Total Read Latency (ms, averaged), 5.623817, 5.62
Metric (1, 19): Max Total Latency (ms, averaged), 101.420164, 101
Metric (1, 20): Mean GRV Latency (ms), 7.647929, 7.65
Metric (1, 21): Median GRV Latency (ms, averaged), 7.230759, 7.23
Metric (1, 22): Max GRV Latency (ms, averaged), 33.132792, 33.1
Metric (1, 23): Mean Commit Latency (ms), 10.048114, 10
Metric (1, 24): Median Commit Latency (ms, averaged), 9.285212, 9.29
Metric (1, 25): Max Commit Latency (ms, averaged), 49.364805, 49.4
Metric (1, 26): Read rows/sec, 12491.244444, 1.25e+04
Metric (1, 27): Write rows/sec, 62456.222222, 6.25e+04
Metric (1, 28): Bytes read/sec, 1399019.377778, 1.4e+06
Metric (1, 29): Bytes written/sec, 6995096.888889, 7e+06
Metric (2, 0): Measured Duration, 135.000000, 135
Metric (2, 1): Transactions/sec, 12489.896296, 1.25e+04
Metric (2, 2): Operations/sec, 74939.377778, 7.49e+04
Metric (2, 3): A Transactions, 1686136.000000, 1686136
Metric (2, 4): B Transactions, 0.000000, 0
Metric (2, 5): Retries, 70142.000000, 70142
Metric (2, 6): Mean load time (seconds), 0.000000, 0
Metric (2, 7): Read rows, 1686136.000000, 1.69e+06
Metric (2, 8): Write rows, 8430680.000000, 8.43e+06
Metric (2, 9): Mean Latency (ms), 23.809055, 23.8
Metric (2, 10): Median Latency (ms, averaged), 22.060156, 22.1
Metric (2, 11): 90% Latency (ms, averaged), 29.850006, 29.9
Metric (2, 12): 98% Latency (ms, averaged), 50.099850, 50.1
Metric (2, 13): Max Latency (ms, averaged), 235.465765, 235
Metric (2, 14): Mean Row Read Latency (ms), 5.736473, 5.74
Metric (2, 15): Median Row Read Latency (ms, averaged), 5.322695, 5.32
Metric (2, 16): Max Row Read Latency (ms, averaged), 177.443027, 177
Metric (2, 17): Mean Total Read Latency (ms), 5.724418, 5.72
Metric (2, 18): Median Total Read Latency (ms, averaged), 5.309343, 5.31
Metric (2, 19): Max Total Latency (ms, averaged), 177.443027, 177
Metric (2, 20): Mean GRV Latency (ms), 7.315672, 7.32
Metric (2, 21): Median GRV Latency (ms, averaged), 6.891727, 6.89
Metric (2, 22): Max GRV Latency (ms, averaged), 36.601305, 36.6
Metric (2, 23): Mean Commit Latency (ms), 9.680379, 9.68
Metric (2, 24): Median Commit Latency (ms, averaged), 8.915186, 8.92
Metric (2, 25): Max Commit Latency (ms, averaged), 54.644346, 54.6
Metric (2, 26): Read rows/sec, 12489.896296, 1.25e+04
Metric (2, 27): Write rows/sec, 62449.481481, 6.24e+04
Metric (2, 28): Bytes read/sec, 1398868.385185, 1.4e+06
Metric (2, 29): Bytes written/sec, 6994341.925926, 6.99e+06
Metric (3, 0): Measured Duration, 135.000000, 135
Metric (3, 1): Transactions/sec, 12506.903704, 1.25e+04
Metric (3, 2): Operations/sec, 75041.422222, 7.5e+04
Metric (3, 3): A Transactions, 1688432.000000, 1688432
Metric (3, 4): B Transactions, 0.000000, 0
Metric (3, 5): Retries, 71062.000000, 71062
Metric (3, 6): Mean load time (seconds), 0.000000, 0
Metric (3, 7): Read rows, 1688432.000000, 1.69e+06
Metric (3, 8): Write rows, 8442160.000000, 8.44e+06
Metric (3, 9): Mean Latency (ms), 24.029755, 24
Metric (3, 10): Median Latency (ms, averaged), 22.102594, 22.1
Metric (3, 11): 90% Latency (ms, averaged), 30.331612, 30.3
Metric (3, 12): 98% Latency (ms, averaged), 51.876068, 51.9
Metric (3, 13): Max Latency (ms, averaged), 296.543598, 297
Metric (3, 14): Mean Row Read Latency (ms), 5.882259, 5.88
Metric (3, 15): Median Row Read Latency (ms, averaged), 5.442858, 5.44
Metric (3, 16): Max Row Read Latency (ms, averaged), 228.428364, 228
Metric (3, 17): Mean Total Read Latency (ms), 5.902114, 5.9
Metric (3, 18): Median Total Read Latency (ms, averaged), 5.451441, 5.45
Metric (3, 19): Max Total Latency (ms, averaged), 228.428364, 228
Metric (3, 20): Mean GRV Latency (ms), 7.293286, 7.29
Metric (3, 21): Median GRV Latency (ms, averaged), 6.917238, 6.92
Metric (3, 22): Max GRV Latency (ms, averaged), 34.911633, 34.9
Metric (3, 23): Mean Commit Latency (ms), 9.678787, 9.68
Metric (3, 24): Median Commit Latency (ms, averaged), 8.921623, 8.92
Metric (3, 25): Max Commit Latency (ms, averaged), 54.420233, 54.4
Metric (3, 26): Read rows/sec, 12506.903704, 1.25e+04
Metric (3, 27): Write rows/sec, 62534.518519, 6.25e+04
Metric (3, 28): Bytes read/sec, 1400773.214815, 1.4e+06
Metric (3, 29): Bytes written/sec, 7003866.074074, 7e+06
4 test clients passed; 0 test clients failed

BEAUTY!
tps: 12,500x4=50,000!
commit latency: 10ms!
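
For anyone who wants to check the arithmetic, here is a minimal sketch that adds up the four test clients. The numbers are copied from the "Transactions/sec" and "Mean Commit Latency (ms)" lines above; the variable names are just mine, and the latency is a plain unweighted average across clients (reasonable here since each client ran almost the same number of transactions).

```python
# Per-client figures copied from the metric dump above
# (Metric (n, 1) and Metric (n, 23) for clients 0..3).
clients = [
    {"tps": 12494.688889, "commit_ms": 9.893580},
    {"tps": 12491.244444, "commit_ms": 10.048114},
    {"tps": 12489.896296, "commit_ms": 9.680379},
    {"tps": 12506.903704, "commit_ms": 9.678787},
]

# Aggregate throughput is the sum over clients.
total_tps = sum(c["tps"] for c in clients)

# Unweighted mean of the per-client mean commit latencies (an approximation).
mean_commit_ms = sum(c["commit_ms"] for c in clients) / len(clients)

print(f"aggregate tps: {total_tps:,.0f}")
print(f"mean commit latency: {mean_commit_ms:.2f} ms")
```

That comes out just under the 50,000 tps target, with commit latency a little below 10 ms.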

So it appears the resolvers were what limited transaction throughput. I can understand how a resolver can become a bottleneck, but could you confirm that its impact is really this large?

Link to status json dump (after the test was run, more useful as a reference for IP/IDs): status json

I still got about 4% conflicts, so I’ll try increasing the resolvers to 8 (double the number of log processes) and see whether that improves the results…
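
The roughly 4% figure can be reproduced from the dump with a quick sketch like the one below. It assumes the "Retries" metric counts conflict-driven retries and divides by the "A Transactions" count; that interpretation is my assumption, not something the test output states.

```python
# (retries, transactions) per client, copied from
# Metric (n, 5) "Retries" and Metric (n, 3) "A Transactions".
per_client = [
    (70932, 1686783),
    (73017, 1686318),
    (70142, 1686136),
    (71062, 1688432),
]

total_retries = sum(r for r, _ in per_client)
total_txns = sum(t for _, t in per_client)

# Assumption: every retry corresponds to one conflict.
retry_rate = total_retries / total_txns

print(f"retries: {total_retries}, transactions: {total_txns}")
print(f"retry rate: {retry_rate:.1%}")
```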

It’s been fun 🙂 Thank you!