Cluster tuning cookbook

shamatar · June 19, 2018, 2:35pm

Hello everyone.

I’m try to properly configure a cluster and struggling with the following behavior:

There are 8 machines, each is 4 cores, 8GB of memory
Each machine runs one FDB instance
Engine is double ssd
There are 5 coordinators in total

Right now the transaction speed is just a little above 11k per second. Each transaction is 4 reads and 2 writes (non-atomics), key is around 64 bytes, value is less than 100 bytes. Looks like it’s quite underperforming and I’d appreciate some advice why.

Should I increase the number of instances to 1 per CPU core?
If I increase number of instances should I also guarantee some IOPS for each process?
What would be suggested number of resolvers/proxies/logs and other roles for such cluster?

Sincerely, Alexander

shamatar · June 19, 2018, 3:54pm

Additional information:

test parameters:

; This file is used by the Atlas Load Generator
testTitle=RandomReadWriteTest
testName=ReadWrite
testDuration=10.0
transactionsPerSecond=100000
writesPerTransactionA=2
readsPerTransactionA=1
; Transactions A/B
alpha=0
; Produces 1GB
nodeCount=100000
valueBytes=100
; average 600
minValueBytes=80
discardEdgeMeasurements=false
warmingDelay=20.0
timeout=300000.0
databasePingDelay=300000.0

fdbserver -r multitest -f /tmp/testData --num_testers 1
setting up test (RandomReadWriteTest)...
running test...
RandomReadWriteTest complete
checking tests...
fetching metrics...
Metric (0, 0): Measured Duration, 10.000000, 10
Metric (0, 1): Transactions/sec, 8859.800000, 8.86e+03
Metric (0, 2): Operations/sec, 26579.400000, 2.66e+04
Metric (0, 3): A Transactions, 88598.000000, 88598
Metric (0, 4): B Transactions, 0.000000, 0
Metric (0, 5): Retries, 26556.000000, 26556
Metric (0, 6): Mean load time (seconds), 1.399409, 1.4
Metric (0, 7): Read rows, 88598.000000, 8.86e+04
Metric (0, 8): Write rows, 177196.000000, 1.77e+05
Metric (0, 9): Mean Latency (ms), 2125.749589, 2.13e+03
Metric (0, 10): Median Latency (ms, averaged), 1974.245071, 1.97e+03
Metric (0, 11): 90% Latency (ms, averaged), 4956.706047, 4.96e+03
Metric (0, 12): 98% Latency (ms, averaged), 6177.048922, 6.18e+03
Metric (0, 13): Max Latency (ms, averaged), 9932.500124, 9.93e+03
Metric (0, 14): Mean Row Read Latency (ms), 717.623410, 718
Metric (0, 15): Median Row Read Latency (ms, averaged), 966.956854, 967
Metric (0, 16): Max Row Read Latency (ms, averaged), 1134.870291, 1.13e+03
Metric (0, 17): Mean Total Read Latency (ms), 712.500015, 713
Metric (0, 18): Median Total Read Latency (ms, averaged), 967.903614, 968
Metric (0, 19): Max Total Latency (ms, averaged), 1134.870291, 1.13e+03
Metric (0, 20): Mean GRV Latency (ms), 909.371216, 909
Metric (0, 21): Median GRV Latency (ms, averaged), 986.372471, 986
Metric (0, 22): Max GRV Latency (ms, averaged), 1154.510498, 1.15e+03
Metric (0, 23): Mean Commit Latency (ms), 19.985437, 20
Metric (0, 24): Median Commit Latency (ms, averaged), 7.279873, 7.28
Metric (0, 25): Max Commit Latency (ms, averaged), 141.917706, 142
Metric (0, 26): Read rows/sec, 8859.800000, 8.86e+03
Metric (0, 27): Write rows/sec, 17719.600000, 1.77e+04
Metric (0, 28): Bytes read/sec, 939138.800000, 9.39e+05
Metric (0, 29): Bytes written/sec, 1878277.600000, 1.88e+06
1 test clients passed; 0 test clients failed

1 tests passed; 0 tests failed, waiting for DD to end...

ajbeamon · June 20, 2018, 8:40pm

Is this potentially saturating the client? What happens if you run multiple clients?

shamatar · June 21, 2018, 10:34am

Right now the system simulates some production transactions. If the transaction itself is commented-out (only the transaction, not the preparation steps), it gives 50k+ operations per second.

For details: system itself is written in GO and uses 16 hyperthreading cores (GOMAXPROC=16). It has one global fdb.Database object (that may be a problem, didn’t find a clear answer if such object is single-threaded inside).

ajbeamon · June 21, 2018, 11:30pm

Are you saying that your test is a Go program, or is it the multitest run from above?

In either case, each client process has a single network thread, and it’s possible to saturate that thread. I don’t know if that’s happening to you here, but in order to fully take advantage of a cluster it’s usually necessary to run more than one client process to drive load. To do that, you’ll actually need to start multiple client processes, as we don’t support running multiple network threads in a process.

See also the discussion here: Benchmarking FoundationDB on AWS - #4 by ajbeamon

shamatar · June 22, 2018, 5:58pm

I’ve tried running two simulators in parallel too, each would give the half of the maximum 11k/second speed.

Is there any way to suggest a FoundationDB how to shard the keys being written? Our key structure right now is quite simple, it’s 4 byte common prefix and than almost uniform distribution.

Also, each fdbserver process can serve different roles (log, proxy, etc). Is it suggested to have one role per instance?

alexmiller · June 22, 2018, 10:47pm

Each fdbserver process is single-threaded, so it’s generally a good idea to run as many fdbserver processes as roles that you’ll end up recruiting (and extra fdbserver processes will end up just being idle). I’d suggest only pointing one fdbserver process per machine to the disk that you have though. (Unless the one process isn’t enough to saturate your disk, which is sometimes the case. Then point two per machine to the disk.)

Recruitment isn’t always the best at assigning roles in a way that benefit performance the most. (Partly due to lack of intelligence, partly due to lack of topology information.). For our own performance tests, we explicitly assign each process a process class that’s restrictive enough the strongly guide recruitment to a good recruitment.

In particular, I’d recommend making sure that…
(1) Each proxy is running on a different host
(2) The transaction log isn’t sharing a disk with a storage server
(3) Assign an explicit process to server as the cluster controller, and another to serve as the master

As you have eight hosts, you can probably lay out a process configuration that looks like:

1: proxy storage
2: proxy storage
3: resolution storage
4: log
5: log
6: storage cluster_controller
7: storage master
8: storage

Assuming that you’re targeting a proxies=2 logs=2 config.

The goal of this layout is to consider the network bandwidth limits that are available, and maximizing the possible throughput. Commits come in to the proxies and flow to the transaction logs. By having a dedicated network link for both proxies and both logs, we make sure we haven’t accidentally cut our possible bandwidth in half. By assigning explicit process classes for each role, we make sure that we aren’t competing with another role recruited in the same process for CPU time, which would noticeably increase latency. Assigning only one storage role per disk means that we aren’t competing for IO bandwidth (or fsync’s).

I’d suggest looking at the total CPU, network, and disk usage on the relevant machines during your test to give you a hint on how to further tweak the best logs and proxies setting for your config. It might make sense to run 2 proxies on a host if your one proxy process is very high on CPU usage, but not fully utilizing the network, or running an extra log might turn out to increase your throughput.

KrzysFR · June 23, 2018, 11:38am

That’s very interesting.

I have a few questions as well

Q1: What happens if host 6 (from the example above) goes down? Will the cluster_controller role be automatically assigned to another host? But if all the other others nodes either want to be proxy, log or storage, will the cluster_controller role be assigned to one of them anyway? And if host 6 goes back online, will it then take over the cluster_controller role immediately ? Or will it stick to the other host until it goes down, before returning to the preferred host 6?

Q2: If a process is only assigned CPU oriented roles like proxy, cluster_controller or master, can we safely reduce their memory usage down from 4GB to something like 2GB ?

I have a few servers that have a RAM to core ratio that is somewhere between 1 GB to 2GB (due to the current high pricing for RAM modules). This means that I’m left with a lot of unused CPU cores that could help share the load.

So, for example, I have 8 cores and 16 GB per server, and currently I can only assign 4 process with 4GB each and will only use halt the cores.

I could change the layout to have 2 fdbserver process with 4 GB each assigned to storage-oriented roles, and 4 more process with 2GB each for cpu-oriented roles. This would still use the same 16 GB total, but now I’m using 6 out of 8 cores, which looks like a better utilization to me.

Q3: In your example above, you list one or more roles per host, but I assume that they will run more than one fdbserver process. Should we also assign specific roles to each process manually? For exeaple for host 2, should I assign the proxy class to process 4500, and storage class to process 4501? What If I have 4 process per host, but only have 2 roles to assign ?

Q4: When in doubt, is it better to assign more proxy roles to extra process (ex: 4 process on a machine with only two disks, the extra 2 could serve as proxy?), or is there a thing as too many proxies for a certain size of cluster?

Q5: is it possible to configure a process to reject a specific role, instead of assigning one? Meaning blacklisting certain roles for some process, but letting it be free to choose any other role? I’m looking at for example banning log or storage class all 3rd or 4th process on each host, if I only have two disks (so that 1st and 2nd process each can have one, but the other 2 process cannot use them).

shamatar · June 25, 2018, 12:57am

I’ve done additional test by making parallel requests to two clusters of FoundationDB (each giving 10k transactions per second if used as the only cluster). And the total result is the same 10k transactions per second. Looks like there is a network congestion. I use Google Cloud Compute Engine that should provide 1GB/s channels, and FoundationDBs are colocated in the same data center and private network. Maximum outgoing traffic is 115 Mbit/s, much below the provided bandwidth.

Does anyone know how does FDB client process work? If it’s also single-threaded it can be a limiting factor for my case.

Sincerely, Alex

KrzysFR · June 25, 2018, 6:55am

It is single threaded.

shamatar · June 25, 2018, 11:35am

May be I should clarify this a little. At least with GO bindings an fdb client starts a “networking” by calling some C function. If such C networking process is single threaded, than yes, I have a transaction sending congestion and starting separate GO processes each with it’s own C “networking” will solve an issue. Although if this C “networking” process is actually some system-wide started daemon (highly unlikely, but still), than starting more virtual machines will solve my problem. I’ll experiment later today and report, didn’t dig into sources yet.

KrzysFR · June 25, 2018, 1:50pm

Each process will spawn its own networking thread which is unique per process (1). The role of the networking thread is to talk to the database, and also execute the future callbacks when they become ready.

I think that the go binding schedules the continuation of asynchronous operations on a different thread than the fdb network thread, but it can still become saturated just by the cost of executing all the callbacks, especially if you are doing a lot small single-key reads concurrently. Maybe changing your algorithm to use more range reads could potentially reduce the callback overhead?

Spawning multiple GO processes will help you get more throughput because you will have multiple network threads running concurrently, utilizing more of your CPU cores.

(1) you can have multiple thread per process when running the multi-version client, but this is only true while you are in the process of upgrading the database to a new version. In normal operation, only one will be really active.

alexmiller · June 25, 2018, 5:58pm

It’s a preference, but not a requirement. A process with the process class of cluster_controller is viewed as a the best location for a cluster_controller role, but if one isn’t available, then any stateless process will do. If a process with cluster_controller process class joins the cluster, then we’ll notice that there’s a better location, and begin a recovery to re-recruit everything in a better location. (I believe this might not actually be true specifically for cluster_controller, but is true for everything else. You’d be looking for betterMasterExists() in the code.)

Depending on the role… maybe, and it’ll depend heavily by role, and partly on your workload. I think that after #432, it’d be safer to reduce the memory on proxies, as long as you set SERVER_MEM_LIMIT accordingly. I don’t think I have specific advice here, you’d more have to look at usage stats and see.

There’s also knobs to reduce the amount of cache that a storage server holds, if you’d like to try reducing memory on storage servers.

More than one fdbserver process was my intention, yes. If you don’t specify a role, then it gets the “Unset” role, which is viewed as an okay place to put anything.

Upon further thought, I’m actually not sure that everything that has an “Unset” role wouldn’t be picked up by data distribution to be recruited as a storage server. I’ll try to remember to check this later today.

GetReadVersion, which is done by a client at the beginning of every transaction, involves a proxy talking to every other proxy to find the maximum committed version. So there’s definitely a point where adding more proxies slows things down, rather than speeding things up.

In general no, but in that specific case, you can assign the stateless role to a process, which then means we prefer not assigning anything that involves durable storage (logs and storage servers) to that process.

… I had actually thought there was a way to start up fdbserver and not point it at a disk as a data location, which would automatically configure it into a stateless process, but I appear to have been wrong.

ajbeamon · June 25, 2018, 6:12pm

I think it’s a bit stronger than this, as the stateless role cannot run logs or storage nodes.

Running more proxies than necessary may have downsides, but running more nodes with the proxy (or, perhaps better, stateless) class shouldn’t be a problem. For example, you could specify in your configuration that you desire 5 proxies and run 8 proxy class processes. Then, if one of your proxy class processes dies, you’ll have 3 good candidates for a replacement to go. If you run only 5 proxy class processes, then when a proxy dies you could either end up with 4 proxies or a 5th proxy on a process with a different class, depending on what classes you are using in your cluster.

KrzysFR · June 26, 2018, 11:31am

Thank you, this makes it a lot clearer for me.

Regarding the stateless role, what happen if I have more process with stateless class, than I need for a given configuration? I run a configuration that requires 5 proxy, but I have 10 process with class stateless, what will the extra process do if all the other roles are assigned ? Will this processes just stay idle, waiting to be recruited in case a proxy fails?

ajbeamon · June 26, 2018, 5:44pm

The stateless processes can be a destination for several roles, including the master, proxies, cluster controller, and resolvers. But yes, if you have more stateless processes than required for your configuration, some of them will be assigned no roles and will essentially be idle until needed.

alexmiller · June 30, 2018, 8:24pm

CC: @amirouche, as I think this is relevant to your “how do I run faster on EC2” thread as well.

After talking with Evan, I’ve come to understand that recruitment is suboptimal if you’re running all of your processes without specifying a process class. Data Distribution will recruit storage servers on everything that it can, so every process will become a storage server in addition to whatever other roles that it’s assigned. This means that:

Storage servers will run in the same process as a proxy, causing higher latencies for all proxy operations.
Storage servers will run against the same disk as a transaction log, making the two compete for disk bandwidth and sync operations.
Many storage servers will run against the same disk, resulting in an overall slower throughput for each storage server.

The solution here is to assign process classes:

Mark all but two processes per disk as stateless
Reserve as many disks as logs= in the configuration, and mark one process per disk as a transaction log, (with other being stateless)

KrzysFR · July 2, 2018, 1:49pm

I have a hard time finding the corresponding knobs, and also how to configure them from within foundationdb.conf. I see the “–memory” argument for fdbserver but this seems like a memory threshold before killing the process.

For example, If I want to limit a stateless class process to 2 GiB, what are the knobs I need to change, and how to I set their values from foundationdb.conf?

If I only set memory = 2GiB, I think that it will always try to consume 4GB and be repeatedly killed when it goes above 2GB, which is not what I’d like (I’d prefer that it attempt to only use 2GB, and maybe kill it if it goes above 4GB.).

alexmiller · July 2, 2018, 7:18pm

Hm. Apparently, we don’t have any form of --help that lists all the knobs you can set, and it takes --dev-help for it to even print that it’s possible to set knobs.

You’re looking for --knob_server_mem_limit=, so you should be able to add knob_server_mem_limit = <2GB as an integer> to your foundationdb.conf, and it should work. Take a look at the various Knobs.h files (e.g. fdbserver/Knobs.h) for the full listings.

The storage server page cache knob is --knob_page_cache_4k, that describes the total number of bytes that should be held in the page cache (so divide by 4k to get the number of pages). It turns out that it defaults to 200MB, so it’s likely not going to be helpful to tweak down. (Though we should probably set the default higher in this day and age…)

SERVER_MEM_LIMIT only applies to proxies. I don’t think we have a generic way to limit any role to try and stay within 2GB, if you have the limit set as 4GB?

KrzysFR · July 2, 2018, 7:22pm

@alexmiller thanks, that’s what I was looking for. This does not seem to be explained in the documentation.

I assume that the convention is that knob FOO_BAR_BAZ in Knobs.h will be exposed as knob_foo_bar_baz in the .conf file ? Or is there another file that maps the key in the conf file into the knob constant?

note: I’m using github’s search to look for the places in the code, but I really don’t trust it to be accurate, because all I see is https://github.com/apple/foundationdb/search?q=SERVER_MEM_LIMIT&unscoped_q=SERVER_MEM_LIMIT

Topic		Replies	Views
Performance tuning and availability Running FoundationDB performance	0	365	June 10, 2022
Why doesn't my cluster performance scale when I double the number of machines? Using FoundationDB performance	20	3265	August 17, 2018
Cluster scaling Using FoundationDB performance	0	268	November 20, 2023
Understanding about testing class and performance tuning in Foundationdb Kubernetes Operator performance	2	591	February 15, 2023
Transaction/operation throughput Using FoundationDB performance	10	1966	January 23, 2020

Cluster tuning cookbook

Related topics