Process class and machine sharing deployment questions

arob · August 28, 2019, 2:29pm

Hi, I have a few questions related to deploying FoundationDB that I wasn’t able to confidently find answers to in the docs. If anyone here is able to help out with any of them, I’d really appreciate it.

Are there any safety implications to not telling foundationdb how many instances of each process class you want? Or of not assigning process classes to individual processes?
1a. Assuming it’s purely a performance hit (as suggested but not outright stated by previous threads on this topic and as outlined in https://github.com/apple/foundationdb/issues/552), roughly how bad can it get?
1b. Is there a benefit to specifying the number of instances of each process class that we want even if we can’t assign classes directly to individual processes?
1c. What happens if the number of processes in the cluster (due to machines being added or removed) is different from the total number of requested process class instances?
Can the coordinators be on machines that don’t take on any other roles in the cluster?
If each fdbserver process limits itself to only using one core, can performance problems arise from other processes running on the same machine and potentially using some of that core’s time? I didn’t see anything in the docs about whether fdb does anything to try to reserve that core for itself, or if it entirely relies on the OS scheduler.

Thanks for what appears to be an awesome database, and with docs that are quite comprehensive for the most part! I’ve just struggled a bit to make sure I correctly understand some of the finer concerns around configuring a cluster.

ajbeamon · August 28, 2019, 4:33pm

Before answering any of the questions, I want to define some vocabulary that I’ll be using. A role is a particular job in the cluster that can be assigned to processes, such as the storage server, transaction log, proxies, etc. Processes can take on more than one role.

A process class is a property of each process which determines its affinity for taking on certain roles. Depending on the class and some other factors, a process may or may not actually be assigned any roles. @guarav put together a useful matrix for how process roles can be assigned to classes: Roles / Classes matrix

There shouldn’t be any safety concerns with running your processes with an unspecified class. There also shouldn’t be any safety concern with not specifying your desired number of various roles. The reason to set both is for performance reasons.

If you don’t set the desired number of various roles, your throughput will be limited to whatever the default number of each role can handle. If this is enough throughput for you, there shouldn’t be an issue, but if you start running into a bottleneck with one or more of the roles, you’ll need to request more.

If you don’t set process classes, there are a couple different things that can happen. I believe that one is that you will recruit a storage server on every process, and other roles will be recruited alongside them on the same processes. This means that some of your storage servers will be dedicating some of their CPU resources to other tasks, reducing your storage server capacity. You also could be subjected to various spillover effects if you end up saturating a colocated role. For example, if your read workload is too hot on a storage server that is shared with a proxy, you’ll end up starving the proxy and interfering with the commit path. If your cluster is lightly loaded, I wouldn’t expect this to be a big deal, but otherwise it’s best to avoid a configuration like this.

Another issue is that when transaction logs and storage servers share a disk, you’ll get suboptimal performance. The two roles have very different disk use patters, with the logs performing frequent small fsyncs, and the storage servers doing lots of reads and less frequent fsyncs. When both are using the same disk, you’ll likely see increased latencies from the interference of the two.

In both cases, I’m not certain of the magnitude of the performance loss, but I’d recommend doing a performance test if you’re considering a non-trivial cluster with an unset layout. I do know that in the first case, having co-located roles makes things a lot less clear when trying to diagnose issues.

Specifying the number of each role desired is slightly orthogonal to assigning classes to processes for some of the reasons discussed above. Choosing the number decides what capacity the cluster has for work of the given role type, and you’ll need to do this to scale up if your cluster has reached a bottleneck on a role.

The cluster has a minimum requirement for each role (in some cases, 1). For roles like the transaction log, there is also a requirement that there be enough fault tolerance zones (i.e. separate machines unless configured otherwise). If you don’t meet the minimum requirements, then your cluster will fail to startup and be unavailable. Assuming you meet the minimum requirements, the cluster will recruit as many of each role as it can up to your desired amount while meeting various rules about recruitment.

There’s another fault tolerance related issues that can arise in some configurations that’s worth mentioning here. If you request N transaction logs and you run N transaction class processes (or probably if you have only N processes of unset class), then your cluster will be at risk of availability loss if one of the transaction log processes fails in a way such that it is repeatedly disconnecting and reconnecting to the cluster. What happens in this case is that each time the process disconnects, it will cause the cluster to go through a recovery, and then when the process rejoins, it will recover again to add back the transaction log. These recoveries can take a few seconds, and if frequent enough you won’t be able to get much done.

For this reason, it may be of benefit to run 1 or 2 extra processes (depending on your desired fault tolerance) of each non-storage class, preferably on different machines. For example, a good configuration for a triple redundant cluster would be to have N+2 transaction class processes and M+2 stateless processes, where N is the desired number of transaction logs and M is the number of processes running stateless roles such as proxy, resolver, etc.

Yes, and I think there are two ways to accomplish this. One is to exclude the process that you want to do this to, as a coordinator is the only thing that can run on an excluded process. The better option is to use the coordinator class, which I don’t think is in the matrix above but I believe is only allowed to be a coordinator.

FoundationDB doesn’t attempt to do anything with processor affinity, so it’s up to the OS to decide where processes run. There may be some benefit to trying to configure the affinity, but I’m not sure how much to expect. One thing you may want to be careful about is running too many processes for the number of physical cores on the machine. If you manage to saturate the CPU on the entire machine, you’re likely to get some undesired behavior, such as high latencies or processes being detected as failed, etc.

arob · August 28, 2019, 5:17pm

Thank you for the information! In my context it’d be fairly burdensome to have to configure process classes and role counts for multiple clusters of various (changing) sizes, so I would at least like to see if it’s practical to run a cluster without doing so.

It sounds like I’ll have to just test things out for myself to see the performance impact, which is totally fair. I’ll come back and post what I find for anyone else who’s curious.

arob · August 29, 2019, 3:35pm

Quick followup to help make the testing easier – is looking at each process’s XML logs the only way to determine how many instances of each role there are and which roles are being fulfilled by which processes or is there a more centralized way?

ajbeamon · August 29, 2019, 3:40pm

The easiest way to get this information is from the JSON status document. You can get it from fdbcli by running status json or you can query it from any client by reading the key \xff\xff/status/json. Inside this document, there will be an object cluster.processes that maps process IDs to a description of that process. Each object in that map represents a running fdbserver process and will have a field called roles that lists each role assigned to the process. Included in each role object is a field name role which is the name of that role.

One caveat is that coordinators aren’t included in the roles list until an upcoming release (probably 6.2.3).

arob · September 3, 2019, 3:53am

So I did some testing the other day and unexpectedly found that performance was best when I didn’t configure anything. That probably means I did a bad job configuring the cluster, or that my workload was too naive for better role allocation to help, but I still found it surprising.

I used 10 n2-standard-2 GCE VMs (2 vCPUs and 8GB memory each) with local SSDs. I used ext4 and mounted the disks with -o nobarrier. I ran two fdbserver processes on each VM, and sent load from a separate 16-core machine. I ran a single coordinator process on a separate VM in the same GCE zone.

The workloads I tried were super simple, just C++ programs using the standard client library that did single-key writes or single-key reads, with very small key and value sizes, ranging from 1 to 8 bytes. I didn’t use any batching of operations; each was done in its own transaction. I had to run multiple load generator processes in order to max out the cluster’s throughput, otherwise the client was the bottleneck. For each configuration of the cluster I tried a handful of numbers of client processes/threads and also tried sending load from multiple client VMs.

On just a default cluster with no process classes specified, I got 120k-130k single-key writes per second, or about 300k single-key reads per second.

I then found the 3 processes that were acting as transaction logs, explicitly marked the as the “transaction” class, and set the other process on those 3 machines to be of the “stateless” class to avoid disk contention. This brought throughput down to roughly 105k writes per second, with reads not meaningfully changing, measuring in around 290k per second.

I then un-did the explicit setting of process class and tried requesting certain numbers of roles. Here I suspect I messed up - rather than following the explanatory text that appears to suggest 10 log servers and 8 stateless processes, I followed the example fdbcli commands and asked for 5 proxies and 8 log servers. Writes dropped to around 100k per second, with reads seemingly unaffected.

As one last test, I undid the role changes and tried running with just one process per machine. I was surprised to see that reads were mostly unaffected while writes got cut nearly in half. This is odd, since it would seem to suggest that the cluster was more CPU-bound than I/O-bound, but only on writes. That seems like it’s probably related to the fact that isolating transaction logs from storage servers hurt more than it helped, since you’d only expect that to help if disk I/O was mostly saturated. Perhaps larger value sizes would better exhibit the expected change in performance.

So based on this testing, explicit roles/process classes weren’t a clear win, but it was very limited testing.

ajbeamon · September 3, 2019, 4:24pm

One important question here is what the bottleneck was while you were running the tests. If the bottleneck was the storage servers, then a 30% reduction in storage processes may have been more costly than the benefit gained from separating the other roles (though it’s still possible that you pay a latency penalty for not separating the other roles if you are running below saturation). Similarly, increasing the number of logs and proxies on a cluster that was storage bound probably wouldn’t have much benefit, but would have an extra cost in particular when they are being shared with storage servers.

Are these read and write tests run separately or concurrently?

I think there are a few things that could be contributing to this. In addition to the possibility that you are CPU saturated, another is that with half the number of processes, you will end up with a lot more processes sharing roles with storage servers, reducing their effectiveness. In a read-only test, this probably wouldn’t matter as much because reads don’t drive a ton of load to these other processes, but writes would probably have more of an impact.

The other thing is that the SSD storage engine only has a write queue depth of 1 per process. That means that you may not be able to push your disk to maximum throughput with only one process. See SSD engine and IO queue depth for some additional discussion of this.

As a further experiment, you could try a configuration with a higher storage to other process ratio. For example, we’ve run some clusters with a 7:1 storage to transaction disk ratio, with 2 storage servers per disk. Stateless processes don’t use the disk and can safely run on the same machine as storage processes as long as you have sufficient CPU resources.

Also, be a little wary of using vCPU counts to budget for your processes. It’s probably best to have a physical core available for each process, but if you don’t, you’ll want to be careful that you don’t end up saturating the machine CPU, which could have a number of undesirable and potentially confusing effects.

alexmiller · September 3, 2019, 5:12pm

As the first operation implicitly does a getReadVersion() call if one wasn’t done before, this probably turned into a GetReadVersion benchmark. Your throughput would then end up dominated by proxy network/CPU rather than transaction log writes or storage server reads. Which if that’s realistic for you, is fine, just important to be intentional about.

This would then cause/imply that your first cluster was running 3 transaction logs and 3 proxies?

Which would then recruit 8 logs and 5 proxies. I think you effectively actually benchmarked proxies=3 logs=3 vs proxies=5 logs=8 for your workload.

That suggested configuration completely ignores any mention of the total number of processes in a cluster… running 10 logs with 10 processes is going to be overkill, since 10 logs can write faster than 10 storage servers, so you’ll end up limited by the latter in a long write-heavy test.

If you were still in proxies=5 logs=8 configuration, you would have just put a proxy, log, and storage server on a majority of your processes. Proxy operations are prioritized higher than transaction log or storage server operations, so maybe this makes sense, but I am a little surprised at your results.

ajbeamon · September 3, 2019, 5:53pm

That’s an interesting observation. For the first test with explicitly set transaction and stateless processes, you might expect that separating the proxies would make things better in this regard, though because there were only 3 stateless processes the proxies had to be paired up with the other stateless roles rather than the storage nodes they were paired up with before. It’s plausible that the result of that could be worse, and running more stateless processes should help.

As for the second test where proxies was configured to 5, that itself shouldn’t have resulted in a decrease in throughput if we were bound on the proxies, but maybe the simultaneous increase in logs meant that the proxies were tripling up with logs and storage, and as a result they had less total capacity than with 3. You could try a proxies=5 logs=3 config, which seems like it should be better than proxies=3 logs=3 for handling GRV requests.

It’s hard to say for certain here what the bottleneck was without some additional details from the cluster during the test. Unless you specifically want to test the number of transactions you can start, though, the best bet is probably to increase the number of operations per transaction and eliminate that as a potential issue.

arob · September 3, 2019, 6:30pm

Thanks for the feedback and insights, AJ and Alex! It’s really helpful.

I should probably do some additional testing with larger values and multiple operations per transaction, but I’m not sure how soon I’ll have the time do so. Hopefully this week.

I was a bit unsure of this because none of the cluster’s resources were fully utilized according to fdbcli status. CPU was most heavily utilized, but mostly under 50%, with a couple of processes sometimes getting up into the 60-70% range on certain workloads.

When I cut the number of processes in half to just one per machine, then CPU pretty clearly became the bottleneck according to the numbers in fdbcli status

Separately. I didn’t do any mixed workloads.

Ah, very interesting. Thanks for calling that out.

Also very good to know. I should probably do some additional testing with larger values and multiple operations per transaction. Does GetReadVersion performance scale up with the number of proxy processes?

Yes.

Yes. That’s on me for taking docs literally rather than trying to formulate better numbers, but the lack of any specific guidance around how many of each role should be in clusters of different sizes makes it very tough for anyone who isn’t seasoned in running the system to come up with a good balance.

No, I undid the role changes before going down to one process per machine.

ajbeamon · September 3, 2019, 6:41pm

It’s possible you ran into machine CPU issues with only 1 physical core available. You can also check things like the queue sizes on the storage servers and transaction logs to see if you are getting a backlog of work (only applicable if you are doing writes). Seeing something like that could narrow down where the problem is.

Yes, you should be able to get more read versions if you have more proxies, subject to some of the issues I described such as where more roles are running alongside the proxies, etc.

alexmiller · September 3, 2019, 6:45pm

Sort of? It involves one proxy collecting answers from all other proxies, so adding more proxies doesn’t provide linear scaling. There’s some inflection point where adding more proxies will only hurt, but we’re in the low enough numbers here that I’d expect offering more CPU/network would be a net benefit even if it requires another message to be sent.
Aggressive batching is done both client-side and within the proxy, so it should work out that a higher GRV-per-second target causes that latency to be higher, but should still be possible.

However, someone did some benchmarking before that made it look like proxy-side GRV batching might be broken at saturation, so I’m not feeling overly confident about that.

Configuration and role settings are two separate things. Once you’ve explicitly configured your database via fdbcli> configure logs=8 proxies=5, then that setting will stick as your preference regardless of how you change around process classes.

ajbeamon · September 3, 2019, 7:01pm

My understanding is that both the single process per machine configuration and the two per machine, logs=8 proxies=5 configuration were done with unset process classes. Is that correct?

arob · September 3, 2019, 7:17pm

I’m sorry, apparently I was wrong about this. I thought that’s what I did, but taking a second look at my saved status output it looks like I failed to set those back down to their defaults, so that there were still 8 logs and 5 proxies when I was running with only 10 processes (one on each VM).

ajbeamon · September 3, 2019, 7:18pm

Was this with unset process class?

arob · September 3, 2019, 7:26pm

Yes, process classes were all unset.

Topic		Replies	Views
Cluster tuning cookbook Using FoundationDB	26	8852	February 1, 2019
Production optimizations Using FoundationDB	20	6436	August 15, 2018
Why doesn't my cluster performance scale when I double the number of machines? Using FoundationDB performance	20	3303	August 17, 2018
We have some machines and cpu and a little ssd Using FoundationDB	5	775	July 23, 2019
WARNING: A single process is both a transaction log and a storage server Using FoundationDB	16	1772	August 13, 2019

Process class and machine sharing deployment questions

Related topics