How to troubleshoot throughput performance degradation?

You should definitely run more than one storage process per disk. See my explanation here: Storage queue limiting performance when initially loading data

The “end” of the key is not a monotonically increasing value; the keys are random apart from the 1-byte prefix. I’m not sure how much of the key counts as the “beginning”. Let me detail the exact format of the primary keys.

  1. ‘k’ key: 'k' + key (N random bytes in [a-zA-Z])
  2. ‘s’ key: 's' + 2 random bytes (chosen by hash of key) + 4 bytes of a uint32 representing a timestamp + key

If you have two bytes before the timestamp, and those two bytes are sufficiently distributed, that does rule out the number of prefixes being the problem. I had imagined keys like this

events/1
events/2
events/3

etc.

Someone actually did my recommended benchmarking :open_mouth:

If you are not using enterprise-grade SSDs, I think you’d actually be much better off not raid 0’ing your SSDs, and just running one (or more) fdbservers per disk, rather than many fdbservers for one raid0-ed disk. When FDB calls fsync on a raid0 drive, it’d have to fsync on all of the SSDs, so you’d potentially be having every drive do O(fdbservers) fsyncs, versus O(fdbservers/disks) fsyncs if you don’t raid them. More fsyncs typically means less drive lifetime. The enterprise-grade caveat is that the better classes of SSDs just include a large enough capacitor that they can flush out all the data if power is lost, so that they can turn fsync into a no-op, and thus it doesn’t matter anyway.

I later re-read and saw you’re running on AWS, so the disk lifetime will be of less concern to you, but I’m still not sure that you’ll get better behavior with raid-ing. Either way, I’ll leave the above in case anyone ever reads this with physical disks in mind.

Be careful about sharing disks between transaction logs, because they care just as much about bandwidth as they do fsync latency, and having transaction logs compete for fsyncs is much more dangerous for overall write throughput than storage servers competing for fsyncs.

Your math seems about right, except that the scaling probably isn’t linear. I’ve gone and dusted off my benchmarking results database, and I had run write bandwidth tests with single replication and got:

1 proxy 1 tlog = 50MB/s
3 proxy 3 tlog = 120MB/s
5 proxy 5 tlog = 170MB/s
7 proxy 7 tlog = 200MB/s

For a workload that I don’t remember, but was probably 100B values.

Numbers aside, a quadratic regression of -2.5x^2 + 45x + 7.5 comes out as a pretty good fit. Your baseline is 1.5x better than mine, which… hrm. With that scaling, ~10 processes hits the peak of the quadratic at about 300MB/s. So… it actually turns out I don’t have advice for you here, because I don’t have data lying around past what you can do with ~12 transaction-class processes.
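
For the curious, the fit and its peak are easy to reproduce from the numbers above; here’s a quick sketch (Python/numpy; the 1.5x factor at the end is just the baseline comparison mentioned above):

    import numpy as np

    # Throughput points from the single-replication tests above: (proxies/tlogs, MB/s)
    x = np.array([1, 3, 5, 7])
    y = np.array([50, 120, 170, 200])

    # Quadratic least-squares fit; these four points happen to fit it exactly.
    a, b, c = np.polyfit(x, y, 2)
    print(a, b, c)                           # -2.5, 45.0, 7.5

    # Vertex of the parabola: where the fit predicts peak write bandwidth.
    peak_x = -b / (2 * a)                    # ~9 transaction-class processes
    peak_y = a * peak_x**2 + b * peak_x + c  # ~210 MB/s
    print(peak_x, peak_y, 1.5 * peak_y)      # ~315 MB/s after the 1.5x baseline scaling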

My guess is that you’ll probably fall somewhere in 10-15 logs to make it work, an equal number of proxies, and then a few more to give yourself a bit of headroom. I am concerned, though, that you’re pushing up against the limits of what FDB can do for write throughput right now, so if your use case is going to grow over time, you might be boxing yourself in here. FDB scales reads much better than writes, so a 40:1 ratio favoring writes is going the wrong way for us.

FWIW, FDB 6.2 will make a couple of things default that resulted in a ~25% write bandwidth improvement in my tests. You can get them now by running configure log_version:=3 log_spill:=2 in fdbcli. But a 25% change isn’t really going to make or break your use case here…

My guess is that you’ll probably fall somewhere in 10-15 logs to make it work

I reran the load testing with 10 i3.16x instances (80 disks total), varying from 10 log and 70 storage processes to 15 log and 65 storage processes. On each disk there is only one log or storage process, plus an extra stateless process. In every configuration I run the same number of proxy processes as log processes, and about half that number of resolver processes.

All configurations have similar results. The highest throughput I could reach is around 13K/s, and there are always 2 or 3 storage queues way higher than the others. Usually one storage queue is around 900MB and another is at 1.5GB, while the log queues are all at the 1.5GB threshold. Even if I add more disks, reaching 13 log and 107 storage processes, the behavior is the same. From @mengxu’s post I think the bottleneck is the storage processes’ e-brake.


@ryanworl’s response makes me wonder if the issue is coming from our data modeling. We model our key/value-with-expiry by introducing 4 kinds of auxiliary keys. The majority of them are ‘s’ and ‘k’ keys, since each of our logical keys has these two FoundationDB keys:

‘k’ key: 'k' + key (N random bytes in [a-zA-Z])
‘s’ key: 's' + 2 random bytes (chosen by hash of key) 
                + 4 bytes of a uint32 representing a timestamp + key

Is it because most of the FoundationDB keys start with ‘k’ and ‘s’, so most of the traffic falls on 2 storage processes?

I am concerned though that you’re pushing up against the limits of what FDB can do for write throughput right now

Is this write throughput limit posted somewhere? Our traffic is around the 20K/s level and we expect it to grow over time.

Nope. The key space will be split into shards (key ranges) almost evenly based on the shards’ data size.
Eventually, data distribution will try to make sure each storage server team has a similar amount of data.
But there is a delay.
During peak workload, data distribution does not have enough time and computation resources to move shards. That’s why a hot spot cannot be quickly mitigated.

One thing I find interesting/weird is that under triple replication, when a hot spot happens on a shard, it typically makes 3 storage servers very busy (say, with high storage queues). In your figure, only 2 storage servers are busy, which makes me curious what happens to the 3rd storage server for that shard?

Oh yeah, it’s also quite common that there are 3 very busy queues, but usually the third one is less than 900MB. It also reminds me that when there are a few huge storage queues, sometimes I see an alert message similar to “system is behind due to IP:port is behind” in “fdbcli --exec status”, and sometimes I don’t see that message.

I can pause my client and the cluster catches up the huge storage queues in less than 1 minute. But when I resume my client, the queues grow back to the threshold within a few minutes. Does that mean the traffic I set is simply too high? Is there another way to raise the storage processes’ priority on moving data?

One way to see the relative load on each the storage servers is to look at the quantity of writes and reads to each. If you are generating your graphs using the status json output, this can be done by getting cluster.processes.<id>.roles.input_bytes (related to bytes written), cluster.processes.<id>.roles.total_queries (number of reads), and cluster.processes.<id>.roles.bytes_queried (bytes read). There are other metrics there too that could provide more information on other dimensions (such as number of mutations, keys read, etc.).
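
For example, a minimal sketch of pulling those fields out of the status json output with Python (the field paths are as described above; the exact layout can vary a bit between FDB versions):

    import json
    import subprocess

    # Grab the machine-readable status from fdbcli and print per-storage-role metrics.
    raw = subprocess.check_output(["fdbcli", "--exec", "status json"])
    status = json.loads(raw)

    for pid, proc in status["cluster"]["processes"].items():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            print(pid,
                  "input_bytes/s:", role["input_bytes"]["hz"],      # bytes written per second
                  "queries/s:", role["total_queries"]["hz"],        # read requests per second
                  "bytes_queried/s:", role["bytes_queried"]["hz"])  # bytes read per second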

If you are generating your graphs from the trace logs rather than status, you can also get this information from the StorageMetrics event. In that case, you’d get the data from the BytesInput, QueryQueue, and BytesQueried fields, respectively.

If you see that one of these metrics is notably higher on the processes where your queues get larger, it could give you an idea where to focus your efforts. If none of these metrics look any different on the processes where your queues get large, that would be interesting too.


The traffic to the 3-server team is too high. The traffic to the entire cluster may not be, if other storage servers are underused. To make sure all storage servers are used, the data modeling needs to spread the writes across different key spaces (or shards).

As to data distribution (DD) question, the current DD moves data based on data size instead of data-access frequency. I doubt DD will help in this situation.

Since “I could pause my client and cluster could catch up the huge storage queues in less than 1 minute”, that is further evidence that the storage server simply cannot keep up with the write throughput sent to it.

I do also find it very suspicious that your first graph had a large number of storage servers all struggling, but your second one had only two.

I don’t see anyone mentioning it here, but if you’re starting from an empty database, then FDB is going to have to compete with your workload to try and split your data sufficiently so that the writes are spread across the servers.

Our own write bandwidth tests pre-load a large volume of data to create enough shards to distribute writes well, wait for data distribution to finish, and then run the actual write bandwidth test.

KrzysFR posted an example of this with good stats from his tooling before.


I reran the load testing. There were 3 storage processes with higher queue sizes in the beginning, but this time the third largest queue shrank pretty soon. Looking at the storage processes’ input_bytes.hz after running for a while, they basically form 3 groups. The first group (the ~15MB/s one) is the 3 processes that had the higher queue sizes in the beginning. For the other metrics you mentioned, all processes emit values in a similar range.


However, even if I know these 3 processes write more data than the other processes, where do I go from that? That 15MB/s is way lower than each underlying SSD’s capability, and each of them is the only process writing to its own disk.

Our implementation didn’t use the tuple layer but directly uses a byte array as the key. For example, our ‘s’ key has 4 segments (‘s’, 2 bytes of hash(key), 4 bytes of timestamp, key); we use the concatenation of the byte representations of each segment as the FoundationDB key.
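
Roughly, the construction looks like this (a simplified sketch in Python; the specific hash and the big-endian packing here are illustrative stand-ins rather than our exact code):

    import hashlib
    import struct
    import time

    def make_k_key(key: bytes) -> bytes:
        # 'k' key: 'k' + key
        return b"k" + key

    def make_s_key(key: bytes, ts: int) -> bytes:
        # 's' key: 's' + 2 bytes of hash(key) + 4-byte uint32 timestamp + key
        hash_prefix = hashlib.sha1(key).digest()[:2]   # stand-in for our real hash
        ts_bytes = struct.pack(">I", ts)               # 4 bytes of a uint32 timestamp
        return b"s" + hash_prefix + ts_bytes + key

    print(make_k_key(b"someLogicalKeyABC"))
    print(make_s_key(b"someLogicalKeyABC", int(time.time())))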

Does this count as “different key spaces”, or is it all in one key space?

They should be split into different shards by the data distributor.

I’m wondering if you preloaded some small amount of data to pre-create those shards, as Alex suggested?

Basically “stored_bytes” is never imbalanced, from the beginning until 2 or 3 storage processes’ storage queues are way behind during my load testing. Do I still need to pre-load data?


From KrzysFR’s post, Evan mentioned a related point, but since data distribution looks good, that might be a different issue.

Hmmm. That would probably suggest that you’re managing to split reasonably.

Can you pastebin a status json so I can double check a couple things?

Here is the “status json” output.

I’ve been helping Ray debug this. We’ve managed to simplify the problem considerably. We can reproduce it by writing fully random keys (no reads, no prefixes) at a high rate (saturating FDB, so about ~50-100kps) for about 6-7 minutes. Following that, FoundationDB’s log queue size spikes, storage queue size diverges (different processes having wildly different sizes), and throughput drops.

Here’s an example with 10 i3.16xlarge instances running at 3x redundancy.

Here’s 3 i3.16xlarge instances running at 1x redundancy. (A single instance had its log queue instantly maxed out, which didn’t allow us to demonstrate this odd divergence behavior).

Here’s 10 i3.16xlarge instances running at 1x redundancy. This is stable, suggesting that the problems we’re observing have something to do with the ratio of total machine/disk count to workload. However, I still don’t fully understand what’s happening.

My current best guess is that there’s some kind of cleanup or other secondary work being done after that 6-7 minute mark, which causes the load on the log processes to exceed their handling capacity. I’m not clear on what that would be, however.

Any feedback would be enormously appreciated.

I assume that in the evaluation, the tLog and storage servers use different SSDs. In other words, an SSD won’t be shared by tLog and storage server processes. Is that correct?

@ajbeamon should have a better explanation.

That’s correct! Each machine has 8 SSDs, and only one stateful process is assigned to each disk, so there’s no disk sharing between log and storage processes.

If it would be helpful, I can try to clean up our load generation code to make it easier to repro. I can also share any other metrics that’d be useful.
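
In the meantime, the core of the load generator boils down to something like this (a simplified sketch using the Python bindings; the API version, batch size, and key/value sizes here are illustrative, not our exact parameters):

    import os
    import fdb

    fdb.api_version(610)   # assumption: pick the API version matching your cluster
    db = fdb.open()

    @fdb.transactional
    def write_batch(tr, n=100, value=b"x" * 100):
        # Fully random keys: no reads, no common prefix.
        for _ in range(n):
            tr[os.urandom(16)] = value

    # Run this loop from many client processes/threads to saturate the cluster.
    while True:
        write_batch(db)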

I think the log queues are growing here because the storage queues are. This is because anything in flight on the storage servers is also in flight on the logs (the logs can’t remove data from their queues until the storage servers that need that data have made it durable).

My guess at the diverging behavior is that each of the processes is experiencing roughly the same problem, but with slight variations in rate and start time. Depending on your configuration, some number of storage queues will reach ~1GB and then level off as ratekeeper kicks in. Any storage server that’s doing a little bit better than the limiting storage server will likely then have its queue fall back down since it’s able to do just a bit more than is required of it.

The remaining question is why your disks suddenly can’t keep up after several minutes of running. One possibility is that this is a property of the disk, for example due to SSD garbage collection starting after a while. We’ve seen many cases of disks behaving markedly worse after periods of sustained load, and that explanation would match the behavior here reasonably well. How busy are the disks during your test?

If you are starting the test from an empty or small database, it may also be possible that the performance regime changes as your b-tree gets larger than cache, resulting in an increased number of reads going to disk (I’m assuming the ssd storage engine). It feels a little bit like that should appear as a more gradual degradation, but maybe not. If you are running your tests from empty, you could potentially eliminate this effect by pre-loading your database to a decent size. You could also limit the writes in your test to existing keys, in which case the database shouldn’t change in size so long as your value sizes are similar.

Nothing else immediately comes to mind to explain why performance would degrade over time. Assuming the problem isn’t something bad going on in the cluster, the resolution to this is likely going to be that you’ll either need to decrease your load or increase your computing resources. It seems like the storage servers do ok at 1/3 load (from the 10 instance 1x redundancy test), so running 30 instances at 3x redundancy seems likely to work.


Thanks for the thorough response!

The remaining question is why your disks suddenly can’t keep up after several minutes of running. One possibility is that this is a property of the disk, for example due to SSD garbage collection starting after a while. We’ve seen many cases of disks behaving markedly worse after periods of sustained load, and that explanation would match the behavior here reasonably well. How busy are the disks during your test?

We see per-disk writes on the order of 20-40kbps. Per Amazon’s specs, the i3.16xlarge instance should support 1.4M IOPS, so we’re almost two orders of magnitude below the listed performance limits (if I’m reading them correctly).

If you are starting the test from an empty or small database, it may also be possible that the performance regime changes as your b-tree gets larger than cache, resulting in an increased number of reads going to disk (I’m assuming the ssd storage engine).

This is a very good suggestion. After my last post, we discovered that performance instantly diverged when we ran on a “pre-loaded” database, rather than taking several minutes / ~20GB of stored data first. This makes your hypothesis seem quite likely.

We’re currently doing tests on scaling, to confirm that adding more machines increases our throughput. (Earlier tests were inconclusive.) I’ll post again once that’s done.

Thanks again!