How to troubleshoot throughput performance degradation?

performance
(Markus Pilman) #6

It does look a bit like disk saturation to me, or in general it looks like the storage nodes are unable to write to disk quickly enough. It could also be a priority issue (reads blocking writes), which is something we see quite often and which is why we are working on the local ratekeeper.

But I don’t think @rayjcwu is running on EBS. I don’t know what the max throughput of the ephemeral disks on these machines are (but the i3s have very very fast disks).

(Ryan Worl) #7

You definitely need more than 4 prefixes. With only 4 prefixes and triple replication, only 12 storage processes (4 prefixes x 3 replicas, about 10% of the processes in your case) are writing at any given time. This means one falls behind, the shard gets split because it is too hot, the next one falls behind, and so on. You also don’t get a high enough queue depth to saturate the devices.

I would also try out the memory engine if you are going to be rapidly deleting data after you receive it.

(bancron) #8

I’d like to make sure we’re all on the same page about what a prefix is. Currently we are running tests with keys of the form kXX.... and sYY.... where X is a random ASCII letter and Y is a random byte. Is this only two prefixes? How many bytes does each storage process use for sharding?

#9

Thanks for the reply. We currently create a RAID 0 array over all the SSDs and run FoundationDB on it. We are in the process of changing our setup so that each log/storage process runs on its own disk and its own physical CPU.

the 211 MB/s logical write rate seems higher than I would have initially thought to be sustainable.

I’m actually not sure how to estimate the disk requirements to sustain our traffic.

I tried something similar to this approach. In single redundancy mode, with 1 log and 19 storage processes, the log process has durable_bytes.hz around 75 MB/s. With 1 storage and 19 log processes, the storage process has durable_bytes.hz around 15 MB/s. (That post used mutations instead, but I don’t know how to make use of that number.) So in our case, the ratio of log:storage processes should be about 1:5.

To sustain 633 MB/s of physical traffic, I’ll need around 10 log processes and 50 storage processes, each with its own disk, which means a cluster of 60 disks in triple redundancy mode.
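Spelling that out as a quick back-of-the-envelope calculation (a rough sketch that assumes the per-process durable_bytes.hz numbers above hold at scale):

```python
# Rough sizing sketch; the per-process rates are the single-redundancy
# measurements above, not guaranteed numbers.
logical_write_rate = 211            # MB/s of logical client writes
replication_factor = 3              # triple redundancy
physical_write_rate = logical_write_rate * replication_factor  # ~633 MB/s

log_rate_per_process = 75           # MB/s durable_bytes.hz for one log process
storage_rate_per_process = 15       # MB/s durable_bytes.hz for one storage process

logs_needed = physical_write_rate / log_rate_per_process          # ~8.4 -> ~10
storages_needed = physical_write_rate / storage_rate_per_process  # ~42.2 -> ~50

print(logs_needed, storages_needed)  # roughly 10 logs + 50 storages = ~60 disks
```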

Does the above estimation look correct?

However, if I use dd on a single disk with nothing else running, the write rate can be 300+ MB/s (depending on the block size). Does that mean I can run more than one log/storage process per disk?

StorageServerUpdateLag indicates that storage server is in e-brake,

Thank you. Searching the GitHub repo, that’s the only e-brake event I could find.

(Ryan Worl) #10

If the end of your key is a monotonically increasing value, and the beginning of your key only has 4 different values, your writes are all going to 4 shards. Over time those shards will be split into more shards as they grow in size and write bandwidth, but the old shards will never receive any more writes, so you’re back at only 4 shards being written to.

It doesn’t matter if the prefix is 1 byte or 100 bytes.
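To make that concrete, here is a rough sketch (purely illustrative, not your actual schema) of a key layout that funnels all writes into a handful of shards versus one that buckets by a hash so writes spread across many key ranges:

```python
import hashlib
import struct

def hot_key(event_id: int) -> bytes:
    # One fixed prefix followed by an increasing id: every new write lands
    # at the tail of the same (current) shard.
    return b"events/" + struct.pack(">Q", event_id)

def spread_key(event_id: int) -> bytes:
    # One hash-derived byte in front gives up to 256 buckets, so consecutive
    # writes land in many different key ranges (and storage teams).
    bucket = hashlib.sha256(struct.pack(">Q", event_id)).digest()[:1]
    return b"events/" + bucket + struct.pack(">Q", event_id)
```

The trade-off is that reading a contiguous range of recent events then has to fan out across the buckets.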

(Ryan Worl) #11

You should definitely run more than one storage process per disk. See my explanation here: Storage queue limiting performance when initially loading data

(bancron) #12

The “end” of the key is not a monotonically increasing value; the keys are random apart from the 1-byte prefix. I’m not sure how much of the key counts as the “beginning”. Let me detail the exact format of the primary keys (a rough code sketch follows the list).

  1. ‘k’ key: 'k' + key (N random bytes in [a-zA-Z])
  2. ‘s’ key: 's' + 2 random bytes (chosen by hash of key) + 4 bytes of a uint32 representing a timestamp + key
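For concreteness, a sketch of how I understand these keys are assembled (the hash function and timestamp encoding here are assumptions for illustration, not necessarily what our code uses):

```python
import hashlib
import random
import string
import struct

def make_k_key(key: bytes) -> bytes:
    # 'k' + the logical key (N random bytes in [a-zA-Z])
    return b"k" + key

def make_s_key(key: bytes, timestamp: int) -> bytes:
    # 's' + 2 bytes chosen by a hash of the key
    #     + 4 bytes of a uint32 timestamp (big-endian here) + the logical key
    hash_prefix = hashlib.sha256(key).digest()[:2]
    return b"s" + hash_prefix + struct.pack(">I", timestamp) + key

# Example logical key: 16 random ASCII letters
key = "".join(random.choices(string.ascii_letters, k=16)).encode()
```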
(Ryan Worl) #13

If you have two bytes before the timestamp, and those two bytes are sufficiently well distributed, that does rule out the number of prefixes being the problem. I had imagined a key like this:

events/1
events/2
events/3

etc.

(Alex Miller) #14

Someone actually did my recommended benchmarking :open_mouth:

If you are not using enterprise-grade SSDs, I think you’d actually be much better off not raid 0’ing your SSDs, and just running one (or more) fdbservers per disk, rather than many fdbservers for one raid0-ed disk. When FDB calls fsync on a raid0 drive, it’d have to fsync on all of the SSDs, so you’d potentially be having every drive do O(fdbservers) fsyncs, versus O(fdbservers/disks) fsyncs if you don’t raid them. More fsyncs typically means less drive lifetime. The enterprise-grade caveat is that the better classes of SSDs just include a large enough capacitor that they can flush out all the data if power is lost, so that they can turn fsync into a no-op, and thus it doesn’t matter anyway.

I later re-read and saw you’re running on AWS, so the disk lifetime will be of less concern to you, but I’m still not sure that you’ll get better behavior with raid-ing. Either way, I’ll leave the above in case anyone ever reads this with physical disks in mind.

Be careful about sharing disks between transaction logs, because they care just as much about bandwidth as they do fsync latency, and having transaction logs compete for fsyncs is much more dangerous for overall write throughput than storage servers competing for fsyncs.

Your math seems about right, except that the scaling probably isn’t linear. I’ve gone and dusted off my benchmarking results database, and I had run write bandwidth tests with single replication and got:

1 proxy 1 tlog = 50MB/s
3 proxy 3 tlog = 120MB/s
5 proxy 5 tlog = 170MB/s
7 proxy 7 tlog = 200MB/s

For a workload that I don’t remember, but was probably 100B values.

Exact numbers aside, a quadratic regression of -2.5x^2 + 45x + 7.5 is a pretty good fit. Your baseline is 1.5x better than mine, which… hrm. The quadratic peaks at ~9-10 processes, around 210 MB/s, or roughly 300 MB/s after scaling by your 1.5x baseline. So… it actually turns out I don’t have advice for you here, because I don’t have data lying around past what you can do with ~12 transaction-class processes.
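If anyone wants to check that fit and peak themselves, a quick sketch (numpy assumed):

```python
import numpy as np

# (transaction logs/proxies, MB/s) from the single-replication tests above
procs = np.array([1, 3, 5, 7])
mbps = np.array([50, 120, 170, 200])

# Least-squares quadratic fit; comes out to roughly -2.5x^2 + 45x + 7.5
a, b, c = np.polyfit(procs, mbps, 2)

peak_procs = -b / (2 * a)                           # ~9 processes
peak_mbps = a * peak_procs**2 + b * peak_procs + c  # ~210 MB/s (~315 MB/s at a 1.5x baseline)
print(a, b, c, peak_procs, peak_mbps)
```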

My guess is that you’ll probably fall somewhere in the 10-15 log range to make it work, an equal number of proxies, and then a few more to give yourself a bit of headroom. I am concerned, though, that you’re pushing up against the limits of what FDB can do for write throughput right now, so if your use case is going to grow over time, you might be boxing yourself in here. FDB scales reads much better than writes, so a 40:1 ratio favoring writes is going the wrong way for us.

FWIW, FDB 6.2 will make a couple of things the default that resulted in a ~25% write bandwidth improvement in my tests. You can get them now with fdbcli> configure log_version:=3 log_spill:=2. But a 25% change isn’t really going to make or break your use case here…

#15

My guess is that you’ll probably fall somewhere in 10-15 logs to make it work

I reran the load testing with 10 × i3.16xlarge (80 disks total), varying from 10 log and 70 storage processes up to 15 log and 65 storage processes. On each disk there is only one log or storage process, plus an extra stateless process. In every configuration I run the same number of proxy processes as log processes, and about half that number of resolver processes.

All configurations have similar results. The highest throughput I could reach is around 13K/s, and there are always 2 or 3 storage queues far higher than the others. Usually one storage queue is around 900 MB and one is at 1.5 GB, while the log queues are all at the 1.5 GB threshold. Even if I add more disks, going up to 13 log and 107 storage processes, the behavior is the same. From @mengxu’s post I think the bottleneck is the storage process e-brake.

[screenshot: storage queue sizes]

@ryanworl’s response makes me wonder if the issue is coming from our data modeling. We model our key-value-with-expiry data by introducing 4 kinds of auxiliary keys. The majority of them are ‘s’ and ‘k’ keys, since each of our logical keys has these two FoundationDB keys:

‘k’ key: 'k' + key (N random bytes in [a-zA-Z])
‘s’ key: 's' + 2 random bytes (chosen by hash of key) 
                + 4 bytes of a uint32 representing a timestamp + key

Is it because most of the FoundationDB keys start with ‘k’ or ‘s’, so most of the traffic falls on 2 storage processes?

I am concerned though that you’re pushing up against the limits of what FDB can do for write throughput right now

Is this write throughput limit documented somewhere? Our traffic is around the 20K/s level and we expect it to grow over time.

(Meng Xu) #16

Nope. The key space is split into shards (key ranges) almost evenly, based on each shard’s data size. Eventually, data distribution will try to make sure each storage server team has a similar amount of data, but there is a delay. During peak workload, data distribution does not have enough time and computational resources to move shards, which is why a hot spot cannot be quickly mitigated.

One thing I find interesting/weird is that under triple replication, when a hot spot hits a shard, it typically makes 3 storage servers very busy (i.e., high storage queues). In your figure, only 2 storage servers are busy, which makes me curious what happened to the 3rd storage server for that shard.

#17

Oh yeah, it’s also quite common for 3 queues to be very busy, but usually the third one is less than 900 MB. It also reminds me that when there are a few huge storage queues, I sometimes see an alert message similar to “system is behind due to IP:port is behind” in fdbcli --exec status, and sometimes I don’t see that message.

If I pause my client, the cluster can catch up on the huge storage queues in less than 1 minute. But when I resume my client, the queues grow back to the threshold within a few minutes. Does that mean the traffic I’m generating is simply too high? Is there another way to raise the storage processes’ priority for moving data?

(A.J. Beamon) #18

One way to see the relative load on each of the storage servers is to look at the quantity of writes and reads to each. If you are generating your graphs from the status json output, this can be done by getting cluster.processes.<id>.roles.input_bytes (related to bytes written), cluster.processes.<id>.roles.total_queries (number of reads), and cluster.processes.<id>.roles.bytes_queried (bytes read). There are other metrics there too that could provide information on other dimensions (such as number of mutations, keys read, etc.).

If you are generating your graphs from the trace logs rather than status, you can also get this information from the StorageMetrics event. In that case, you’d get the data from the BytesInput, QueryQueue, and BytesQueried fields, respectively.
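If it helps, here’s a rough sketch of pulling those status json fields with Python (the exact field layout is from memory, so treat the paths as assumptions and adjust to whatever your status json actually contains):

```python
import json
import subprocess

# Assumes fdbcli is on PATH and can reach the cluster.
raw = subprocess.check_output(["fdbcli", "--exec", "status json"])
status = json.loads(raw)

for proc_id, proc in status["cluster"]["processes"].items():
    for role in proc.get("roles", []):
        if role.get("role") != "storage":
            continue
        # These are rate counters; the instantaneous rate lives under "hz".
        input_bytes = role.get("input_bytes", {}).get("hz", 0)
        total_queries = role.get("total_queries", {}).get("hz", 0)
        bytes_queried = role.get("bytes_queried", {}).get("hz", 0)
        print(proc_id, input_bytes, total_queries, bytes_queried)
```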

If you see that one of these metrics is notably higher on the processes where your queues get larger, it could give you an idea where to focus your efforts. If none of these metrics look any different on the processes where your queues get large, that would be interesting too.

(Meng Xu) #19

The traffic to the 3-server team is too high. The traffic to the entire cluster may not be, if other storage servers are underused. To make sure all storage servers are used, the data modeling needs to spread the writes across different key spaces (or shards).

As to the data distribution (DD) question, the current DD moves data based on data size instead of data-access frequency, so I doubt DD will help in this situation.

Since “I could pause my client and cluster could catch up the huge storage queues in less than 1 minute”, that is further evidence that the storage servers simply cannot keep up with the write throughput sent to them.

(Alex Miller) #20

I do also find it very suspicious that your first graph had a large number of storage servers all struggling, but your second one had only two.

I don’t see anyone mentioning it here, but if you’re starting from an empty database, then FDB is going to have to compete with your workload to try and split your data sufficiently so that the writes are spread across the servers.

Our own write bandwidth tests pre-load a large volume of data to create enough shards to distribute writes well, wait for data distribution to finish, and then run the actual write bandwidth test.
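As a rough illustration of the pre-load step with the Python bindings (the key layout, value size, and data volume here are made up for the example, not what our internal tests use):

```python
import os
import fdb

fdb.api_version(610)
db = fdb.open()

@fdb.transactional
def load_batch(tr, batch, count):
    # Write a batch of keys so data distribution has enough data to split
    # into many shards before the real write test starts.
    for i in range(count):
        key = b"preload/%08d/%08d" % (batch, i)
        tr[key] = os.urandom(1000)

for batch in range(1000):
    load_batch(db, batch, 500)

# Then wait for data distribution to finish moving shards (e.g. watch
# moving_data in status json) before running the write-bandwidth test.
```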

KrzysFR posted an example of this with good stats from his tooling before.

#21

I reran the load testing. There were 3 storage processes with higher queue sizes in the beginning, but this time the third-largest queue shrank pretty soon. Looking at the storage processes’ input_bytes.hz, after running for a while they basically form 3 groups. The first group (the ~15 MB/s one) are the 3 processes that had higher queue sizes in the beginning. For the other metrics you mentioned, all processes emit values in a similar range.

[screenshots: per-process input_bytes.hz and other storage metrics]

However, even if I know these 3 processes write more data than the other processes, where do I go from there? That 15 MB/s is way lower than what each underlying SSD is capable of, and each is the only process writing to its own disk.

Our implementation doesn’t use the tuple layer but directly uses a byte array as the key. For example, our ‘s’ key has 4 segments ('s', 2 bytes of hash(key), 4 bytes of timestamp, key), and we use the concatenation of the byte representations of each segment as the FoundationDB key.

Does this count as “different key spaces”, or is it all one key space?

(Meng Xu) #22

They should be split into different shards by the data distributor.

I’m wondering if you preloaded a small amount of data to pre-create those shards, as Alex suggested?

#23

Basically, stored_bytes is never imbalanced, from the beginning until the point where 2 or 3 storage processes’ queues fall way behind during my load testing. Do I still need to pre-load data?

[screenshot: stored_bytes per storage process]

From KrzysFR’s post, Evan mentioned […], but since data distribution looks good, that might be a different issue.

(Alex Miller) #24

Hmmm. That would probably suggest that you’re managing to split reasonably.

Can you pastebin a status json so I can double check a couple things?

#25

Here is the “status json” output.