Troubleshooting queue build up

Hi there, I have been troubleshooting some strange queue buildups on the SS process in my FDB cluster (SSD-2, 2-replication, ver 6.0.18). I have not even been able to spot anything obvious standing out that points to this behavior.

It appears, that on a couple of SS nodes, the queue size starts to grow gradually (sometimes two and sometimes just on a single node) and reaches saturation levels.

We do have one component in our data modeling behavior that adds a small key-value pair with a common prefix (that would most likely land on a common shard).

This cluster is of 10 machines hosting 2 fdb process each. One of the FDB processes on each node hosts an SS. The write volume is quite low: ~700 commits/sec with ~3000 key-writes/sec. The bytes write is about 12MB/sec (across the cluster).

I have also checked that the input.bytes is more or less same for all SS nodes (with ±20% from the mean value).

There are no severity 30/40 logs; there is very occasional severity 20 logs. I do see this kind of logs with moderate frequency on the lagging nodes:

<Event Severity="10" Time="1561563010.610610" Type="SSSlowTakeLock2" ID="6adfc187592bcea7" From="0000000000000000" Duration="0.236079" Version="6531103871665" Machine="" LogGroup="default" Roles="SS" />

Few things that may be interesting:

  • Things were running fine on this cluster for around a couple of weeks (cluster had about 2TB of replicated data, and 850GB of unreplicated KV).
  • A few days back, we scheduled a cleaner task, that deletes old data from the store. The “old” data is spread around randomly in the whole KV space. Since then, we have been seeing this peculiar behavior.

As I am running only one SS per disk, could I be saturating the disk with the load mentioned above? I see that we are doing about 1100-1300 write iops/sec (not transaction or keys - but disk write iops) on an average per SS node.

Running following command shows that I am getting about 1200 iops on the disk. I am using iodepth of 1, as I am assuming this is the correct way to simulate fdbserver’s write bandwidth (I’ve read here that it uses only one outstanding write request at a time).

Could this indicate that I am saturating the disk here? Even at such low write rates? And could “deleting” randomly distributed kv rows on an ongoing basis resulting in lesser output per io (due to emptier disk pages) - just shooting in dark here :slight_smile:

fio --filename=/var/lib/ubuntu/fio.test --direct=1 --rw=randwrite --ioengine=libaio --bs=4K --iodepth=1 --name=bw-test --size=100M

ubuntu@platform6:~$ fio --filename=/var/lib/ubuntu/fio.test --direct=1 --rw=randwrite --ioengine=libaio --bs=4K --iodepth=1 --name=bw-test --size=100M
bw-test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/6093KB/0KB /s] [0/1523/0 iops] [eta 00m:00s]
bw-test: (groupid=0, jobs=1): err= 0: pid=2743: Wed Jun 26 16:13:49 2019
write: io=102400KB, bw=4611.8KB/s, iops=1152, runt= 22204msec
slat (usec): min=10, max=8326, avg=24.91, stdev=71.05
clat (usec): min=3, max=709228, avg=838.87, stdev=6296.14
lat (usec): min=313, max=709243, avg=864.18, stdev=6296.50

I will update the thread as I get more understanding, or if someone wants any other specific data points. Just wanted to put out all the observations I have so far to see if someone could spot anything!

I think it is likely you’re saturating the disk here. Your simulation using fio vs the observed throughput doesn’t take into account IO needed for background pruning after your bulk delete nor the IOPS needed for any reads. Disks can’t perform peak write IO in conjunction with read IO. B-trees require reading before writing data, so if any of the pages that needed to be read before you could write were not already in cache, your workload is performing reads and writes at the same time.

Thanks for the suggestion! Just curious, what is typical IOPS expected on a reasonable SSD with above FIO command? I tried this on a regular EBS gp-2 storage with 3K cap on iops, and I got roughly 1500 IOPS rate. Is this a reasonable IOPS rate to expect? If so, it seems that one fdbserver can write data at about 6MB/s (4KB * 1500 iops) in best case (all write IOPS used by SS, b-tree page contains only new data to be written) - does that sound like a reasonable approximation?

If your expected IOPS is 3000 then you may not be saturating the disk. You could confirm testing at higher queue depths or using a mixed read write test. EBS performance is both variable and not great so it is hard to tell.

Your calculation seems accurate although there is more work going on than just one write into FDB equals one write IO. I would test a range of different combinations of read write mix and queue depth to see and try using iostat to see on your real workload how many IOs you’re doing.

We always use iostat -x 1 to see if the disk in question is slow (based on utilization numbers). Clearing random spots in the KV store in the SSD will cause vacuuming and with 6.x the changes are already in there to make that more efficient (the changes from Snowflake which we backported to 5.2.x).

There are knobs to control vacuuming which you can try (which will reduce the rate at which space is returned back to the file system).

Otherwise, you can try and do a full include/exclude to another set of nodes which will stop vacuuming (obviously if you have to do this all the time, then that’s not a solution).

Hi @ryanworl, what I wanted to check here is the theoretical write throughput limits of a single SS node given 1 queue-depth iops throughput:

  • I am assuming that SS writes are done with only a single outstanding write request.
  • I am also assuming that equivalent of this behavior in fio command is to use queue depth of 1.
  • If fio shows about 1200 random write iops limit for a disk (the command I posted earlier), than that will be the best-case write throughput capacity for a single SS process (I am calling this best case assuming that all other read operations are free - SS has a lot more parallelism there, and disks probably can support those in parallel).
  • Again, assuming the best case scenario that each disk page of 4K size consists solely of new data that is being pushed.
  • So does this mean that each SS process can write max of 6MB/s (1500*4K) under these assumptions? I agree that in a real system, this base-case throughput would be harder to achieve due to: (a) not all data in the pages being written will be completely new, (b) background data movement © misc…

Does that reasoning look reasonable? If so, I can then try to keep this check of running fio with single queuedepth and then using the output to make a model of how much write throughput can one SS sustain in the best case (and maybe half of that in average case).

To point out other things that could cause this sort of problem:

  • Your clients could be read hotspot’ing that storage server. Reads are prioritized higher than writes, so too many reads can cause a storage server to lag behind. Be skeptical of this if you’re at 100% CPU usage.
  • You could have a noisy neighbor on that one machine
  • FDB will split shards based on write bandwidth, but I think it actually doesn’t try to then distribute those write-bandwidth shards well. It’s possible that your one storage server is a part of multiple teams that were assigned write-hot shards, and excluding it would force a shard re-assignment that would resolve the issue.
  • A cleaner job doing clear ranges could cause a lot of deferred work. I don’t think we’ve seen FDB6 have saturation issues from a large clear range, but it’s possible it’s AWS/EBS specific? Large clear range performance
  • Data distribution can sometimes cause bad performance effects if it’s too aggressive. I’ve heard other people on EBS needing to dial this down, sometimes.

One storage server only doing an upper limit of ~6MB/s of write sounds reasonable to me. (Lookin’ forward to that faster storage engine, @stevedhams )

Thanks, I will spend more time to see what could be a limiting factor here.

CPU utilization is very moderate - ~35-40%

We are not doing large range clear - instead, we are doing a lot of point clear() - these cleared keys will be spread very uniformly across the key space.

One other thing of note is that each page that you modify actually consists of at least 2 disk writes, one to the WAL file and one to the actual b-tree. Then there’s some potential additional overhead when writes have to take place on additional pages to manage the tree (e.g. splitting and merging pages), though I don’t personally have a good sense how often this happens in practice. I’m sure @SteavedHams could give a better accounting of all of this.

There may also be additional writes on the cluster outside of your workload from data movement, as mentioned bi Alex. In addition to having to write the data on the destination disks, data movement involves clearing the old data and triggers the spring cleaning actions described by others.

Thanks, @panghy. How do you check utilization numbers from iostat output? Is it %util column? I have been discounting this column completely based on following guidance from the man pages:

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance l

Yeah, I think the number isn’t accurate for nvme or lvm drives but for EBS it is accurate. You can also check AWS to see the idle time for that volume.