We’ve been working on a new chaos and performance testing infrastructure for FDB. Simulation has worked well for automated correctness testing, but we have been spending more and more time manually checking for performance regressions and confirming that FDB performs well during failure scenarios. Also, we’d like to start comparing FDB’s price/performance and latencies against other systems. As a first step, I’ve been running YCSB workloads against a 240-disk Kubernetes cluster in Amazon EKS, with an 8TB YCSB database.
We now have the ability to automatically create an FDB cluster in Kubernetes, load it with a YCSB data set, and then produce throughput/latency curves for each of the workloads. While the benchmarks are running we automatically gather perf flamegraphs for a subset of the processes. We try to obtain three profiles for each combination of roles in the system (so, if there are only two TLog processes that are also Coordinators, we’ll only get two “coordinator, tlog” profiles). Three is a nice number to grab, since it lets you tell which one is the outlier if the profiles don’t match.
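The "up to three profiles per role combination" rule can be sketched roughly as follows. This is only an illustration, not our actual tooling: the function name `pick_profile_targets`, the process ids, and the role layout in `example` are all hypothetical.

```python
from collections import defaultdict


def pick_profile_targets(processes, per_combo=3):
    """Pick up to `per_combo` processes for each distinct combination of roles.

    `processes` maps a (hypothetical) process id to the set of roles it runs.
    """
    groups = defaultdict(list)
    for proc_id, roles in sorted(processes.items()):
        # Processes with the same set of roles share one key, whatever the order.
        groups[tuple(sorted(roles))].append(proc_id)
    # Keep at most `per_combo` processes per role combination.
    return {combo: ids[:per_combo] for combo, ids in groups.items()}


# Hypothetical cluster layout, for illustration only.
example = {
    "p1": {"tlog", "coordinator"},
    "p2": {"coordinator", "tlog"},
    "p3": {"storage"},
    "p4": {"storage"},
    "p5": {"storage"},
    "p6": {"storage"},
    "p7": {"proxy"},
}
selected = pick_profile_targets(example)
```

With this layout, only two processes carry the "coordinator, tlog" combination, so only two are selected for it, while the four storage processes are cut down to three.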
The current setup has not been carefully tuned; I’ve been focused on getting the tooling to work reliably. However, I’d like to share these early results so that people can give feedback, and also to let the community know what we’re up to. For the workloads (but not the initial load), I run the workload with target=0 (unthrottled) to obtain FDB’s maximum throughput. Then, I run the same workload over and over again with target set to 10, 20, … 90 percent of that value. I generate plots of all of these runs to confirm that performance is stable and that we’re not suffering from warmup effects. I’ve included plots for “unthrottled”, “80% throughput” and throughput-latency curves here.
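For anyone who wants to reproduce the sweep, the throttled targets can be derived from the unthrottled result along these lines. This is a minimal sketch: the function name `throttled_targets` and the 50,000 ops/sec figure are made up for illustration and are not numbers from these runs.

```python
def throttled_targets(max_throughput, percents=range(10, 100, 10)):
    """Return (percent, target ops/sec) pairs for a throughput sweep.

    `max_throughput` is the ops/sec measured by the unthrottled (target=0) run.
    """
    # Integer division keeps each target a whole number of ops/sec.
    return [(pct, max_throughput * pct // 100) for pct in percents]


# Hypothetical unthrottled result, for illustration only.
sweep = throttled_targets(50_000)
for pct, target in sweep:
    print(f"{pct}% run: target={target}")
```

Each value would then be passed as the `target` setting of one throttled run of the same workload.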
The most surprising result is the peak in the middle of each throughput/latency curve. Ideally, latencies would stay low until the system reaches ~100% throughput, and then rapidly increase because of queueing delay in FDB. Instead, we see peak latencies at 50-66% throughput with a gradual drop-off as we approach 100%. This is likely due to some sort of batching or priority inversion bug. From what I can tell, most of these workloads are bound by proxy CPU. As a side effect of this issue, our tail latencies are roughly half a second for most of the experiments. That’s much higher than we’d like to see.
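A check for this kind of mid-curve peak could be automated with something like the sketch below. The function name `latency_peak` is hypothetical, and the numbers in `illustrative` are invented to exercise the code; they are not measurements from these experiments.

```python
def latency_peak(curve):
    """Find the latency peak in a throughput/latency curve.

    `curve` is a list of (throughput_percent, latency_ms) pairs.
    Returns (peak_percent, peak_latency, is_mid_curve), where is_mid_curve
    is True when the peak occurs below the highest throughput point.
    """
    peak_pct, peak_latency = max(curve, key=lambda point: point[1])
    highest_pct = max(pct for pct, _ in curve)
    return peak_pct, peak_latency, peak_pct < highest_pct


# Invented numbers purely to exercise the function; NOT measured results.
illustrative = [(10, 5.0), (30, 9.0), (50, 40.0), (70, 30.0), (90, 20.0), (100, 15.0)]
peak_pct, peak_latency, is_mid_curve = latency_peak(illustrative)
```

A curve with the ideal shape, where latency only climbs as throughput approaches 100%, would come back with `is_mid_curve` set to `False`.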
Since we automatically gather flamegraphs for each data point and for each role, we can easily drill down into the above results and look at profiles for the suspected role type during the bad behavior. Since we’re concerned about latency at medium throughputs, and suspect the proxies, here is a profile of a proxy from the workload B data point at 80% target throughput. We can see that it spends most of its time in the Linux kernel, presumably sending and receiving small messages:
We could continue to debug this issue by looking at FDB logs, which include latency probes, or by focusing directly on throughput, and addressing the excessive CPU utilization of the networking code.
Over time, we plan to expand our testing to include more instrumentation, more workloads (and storage engines, like Redwood and RocksDB), and also to make various improvements to our test setup.
Finally, this work is designed to complement our chaos testing project; we’ll write more about that later, but, for now, here’s a graph of throughput and latency during an HA failover event in our chaos testing environment. This graph is from Zhe Wu, who combined our HA fault injection setup with our YCSB benchmarking infrastructure. It shows the throughput of a small HA FDB cluster during an HA failover and recovery:
Now that our continuous integration environment can measure performance during such scenarios, we’ll be able to track performance improvements and screen for regressions.