Apache Cassandra 4.0: Taming Tail Latencies with Java 16 ZGC

Like so many others in the Apache Cassandra community, I’m extremely excited to see that the 4.0 release is finally here. There are many, many improvements to Cassandra 4.0. One enhancement that is more important than it might look is the addition of support for Java versions 9 and up. This was not trivial, because Java 9 made changes to some internal APIs that the most performance-oriented Java projects like Cassandra relied on (you can read more about this here).

This is a big deal because with Cassandra 4.0, you not only get the direct improvements to performance added by the Apache Cassandra committers, you also unlock the ability to take advantage of seven years of improvements in the JVM (Java Virtual Machine) itself.

Here, I’d like to focus on improvements in Java garbage collection that Cassandra 4.0 coupled with Java 16 offers over Cassandra 3.11 on Java 8.

The garbage collection challenge

In 2012, I gave a talk titled, “Dealing with JVM Limitations in Apache Cassandra.” Here is the first slide from that presentation:

On the one hand, garbage collection is a primary reason that Java is so much more productive than traditional systems languages like C++. As JVM architect Cliff Click once wrote, “Many concurrent algorithms are very easy to write with a GC and totally hard to downright impossible using explicit free.” Cassandra takes full advantage of this power.

But performing garbage collection means having to briefly pause the JVM to determine which objects are no longer in use and can safely be disposed of. These GC pauses can cause delayed response times to client requests, i.e., increased latencies.

Not all requests are affected by this–only the handful of requests that are in flight while Cassandra’s request-handling threads are paused for the GC. The performance impact is thus only visible in tail latencies, that is, the 99th percentile or 99.9th percentile measurements, corresponding to the slowest 1% or 0.1% of requests.

As with so many things, optimizing GC involves tradeoffs, and the original Java GC designs focused more on improving throughput than on reducing pause times. Fast forward to 2021 and we have common server-class CPUs with 64 cores/128 threads—we have plenty of throughput on tap. It’s time to spend some of those cycles on lower pause times.

The Z Garbage Collector (ZGC) was created to address this situation, and specifically to guarantee pause times under 10ms. ZGC was added to Java 11 as an experimental feature, promoted to production in Java 15, and further improved in Java 16.

To show how well ZGC improves Cassandra performance, we compared both throughput and latency in three environments: Cassandra 3.11 running on JDK 8 with its default CMS GC settings, Cassandra 4.0 running on JDK 8 with the same settings, and Cassandra 4.0 running on JDK 16 with ZGC. I’m pleased to report that ZGC convincingly achieves its design goals, allowing Cassandra to deliver nearly-constant latencies through the 99th percentile, with only a modest uptick at the 99.9th percentile!

ZGC performance results

My colleague Jonathan Shook benchmarked the performance characteristics of Cassandra 3.11 and 4.0 in detail across three workloads: simple key/value, a time series workload with many rows per partition, and a tabular workload with one row per partition but many columns per row.

Throughput results

Here we are looking at Cassandra running at 70% of maximum throughput. This leaves 30% operational headroom to absorb compaction, repair, or load spikes for the purposes of realistic measurements.

Cassandra 4.0 running with the same configuration as Cassandra 3.11 is 30% faster in the key/value workload, 2% slower in the time series workload, and 10% faster in the tabular workload. Turning on ZGC unlocks an additional 30% more throughput for key/value and time series workloads, but has no effect on the tabular workload.

Latency results

I’ve split the latency results into one chart per workload so it’s easier to see the trends across the different percentiles:

For these results, we limited each test scenario to the slowest system’s throughput, i.e., we used 30,000, 44,000, and 54,000 requests per second for the key/value, time series, and tabular workloads, respectively.

Cassandra 4.0’s latencies are virtually identical to 3.11’s with the same GC settings, but ZGC is consistently better, up to a solid factor of 5 to 10 better at p99 and p999 percentiles.

The NoSQLBench performance testing suite

Most benchmarks of non-relational databases are done with either product-specific tooling (like cassandra-stress), or with YCSB, which gives you a lowest-common-denominator key-value workload across dozens of systems.

Jonathan Shook created NoSQLBench to be a cross-platform performance testing tool that is easier to use than cassandra-stress and (much) more powerful than YCSB; in fact, its scripting layer is powerful enough to support things that no other testing tool can enable, with particular emphasis on modeling complex workloads with fidelity, as well as simulating realistic scenarios such as load spikes. As its name suggests, NoSQLBench is not Cassandra-specific and encourages participation from all who want to contribute; today there are clients for Cassandra, CockroachDB, JDBC, and MongoDB, as well as non-database products Kafka and Pulsar. If you’re serious about performance testing in 2021, you should check out NoSQLBench. You can get started at GitHub. Other useful links: releases, discord, docs.

The NoSQLBench workload descriptions for the tests in this post can be found here.

Conclusion

Without switching to ZGC, Cassandra 4.0 offers modest but real throughput improvements for key/value and tabular workloads.

Combining Cassandra 4.0 with ZGC in Java 16 results in further improvements to throughput for key/value and time series workloads as well as convincingly demonstrating ZGC’s design goals to make GC pause time a non-issue across all tested workloads for Cassandra 4.0.

ZGC is production-ready starting with Java 15; for enterprises that want to stick with LTS releases, ZGC will be one of the headlining reasons to upgrade to the Java 17 LTS release later this year. ZGC is one of the most significant performance “free lunches” available, and it Just Works—the results shown here are out-of-the-box for ZGC with no extra tuning.

Appendix: Test environment

All tests were run on the same physical cluster of AWS i3.4xl nodes: 16 vCPUs, 122GB RAM, 10Gb network, 5 nodes in the cluster. Storage was configured as XFS on direct NVMe, single volume. All data was stored at RF3. Assigned tokens were used to ensure consistent data distribution across the tested versions. Consistency level for all operations was set as LOCAL_QUORUM. Concurrency from the client side was set at 960 (20x client cores) for the keyvalue test, and 480 (10x client cores) for the time-series and tabular tests. All measurements were taken from the client, and include duration between submitting and fully reading any data in results. All measurements were taken with 3 significant digits of precision, then rounded to the nearest ms. ZGC was configured with basic recommended settings: 16GB min heap, 64GB max heap, large pages enabled. The other numbers are using Cassandra’s out-of-the-box configuration with CMS.

32