Peeking Inside the JVM

Have you ever found yourself pondering what happens to Java bytecode inside the Java Virtual Machine (JVM)? While most Java developers are familiar with concepts like just-in-time (JIT) compilation, the actual process can seem like a mysterious black box. So, what does it really entail? The answer is nothing short of fascinating.

Now, you might be wondering, "Is understanding this process truly essential?" For a large enterprise application where latency measured in hundreds of milliseconds is acceptable, perhaps not. However, for those engaged in performance analysis, or for development teams striving to optimize every millisecond, if not every microsecond, such questions make much more sense.

Two powerful tools help unravel this mystery: JMH and hsdis. The Java Microbenchmark Harness (JMH) provides a framework for constructing precise microbenchmarks, while hsdis is a disassembler plugin that lets HotSpot print the native code generated by its JIT compilers.

For those interested in diving deeper, compiling the hsdis library yourself can be an enlightening experience. Instructions for compilation can be found here: hsdis GitHub repository.

But fair warning: on my fresh Ubuntu 22 setup, the Capstone backend didn't want to play nice, and the LLVM backend spat out corrupted addresses. Only the binutils backend behaved itself.
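
Once a working hsdis binary is on the JVM's library path, JMH can annotate the hottest JIT-generated assembly through its perfasm profiler. Here is a minimal runner sketch; the LoopsBenchmark class is the one shown below, the runner class name is mine, and on Linux the profiler also needs perf installed:

import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Illustrative runner class; any standard JMH launch works equally well.
public class LoopsBenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LoopsBenchmark.class.getSimpleName())
                // Prints the hottest regions of generated assembly;
                // requires hsdis on the library path and Linux perf.
                .addProfiler(LinuxPerfAsmProfiler.class)
                .build();
        new Runner(options).run();
    }
}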

To illustrate the power of these tools, let's embark on a simple experiment: comparing the performance of a regular for loop against a for-each construct.

Let's start with the following code and configuration:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 1, time = 20, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsAppend = {"-XX:+UseParallelGC", "-XX:+AlwaysPreTouch", "-Xms4g", "-Xmx4g"})
@Threads(1)
public class LoopsBenchmark {
    private final int size = 1000;
    private List<Integer> list = new ArrayList<>(size);

    @Setup
    public void setup() {
        for (int i = 0; i < size; i++) {
            list.add(ThreadLocalRandom
                .current().nextInt(100_000_000));
        }
    }

    @Benchmark
    public int testForLoop() {
        int r = 0;
        for (int i = 0; i < size; i++) {
            r += list.get(i);
        }
        return r;
    }

    @Benchmark
    public int testForEachLoop() {
        int r = 0;
        for (Integer i : list) {
            r += i;
        }
        return r;
    }
}        

The results are certainly intriguing (AMD Ryzen 9 7900X, MSFT OpenJDK v21.0.2):

Benchmark                       Mode  Cnt  Score   Error  Units
LoopsBenchmark.testForEachLoop  avgt    5  0.277 ± 0.002  us/op
LoopsBenchmark.testForLoop      avgt    5  0.268 ± 0.001  us/op        

At first glance, it appears that the traditional for loop edges out the forEach loop in terms of speed. However, before we draw any conclusions, let's delve deeper into the findings with the help of hsdis.

One notable observation from the output is:

....[Hottest Methods (after inlining)]...........................
  80.25%     c2, level 4  testForEachLoop_avgt_jmhStub, version 5, compile id 799 
  18.98%     c2, level 4  testForEachLoop_avgt_jmhStub, version 4, compile id 798 
        

Here, we encounter two distinct versions of the hottest code. This challenges my initial assumptions about warmup time; it seems we need to give the JIT compiler more time to fully optimize the code.
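
I won't pin down the exact warmup values here, but an adjustment along these lines (the numbers are illustrative) gives C2 enough time to settle on a single compiled version:

// Illustrative only: more and/or longer warmup iterations.
@Warmup(iterations = 3, time = 20, timeUnit = TimeUnit.SECONDS)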

After extending the warmup period, we observe only one version of the code. Additionally, it becomes evident that the JIT compiler has applied loop unrolling: each iteration of the compiled loop processes four elements rather than one. Once the assembly instructions are aligned, and we account for perf attributing hotspots one instruction late, the generated code for the two benchmarks appears identical.

[Image: portion of the JIT-unrolled loop in assembly]
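
Conceptually, the unrolled code behaves like the following Java; this is only an illustration, since C2 transforms its intermediate representation, not source code. Note the single accumulator r: every add depends on the previous one, a detail that will matter later.

// Illustration of C2's unrolling, not actual compiler output.
for (int i = 0; i < size; i += 4) {
    r += list.get(i);       // each addition depends on the previous one,
    r += list.get(i + 1);   // forming a serial dependency chain
    r += list.get(i + 2);
    r += list.get(i + 3);
}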

So why, then, the difference in performance?

Upon inspecting the compiler annotations, it's apparent that check casts are consistently referenced: the code loads the object's class word and compares it with the Integer class word. But what if we alter the method's return type and the accumulator to an Integer object? The assembly code is indeed different, and the check casts are eliminated. However, there's a significant drawback: the benchmark slows down almost sixfold.
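
A sketch of what that variant might look like; the method body and name are my reconstruction, as the original only describes the change:

@Benchmark
public Integer testForEachLoopBoxed() {   // illustrative name
    Integer r = 0;
    for (Integer i : list) {
        // Each += unboxes both operands and re-boxes the sum,
        // effectively r = Integer.valueOf(r.intValue() + i.intValue()).
        r += i;
    }
    return r;
}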

Benchmark                           Mode  Cnt  Score   Error  Units
LoopsBenchmark.testForEachLoop      avgt    3  1.563 ± 0.017  us/op
LoopsBenchmark.testForLoop          avgt    3  1.590 ± 0.030  us/op        

Analysis reveals that loop unrolling is no longer applied. Instead, the JIT compiler has inlined the Integer.valueOf method into the benchmark code to box each intermediate sum, and the generated code now looks slightly different for the two benchmarks.

Since it appears we've hit a dead end, let's backtrack to using a primitive int and explore another intriguing optimization. According to Agner Fog's instruction tables, integer addition on AMD Zen 4 has a reciprocal throughput of 0.25, meaning the core can execute four independent adds per cycle (source: Agner Fog's instruction tables). Since perf attributed most of the cost to the add operation, we can potentially improve performance by manually unrolling the loop by four and using multiple independent accumulators to break the dependency chain. Here's how the revised code looks:

    @Benchmark
    public int testForLoop() {
        int r1 = 0;
        int r2 = 0;
        int r3 = 0;
        int r4 = 0;
        for (int i = 0; i < size; i = i + 4) {
            r1 += list.get(i);
            r2 += list.get(i + 1);
            r3 += list.get(i + 2);
            r4 += list.get(i + 3);
        }
        return r1 + r2 + r3 + r4;
    }        

However, to avoid handling a loop tail, we need to increase the size of the list to 1024. And, alas, we encounter an obstacle: the forEach loop cannot be manually unrolled this way. As expected, the forEach loop benchmark takes slightly more time. The regular for loop, however, now outperforms its previous iteration.

Benchmark                           Mode  Cnt  Score   Error  Units
LoopsBenchmark.testForEachLoop      avgt    3  0.282 ± 0.001  us/op
LoopsBenchmark.testForLoop          avgt    3  0.246 ± 0.001  us/op        
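
For completeness, if the size were not a multiple of the unroll factor, a scalar tail loop could pick up the remainder. A sketch, not part of the benchmark above:

int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
int i = 0;
for (; i <= size - 4; i += 4) {   // unrolled body, four independent accumulators
    r1 += list.get(i);
    r2 += list.get(i + 1);
    r3 += list.get(i + 2);
    r4 += list.get(i + 3);
}
for (; i < size; i++) {           // scalar tail: at most three remaining elements
    r1 += list.get(i);
}
return r1 + r2 + r3 + r4;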

Since the loop is unrolled, we can omit three of the four repeated blocks and instead show only the accumulator logic at the end:

[Image: manually unrolled loop with multiple accumulators, assembly]

In conclusion, our exploration has yielded valuable insights. We've uncovered the optimizations the JIT compiler applies in this scenario, gained interesting perspectives on the cost of object types, and confirmed that exploiting instruction reciprocal throughput can speed up specific operations.

Hope it helps with your endeavors.

