Peeking Inside the JVM
Have you ever found yourself pondering how the Java Virtual Machine (JVM) actually executes your bytecode? While most Java developers are familiar with concepts like just-in-time (JIT) compilation, the actual process can seem like a mysterious black box. So, what does it really entail? The answer, in short, is nothing short of fascinating.
Now, you might be wondering, "Is understanding this process truly essential?" For a large enterprise application where latency measured in multiple hundreds of milliseconds is acceptable, perhaps not. However, for those engaged in performance analysis, or for development teams striving to optimize every millisecond, if not every microsecond, such questions make much more sense.
Two powerful tools help unravel this mystery: JMH and hsdis. The Java Microbenchmark Harness (JMH) provides a framework for constructing precise microbenchmarks, while hsdis is a disassembler plugin that lets the JVM print the native machine code produced by the JIT compiler.
For those interested in diving deeper, compiling the hsdis library can be an enlightening experience. Instructions for compilation can be found here: hsdis GitHub repository.
But fair warning: on my fresh Ubuntu 22.04 setup, the Capstone backend refused to build, and the LLVM backend produced corrupted addresses in its output. Only the binutils backend worked correctly.
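For reference, with a recent OpenJDK source checkout the binutils backend can be built roughly like this. The paths and binutils version below are illustrative, not prescriptive; check the build documentation in the repository for your exact setup:

```shell
# From an OpenJDK source tree: configure hsdis against a binutils
# source download (path and version here are placeholders).
bash configure --with-hsdis=binutils \
     --with-binutils-src=/path/to/binutils-2.40

# Build the hsdis shared library.
make build-hsdis

# Copy the resulting library where the JVM can find it
# (the exact build output directory depends on your platform).
cp build/*/support/hsdis/hsdis-amd64.so "$JAVA_HOME/lib/"
```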
To illustrate the power of these tools, let's embark on a simple experiment: comparing the performance of a regular for loop against a for-each construct.
Let's start with the following code and configuration:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 1, time = 20, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsAppend = {"-XX:+UseParallelGC", "-XX:+AlwaysPreTouch", "-Xms4g", "-Xmx4g"})
@Threads(1)
public class LoopsBenchmark {

    private final int size = 1000;
    private final List<Integer> list = new ArrayList<>(size);

    @Setup
    public void setup() {
        for (int i = 0; i < size; i++) {
            list.add(ThreadLocalRandom.current().nextInt(100_000_000));
        }
    }

    @Benchmark
    public int testForLoop() {
        int r = 0;
        for (int i = 0; i < size; i++) {
            r += list.get(i);
        }
        return r;
    }

    @Benchmark
    public int testForEachLoop() {
        int r = 0;
        for (Integer i : list) {
            r += i;
        }
        return r;
    }
}
The results are certainly intriguing (AMD Ryzen 9 7900X, MSFT OpenJDK v21.0.2):
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 5 0.277 ± 0.002 us/op
LoopsBenchmark.testForLoop avgt 5 0.268 ± 0.001 us/op
At first glance, it appears that the traditional for loop edges out the for-each loop in terms of speed. However, before we draw any conclusions, let's delve deeper into the findings with the help of hsdis.
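The output below comes from JMH's perfasm profiler, which combines Linux perf with hsdis to annotate the hottest generated machine code. Assuming the benchmarks are packaged into the usual self-contained JAR, a run looks something like this (the JAR path is an assumption from the standard JMH archetype layout):

```shell
# Requires Linux perf installed and the hsdis library on the
# JVM's library path (e.g. $JAVA_HOME/lib).
java -jar target/benchmarks.jar LoopsBenchmark -prof perfasm
```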
One notable observation from the output is:
....[Hottest Methods (after inlining)]...........................
80.25% c2, level 4 testForEachLoop_avgt_jmhStub, version 5, compile id 799
18.98% c2, level 4 testForEachLoop_avgt_jmhStub, version 4, compile id 798
Here, we encounter two distinct versions of the hottest code. This challenges my initial assumptions about warmup time; it seems we need to give the JIT compiler more time to fully optimize the code.
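Giving C2 more time is just a matter of bumping the warmup settings on the benchmark class; the exact values below are illustrative, not the ones from the article:

```java
// Longer warmup so C2 reaches its final compiled version
// before measurement starts.
@Warmup(iterations = 2, time = 30, timeUnit = TimeUnit.SECONDS)
```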
After extending the warmup period, we observe only one version of the code. It also becomes evident that the JIT compiler has applied loop unrolling: each iteration of the compiled loop processes four elements rather than one. Once the assembly instructions are aligned, and allowing for perf attributing each hotspot one instruction off, the generated code for both loops appears identical.
So why, then, the difference in performance?
Upon inspecting the compiler annotations, it's apparent that check casts are consistently referenced: the code loads the class word of each element and compares it with Integer's class word. But what if we change the method's return type and accumulator to an Integer object? The generated assembly is definitely different: the check casts are eliminated. However, there's a significant drawback: the benchmark slows down almost sixfold.
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 3 1.563 ± 0.017 us/op
LoopsBenchmark.testForLoop avgt 3 1.590 ± 0.030 us/op
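The boxed variant looks roughly like this, shown here as a plain runnable sketch rather than a JMH method; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxedSum {

    // Same loop as testForEachLoop, but the accumulator is an Integer:
    // every += unboxes r and i, adds them, then re-boxes the result
    // through Integer.valueOf, which is what shows up in the assembly.
    static Integer sumBoxed(List<Integer> list) {
        Integer r = 0;
        for (Integer i : list) {
            r += i;
        }
        return r;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        System.out.println(sumBoxed(list)); // 499500
    }
}
```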
Analysis reveals that the loop is no longer unrolled. Instead, the JIT compiler has inlined the valueOf method into the benchmark code to box the running sum, and the generated code now looks slightly different for the two benchmarks.
Since it appears we've hit a dead end, let's backtrack to using a primitive int and explore another intriguing optimization. According to Agner's optimization tables, integer addition on AMD Zen 4 has a reciprocal throughput of 0.25, i.e., up to four independent adds per cycle (source: Agner's instruction tables). Since perf attributed most of the cost to the add operation, we can potentially improve performance by employing multiple accumulators and manually unrolling the loop by four. Here's how the revised code looks:
@Benchmark
public int testForLoop() {
    int r1 = 0;
    int r2 = 0;
    int r3 = 0;
    int r4 = 0;
    for (int i = 0; i < size; i += 4) {
        r1 += list.get(i);
        r2 += list.get(i + 1);
        r3 += list.get(i + 2);
        r4 += list.get(i + 3);
    }
    return r1 + r2 + r3 + r4;
}
However, to avoid handling a loop tail, we need to increase the size of the list to 1024. And, alas, we encounter an obstacle: the for-each loop cannot be manually unrolled this way. As expected, its benchmark takes slightly more time, while the regular for loop now outperforms its previous version.
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 3 0.282 ± 0.001 us/op
LoopsBenchmark.testForLoop avgt 3 0.246 ± 0.001 us/op
The loop is unrolled, so three of the four repeated blocks are omitted here, with the accumulator logic shown at the end.
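If resizing the list is not an option, a scalar tail loop can pick up the leftover elements instead. Here is a minimal sketch of the idea on a plain int array; the class and method names are mine:

```java
public class UnrolledSum {

    // Four independent accumulators keep the adds off a single
    // dependency chain; a scalar tail loop handles the remainder,
    // so the length need not be a multiple of four.
    static int sum(int[] a) {
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
        int i = 0;
        int limit = a.length & ~3; // largest multiple of 4 <= length
        for (; i < limit; i += 4) {
            r1 += a[i];
            r2 += a[i + 1];
            r3 += a[i + 2];
            r4 += a[i + 3];
        }
        int r = r1 + r2 + r3 + r4;
        for (; i < a.length; i++) { // tail: at most 3 elements
            r += a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sum(a)); // 28
    }
}
```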
In conclusion, our exploration has yielded valuable insights. We've uncovered the optimizations applied by the JIT compiler in this scenario, gained interesting perspectives on object types, and confirmed the efficacy of leveraging reciprocal throughput to enhance the performance of specific operations.
Hope it helps with your endeavors.