Peeking Inside the JVM
Have you ever found yourself pondering how the Java Virtual Machine (JVM) actually executes your bytecode? While most Java developers are familiar with concepts like just-in-time (JIT) compilation, the actual process can seem like a mysterious black box. So, what does it really entail? The answer, in short, is nothing short of fascinating.
Now, you might be wondering, "Is understanding this process truly essential?" For a large enterprise application where latency measured in multiple hundreds of milliseconds is acceptable, perhaps not. However, for those engaged in performance analysis, or for development teams striving to optimize every millisecond, if not every microsecond, such questions make much more sense.
Two powerful tools help unravel this mystery: JMH and hsdis. The Java Microbenchmark Harness (JMH) provides a framework for constructing precise microbenchmarks, while hsdis is a disassembler plugin that lets the JVM print the native machine code produced by the JIT compiler.
For those interested in diving deeper, compiling the hsdis library can be an enlightening experience. Instructions for compilation can be found here: hsdis GitHub repository.
But fair warning: on my fresh Ubuntu 22.04 setup, the Capstone backend refused to build, and the LLVM backend produced corrupted addresses in its output. Only the binutils backend worked correctly.
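For reference, with a recent OpenJDK source checkout the binutils backend can be built roughly like this. The paths and binutils version below are illustrative, not prescriptive; check the build documentation in the repository for your exact setup:

```shell
# From an OpenJDK source tree: configure hsdis against a binutils
# source download (path and version here are placeholders).
bash configure --with-hsdis=binutils \
     --with-binutils-src=/path/to/binutils-2.40

# Build the hsdis shared library.
make build-hsdis

# Copy the resulting library where the JVM can find it
# (the exact build output directory depends on your platform).
cp build/*/support/hsdis/hsdis-amd64.so "$JAVA_HOME/lib/"
```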
To illustrate the power of these tools, let's embark on a simple experiment: comparing the performance of a regular for loop against a for-each construct.
Let's start with the following code and configuration:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 1, time = 20, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsAppend = {"-XX:+UseParallelGC", "-XX:+AlwaysPreTouch", "-Xms4g", "-Xmx4g"})
@Threads(1)
public class LoopsBenchmark {

    private final int size = 1000;
    private final List<Integer> list = new ArrayList<>(size);

    @Setup
    public void setup() {
        for (int i = 0; i < size; i++) {
            list.add(ThreadLocalRandom.current().nextInt(100_000_000));
        }
    }

    @Benchmark
    public int testForLoop() {
        int r = 0;
        for (int i = 0; i < size; i++) {
            r += list.get(i);
        }
        return r;
    }

    @Benchmark
    public int testForEachLoop() {
        int r = 0;
        for (Integer i : list) {
            r += i;
        }
        return r;
    }
}
The results are certainly intriguing (AMD Ryzen 9 7900X, MSFT OpenJDK v21.0.2):
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 5 0.277 ± 0.002 us/op
LoopsBenchmark.testForLoop avgt 5 0.268 ± 0.001 us/op
At first glance, it appears that the traditional for loop edges out the for-each loop in terms of speed. However, before we draw any conclusions, let's delve deeper into the findings with the help of hsdis.
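The output below comes from JMH's perfasm profiler, which combines Linux perf with hsdis to annotate the hottest generated machine code. Assuming the benchmarks are packaged into the usual self-contained JAR, a run looks something like this (the JAR path is an assumption from the standard JMH archetype layout):

```shell
# Requires Linux perf installed and the hsdis library on the
# JVM's library path (e.g. $JAVA_HOME/lib).
java -jar target/benchmarks.jar LoopsBenchmark -prof perfasm
```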
One notable observation from the output is:
....[Hottest Methods (after inlining)]...........................
80.25% c2, level 4 testForEachLoop_avgt_jmhStub, version 5, compile id 799
18.98% c2, level 4 testForEachLoop_avgt_jmhStub, version 4, compile id 798
Here, we encounter two distinct versions of the hottest code. This challenges my initial assumptions about warmup time; it seems we need to give the JIT compiler more time to fully optimize the code.
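Giving C2 more time is just a matter of bumping the warmup settings on the benchmark class; the exact values below are illustrative, not the ones from the article:

```java
// Longer warmup so C2 reaches its final compiled version
// before measurement starts.
@Warmup(iterations = 2, time = 30, timeUnit = TimeUnit.SECONDS)
```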
After extending the warmup period, we observe only one version of the code. It also becomes evident that the JIT compiler has applied loop unrolling: each iteration of the compiled loop processes four elements rather than one. Once the assembly instructions are aligned, and allowing for perf attributing each hotspot one instruction off, the generated code for both loops appears identical.
So why, then, the difference in performance?
Upon inspecting the compiler annotations, it's apparent that check casts are consistently referenced: the code loads the class word of each element and compares it with Integer's class word. But what if we change the method's return type and accumulator to an Integer object? The generated assembly is definitely different: the check casts are eliminated. However, there's a significant drawback: the benchmark slows down almost sixfold.
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 3 1.563 ± 0.017 us/op
LoopsBenchmark.testForLoop avgt 3 1.590 ± 0.030 us/op
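The boxed variant looks roughly like this, shown here as a plain runnable sketch rather than a JMH method; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxedSum {

    // Same loop as testForEachLoop, but the accumulator is an Integer:
    // every += unboxes r and i, adds them, then re-boxes the result
    // through Integer.valueOf, which is what shows up in the assembly.
    static Integer sumBoxed(List<Integer> list) {
        Integer r = 0;
        for (Integer i : list) {
            r += i;
        }
        return r;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        System.out.println(sumBoxed(list)); // 499500
    }
}
```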
Analysis reveals that the loop is no longer unrolled. Instead, the JIT compiler has inlined the valueOf method into the benchmark code to box the running sum, and the generated code now looks slightly different for the two benchmarks.
Since it appears we've hit a dead end, let's backtrack to using a primitive int and explore another intriguing optimization. According to Agner's optimization tables, integer addition on AMD Zen 4 has a reciprocal throughput of 0.25, i.e., up to four independent adds per cycle (source: Agner's instruction tables). Since perf attributed most of the cost to the add operation, we can potentially improve performance by employing multiple accumulators and manually unrolling the loop by four. Here's how the revised code looks:
@Benchmark
public int testForLoop() {
    int r1 = 0;
    int r2 = 0;
    int r3 = 0;
    int r4 = 0;
    for (int i = 0; i < size; i += 4) {
        r1 += list.get(i);
        r2 += list.get(i + 1);
        r3 += list.get(i + 2);
        r4 += list.get(i + 3);
    }
    return r1 + r2 + r3 + r4;
}
However, to avoid handling a loop tail, we need to increase the size of the list to 1024. And, alas, we encounter an obstacle: the for-each loop cannot be manually unrolled this way. As expected, its benchmark takes slightly more time, while the regular for loop now outperforms its previous version.
Benchmark Mode Cnt Score Error Units
LoopsBenchmark.testForEachLoop avgt 3 0.282 ± 0.001 us/op
LoopsBenchmark.testForLoop avgt 3 0.246 ± 0.001 us/op
The loop is unrolled, so three of the four repeated blocks are omitted here, with the accumulator logic shown at the end.
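If resizing the list is not an option, a scalar tail loop can pick up the leftover elements instead. Here is a minimal sketch of the idea on a plain int array; the class and method names are mine:

```java
public class UnrolledSum {

    // Four independent accumulators keep the adds off a single
    // dependency chain; a scalar tail loop handles the remainder,
    // so the length need not be a multiple of four.
    static int sum(int[] a) {
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
        int i = 0;
        int limit = a.length & ~3; // largest multiple of 4 <= length
        for (; i < limit; i += 4) {
            r1 += a[i];
            r2 += a[i + 1];
            r3 += a[i + 2];
            r4 += a[i + 3];
        }
        int r = r1 + r2 + r3 + r4;
        for (; i < a.length; i++) { // tail: at most 3 elements
            r += a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sum(a)); // 28
    }
}
```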
In conclusion, our exploration has yielded valuable insights. We've uncovered the optimizations applied by the JIT compiler in this scenario, gained interesting perspectives on object types, and confirmed the efficacy of leveraging reciprocal throughput to enhance the performance of specific operations.
Hope it helps with your endeavors.