JSSE vs BoringSSL for Java
A couple of years ago, we conducted an extensive research project comparing implementations of the SSL stack for Java. Our primary focus was on comparing OpenJDK's built-in implementation (JSSE) with Google's BoringSSL. Both implementations use a number of ASM intrinsics, and our goal was to understand how they would perform within our environment. To achieve this, we developed a specialized application with minimal business logic, resembling our regular applications and designed to stress test the SSL stack.
The first experiment was performed using OpenJDK 11.0.8, and the result was quite fascinating. BoringSSL outperformed JSSE by a large margin, saturating a 4-core container at a much higher queries-per-second (QPS) rate.
Upon profiling, we discovered an interesting distinction: because the benchmark ran on older servers from our test pool, JSSE relied solely on a pure Java implementation, whereas BoringSSL used ASM intrinsics even on CPUs lacking AES support.
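A quick way to tell which situation a given test server is in is to look at the CPU flags before running the benchmark. The snippet below is a minimal sketch, assuming a Linux x86 host where /proc/cpuinfo lists an "aes" flag; the class and method names are made up for illustration and are not part of our benchmark application.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: checks whether the CPU advertises the AES instruction set.
final class AesFlagCheck {

    // Returns true if any "flags" line of the given cpuinfo text lists exactly "aes".
    static boolean hasAesFlag(List<String> cpuinfoLines) {
        for (String line : cpuinfoLines) {
            if (!line.startsWith("flags")) {
                continue; // only the "flags : ..." lines carry CPU features on x86
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String[] flags = line.substring(colon + 1).trim().split("\\s+");
            if (Arrays.asList(flags).contains("aes")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path cpuinfo = Path.of("/proc/cpuinfo"); // assumption: Linux on x86
        if (!Files.isReadable(cpuinfo)) {
            System.out.println("/proc/cpuinfo is not readable; cannot determine AES support");
            return;
        }
        System.out.println("CPU advertises AES: " + hasAesFlag(Files.readAllLines(cpuinfo)));
    }
}
```

On the JVM side, the HotSpot flags UseAES and UseAESIntrinsics show whether the AES intrinsics are enabled; their effective values can be inspected with -XX:+PrintFlagsFinal.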
For our second experiment, we replaced the test servers with ones that supported AES. Despite this change, BoringSSL continued to demonstrate slightly better performance with lower CPU usage.
Profiling showed that JSSE demands more CPU. Interestingly, it uses a number of separate intrinsics for different functions, such as:
BoringSSL intrinsics can be found here:
Perf analysis showed some differences in behavior of these two implementations under the same level of stress:
JSSE
531059.591354 task-clock (msec) # 3.660 CPUs utilized
383,765 context-switches # 0.723 K/sec
14,445 cpu-migrations # 0.027 K/sec
5,444 page-faults # 0.010 K/sec
1,538,866,827,388 cycles # 2.898 GHz
1,533,535,705,304 instructions # 1.00 insn per cycle
238,418,753,708 branches # 448.949 M/sec
3,528,489,566 branch-misses # 1.48% of all branches
407,662,417,708 L1-dcache-loads # 767.640 M/sec
38,993,722,335 L1-dcache-load-misses # 9.57% of all L1-dcache hits
2,668,985,697 LLC-loads # 5.026 M/sec
139,203,291 LLC-load-misses # 5.22% of all LL-cache hits
BoringSSL
496268.032497 task-clock (msec) # 3.414 CPUs utilized
730,270 context-switches # 0.001 M/sec
22,751 cpu-migrations # 0.046 K/sec
1,820 page-faults # 0.004 K/sec
1,417,815,098,625 cycles # 2.857 GHz
1,327,769,512,850 instructions # 0.94 insn per cycle
223,769,906,085 branches # 450.905 M/sec
3,603,222,973 branch-misses # 1.61% of all branches
400,508,230,047 L1-dcache-loads # 807.040 M/sec
36,881,390,647 L1-dcache-load-misses # 9.21% of all L1-dcache hits
2,422,680,228 LLC-loads # 4.882 M/sec
94,921,348 LLC-load-misses # 3.92% of all LL-cache hits
The BoringSSL implementation uses fewer CPU cycles and is slightly more effective with cache usage (higher load throughput and fewer misses).
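To put numbers on that, the relative differences can be derived directly from the counters above. This is just arithmetic on the two perf outputs, written as a small sketch; the class and helper names are illustrative.

```java
import java.util.Locale;

// Derives relative differences from the two perf outputs quoted above.
final class PerfCompare {

    // Percentage by which 'boring' is lower than 'jsse' (positive means BoringSSL used less).
    static double reductionPct(long jsse, long boring) {
        return 100.0 * (jsse - boring) / jsse;
    }

    // Plain ratio, used here for instructions per cycle.
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long jsseCycles = 1_538_866_827_388L, boringCycles = 1_417_815_098_625L;
        long jsseInsns = 1_533_535_705_304L, boringInsns = 1_327_769_512_850L;
        long jsseLlcMisses = 139_203_291L, boringLlcMisses = 94_921_348L;

        System.out.printf(Locale.ROOT, "cycles:          %.1f%% fewer%n",
                reductionPct(jsseCycles, boringCycles));
        System.out.printf(Locale.ROOT, "instructions:    %.1f%% fewer%n",
                reductionPct(jsseInsns, boringInsns));
        System.out.printf(Locale.ROOT, "LLC-load-misses: %.1f%% fewer%n",
                reductionPct(jsseLlcMisses, boringLlcMisses));
        System.out.printf(Locale.ROOT, "IPC: JSSE %.2f, BoringSSL %.2f%n",
                ratio(jsseInsns, jsseCycles), ratio(boringInsns, boringCycles));
    }
}
```

With the counters above this works out to roughly 7.9% fewer cycles, 13.4% fewer instructions and 31.8% fewer LLC load misses for BoringSSL, even though its instructions per cycle are slightly lower (0.94 vs 1.00).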
I also found another interesting anomaly when comparing the throughput of FP instructions:
JSSE
30,070,913 fp_arith_inst_retired.scalar_single
3,688,641 fp_arith_inst_retired.scalar_double
BoringSSL
30,044,099 fp_arith_inst_retired.scalar_single
2,329,832 fp_arith_inst_retired.scalar_double
So BoringSSL retires fewer scalar double-precision SSE instructions, while the scalar single-precision counts are nearly identical. But since the absolute numbers (over 140 s elapsed) are very low, this difference can be pretty much ignored.
We also wanted to experiment with JDK 11.0.12 and "-XX:UseAVX=3" in view of that patch, but testing it requires at least an Ice Lake CPU.
BoringSSL won by a small margin.
We found the biggest difference in behavior for larger payloads.
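If you want a rough feel for how payload size affects the JDK side, the bulk cipher can be timed in isolation. The snippet below is only a sketch under stated assumptions: it assumes an AES-GCM cipher suite with a 128-bit key, it measures just the JDK's "AES/GCM/NoPadding" cipher through the standard javax.crypto API, and it covers neither the TLS handshake and record layer nor BoringSSL. The class name, payload sizes and iteration counts are arbitrary, and this is not the application we used for the research.

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Locale;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical micro-benchmark: AES-GCM encryption throughput per payload size.
final class GcmPayloadBench {
    static final int TAG_BITS = 128;

    // 12-byte IV built from a counter, so a (key, IV) pair is never reused.
    static byte[] ivFor(long counter) {
        return ByteBuffer.allocate(12).putInt(0).putLong(counter).array();
    }

    static SecretKeySpec randomKey() {
        byte[] k = new byte[16]; // assumption: AES-128
        new SecureRandom().nextBytes(k);
        return new SecretKeySpec(k, "AES");
    }

    static byte[] seal(Cipher c, SecretKeySpec key, long counter, byte[] plain) throws Exception {
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, ivFor(counter)));
        return c.doFinal(plain);
    }

    static byte[] open(Cipher c, SecretKeySpec key, long counter, byte[] sealed) throws Exception {
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, ivFor(counter)));
        return c.doFinal(sealed);
    }

    // Encrypts 'iterations' messages of 'payloadBytes' each and returns plaintext MB/s.
    static double mbPerSec(int payloadBytes, int iterations) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        SecretKeySpec key = randomKey();
        byte[] plain = new byte[payloadBytes];
        long sink = 0; // consume the results so the work cannot be optimized away
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += seal(c, key, i, plain).length;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        if (sink == 0) {
            throw new IllegalStateException("no ciphertext produced");
        }
        return (double) payloadBytes * iterations / 1e6 / seconds;
    }

    public static void main(String[] args) throws Exception {
        int[] sizes = {256, 4096, 16384}; // arbitrary payload sizes in bytes
        for (int size : sizes) {
            mbPerSec(size, 20_000); // warm-up pass so the JIT can compile the hot path
            System.out.printf(Locale.ROOT, "%6d bytes: %.1f MB/s%n", size, mbPerSec(size, 50_000));
        }
    }
}
```

A loop like this is no substitute for a proper harness such as JMH or for a full end-to-end stress test, but it is enough to see whether throughput scales with payload size on a given machine and JDK build.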
Since the time of this research, a few performance patches have been merged into the JDK code (e.g. this one), so the results of this comparison may be outdated.
Thanks to Vivek Deshpande and Tyler Horth for the help with this research.