JSSE vs BoringSSL for Java

Aliaksei Dubrouski

Sr Staff Software Engineer at LinkedIn

Published: April 7, 2024

A couple of years ago, we conducted an extensive research project comparing various implementations of the SSL stack for Java. Our primary focus was on comparing the native implementation in OpenJDK with Google's BoringSSL. Both implementations use a number of ASM intrinsics, and our goal was to understand how they would perform within our environment. To achieve this, we developed a specialized application with minimal business logic resembling our regular applications and designed to stress test the SSL stack.

The first experiment was performed using OpenJDK 11.0.8, and the result was quite fascinating. BoringSSL outperformed JSSE by a large margin, saturating a 4-core container at a much higher queries-per-second (QPS) rate.

Upon profiling, we discovered an interesting distinction: due to the benchmarking being executed on older servers from our test pool, JSSE relied solely on a pure Java implementation, whereas BoringSSL utilized ASM intrinsics even for CPUs lacking AES support.
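One way to confirm whether HotSpot has enabled its AES intrinsics on a given host is to read the relevant VM flags at runtime. The sketch below is not from the original benchmark. It assumes a HotSpot-based JDK that exposes the `UseAES` and `UseAESIntrinsics` flags (as x86 builds of OpenJDK 11 do), and the class and helper names are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class AesFlagCheck {
    // Returns the current value of a VM flag, or "n/a" if this JVM build
    // does not know a flag with that name.
    static String flagValue(HotSpotDiagnosticMXBean bean, String name) {
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "n/a";
        }
    }

    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Flag names are assumptions about the running HotSpot build.
        for (String flag : new String[] {"UseAES", "UseAESIntrinsics"}) {
            System.out.println(flag + " = " + flagValue(bean, flag));
        }
    }
}
```

On a CPU without AES instructions both flags would be expected to report `false`, which matches the pure Java code path seen in the profile.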

For our second experiment, we replaced the test servers with ones that supported AES. Despite this change, BoringSSL continued to demonstrate slightly better performance with lower CPU usage.

[Chart: HTTP/1.1 (H1) and HTTP/2 (H2) with 4k and 128k payloads]

Profiling showed that JSSE demands more CPU. Interestingly, it uses a number of separate intrinsics for different functions, such as:

BoringSSL intrinsics can be found here:

Perf analysis showed some differences in behavior of these two implementations under the same level of stress:

JSSE

     531059.591354      task-clock (msec)   #  3.660 CPUs utilized          
           383,765      context-switches    #  0.723 K/sec                  
            14,445      cpu-migrations      #  0.027 K/sec                  
             5,444      page-faults         #  0.010 K/sec                  
 1,538,866,827,388      cycles              #  2.898 GHz
 1,533,535,705,304      instructions        #  1.00  insn per cycle           
   238,418,753,708      branches            #  448.949 M/sec                    
     3,528,489,566      branch-misses       #  1.48% of all branches          
   407,662,417,708      L1-dcache-loads     #  767.640 M/sec                    
    38,993,722,335      L1-dcache-load-misses # 9.57% of all L1-dcache hits
     2,668,985,697      LLC-loads           # 5.026 M/sec                    
       139,203,291      LLC-load-misses     # 5.22% of all LL-cache hits

BoringSSL

     496268.032497      task-clock (msec)   # 3.414 CPUs utilized          
           730,270      context-switches    # 0.001 M/sec                  
            22,751      cpu-migrations      # 0.046 K/sec                  
             1,820      page-faults         # 0.004 K/sec                  
 1,417,815,098,625      cycles              # 2.857 GHz                      
 1,327,769,512,850      instructions        # 0.94  insn per cycle
   223,769,906,085      branches            # 450.905 M/sec                    
     3,603,222,973      branch-misses       # 1.61% of all branches
   400,508,230,047      L1-dcache-loads     # 807.040 M/sec                    
    36,881,390,647      L1-dcache-load-misses # 9.21% of all L1-dcache hits
     2,422,680,228      LLC-loads            # 4.882 M/sec                    
        94,921,348      LLC-load-misses      # 3.92% of all LL-cache hits

The BoringSSL implementation uses fewer CPU cycles and is slightly more efficient with cache usage (better throughput and fewer misses).
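To make the comparison concrete, the derived metrics can be recomputed from the raw counters. The following is a minimal sketch that uses only the numbers from the two perf listings above; the class and method names are illustrative.

```java
import java.util.Locale;

public class PerfDerived {
    // Instructions retired per CPU cycle.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    // How much smaller `candidate` is than `baseline`, in percent.
    static double reductionPct(long baseline, long candidate) {
        return 100.0 * (baseline - candidate) / baseline;
    }

    public static void main(String[] args) {
        // Counter values copied from the perf output (JSSE first, BoringSSL second).
        long jsseCycles = 1_538_866_827_388L, bsslCycles = 1_417_815_098_625L;
        long jsseInsns = 1_533_535_705_304L, bsslInsns = 1_327_769_512_850L;
        long jsseLlcMisses = 139_203_291L, bsslLlcMisses = 94_921_348L;

        System.out.printf(Locale.ROOT, "IPC: JSSE %.2f, BoringSSL %.2f%n",
                ipc(jsseInsns, jsseCycles), ipc(bsslInsns, bsslCycles));
        System.out.printf(Locale.ROOT, "Fewer cycles with BoringSSL: %.1f%%%n",
                reductionPct(jsseCycles, bsslCycles));
        System.out.printf(Locale.ROOT, "Fewer instructions with BoringSSL: %.1f%%%n",
                reductionPct(jsseInsns, bsslInsns));
        System.out.printf(Locale.ROOT, "Fewer LLC load misses with BoringSSL: %.1f%%%n",
                reductionPct(jsseLlcMisses, bsslLlcMisses));
    }
}
```

With these inputs BoringSSL comes out at roughly 8% fewer cycles and 13% fewer instructions. Its IPC is slightly lower (0.94 vs 1.00), so the saving comes from executing fewer instructions rather than from executing them faster.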

I also found another interesting anomaly when comparing the throughput of FP instructions:

JSSE

        30,070,913      fp_arith_inst_retired.scalar_single                                   
         3,688,641      fp_arith_inst_retired.scalar_double

BoringSSL

        30,044,099      fp_arith_inst_retired.scalar_single                                   
         2,329,832      fp_arith_inst_retired.scalar_double

So the throughput of scalar SSE instructions is lower for the BoringSSL implementation. But since the absolute numbers (over 140 s elapsed) are very low, this can be pretty much ignored.
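A quick way to see why these counts are negligible is to express them as a rate and as a share of all retired instructions. This sketch uses the JSSE counters quoted above and the 140 s elapsed time mentioned in the text; the class and method names are illustrative.

```java
import java.util.Locale;

public class FpShare {
    // Average events per second over the measurement window.
    static double perSecond(long count, double elapsedSeconds) {
        return count / elapsedSeconds;
    }

    // Fraction of all retired instructions that were scalar FP instructions.
    static double shareOfInstructions(long fpCount, long instructions) {
        return (double) fpCount / instructions;
    }

    public static void main(String[] args) {
        // JSSE run: scalar_single + scalar_double, and total instructions.
        long fp = 30_070_913L + 3_688_641L;
        long instructions = 1_533_535_705_304L;
        double elapsedSeconds = 140.0; // elapsed time as stated in the text

        System.out.printf(Locale.ROOT, "Scalar FP instructions per second: %.0f%n",
                perSecond(fp, elapsedSeconds));
        System.out.printf(Locale.ROOT, "Share of all instructions: %.4f%%%n",
                100.0 * shareOfInstructions(fp, instructions));
    }
}
```

The share works out to a few thousandths of a percent of all retired instructions, far too small to account for the difference in CPU usage.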

We also tried to experiment with JDK 11.0.12 and "-XX:UseAVX=3" in view of that patch, but it requires at least an Ice Lake CPU to test on.

BoringSSL won by a small margin.

We found the biggest difference in behavior for larger payloads.

Since this research was done, a few performance patches have been merged into the JDK (e.g. this one), so the results of this comparison may be outdated.

Thanks to Vivek Deshpande and Tyler Horth for their help with this research.
