JSSE vs BoringSSL for Java

A couple of years ago, we conducted an extensive research project comparing various implementations of the SSL stack for Java. Our primary focus was on comparing the built-in JSSE implementation in OpenJDK with Google's BoringSSL. Both implementations use a number of ASM intrinsics, and our goal was to understand how they would perform within our environment. To achieve this, we developed a specialized application with minimal business logic, resembling our regular applications and designed to stress test the SSL stack.

The first experiment was performed using OpenJDK 11.0.8, and the result was quite fascinating. BoringSSL outperformed JSSE by a large margin, saturating a 4-core CPU container at a much higher queries per second (QPS) rate.

HTTP1.1 (H1) with 4k and 128k payloads

Upon profiling, we discovered an interesting distinction: because the benchmark ran on older servers from our test pool, JSSE relied solely on a pure Java implementation, whereas BoringSSL used ASM intrinsics even on CPUs lacking AES support.
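A quick way to see which side of this distinction a given host falls on is to ask the running JVM which AES-related flags it ended up with. The sketch below is not part of the original research; it assumes a HotSpot-based JVM, and the class name `CryptoFlagCheck` and the helper `flagValue` are made up for illustration. Depending on the JDK version, some of these flags are diagnostic and may be reported as unavailable unless diagnostic options are unlocked.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the AES-related HotSpot flags of the current JVM.
final class CryptoFlagCheck {

    /** Returns the flag value as a string, or "unavailable" if the JVM does not expose it. */
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown or locked flag, or a non-HotSpot JVM.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        String[] flags = {
            "UseAES", "UseAESIntrinsics", "UseAESCTRIntrinsics", "UseGHASHIntrinsics"
        };
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

On a CPU without AES instructions, HotSpot would be expected to report `UseAES` as `false`, which matches the pure Java fallback described above.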

For our second experiment, we replaced the test servers with ones that supported AES. Despite this change, BoringSSL continued to demonstrate slightly better performance with lower CPU usage.
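To look at the effect of AES hardware support in isolation, a crude cipher-only loop can be run on both kinds of hosts, or on one host with and without the AES flags enabled. This is only a sketch and not the benchmark application used in the research: it exercises the JDK cipher directly and skips handshakes, the TLS record layer, and networking. The choice of AES-128-GCM is an assumption (the cipher suites used in the experiments are not stated), the class name `AesGcmThroughput` is hypothetical, and the timing loop is far less rigorous than a proper harness such as JMH. The 4k and 128k sizes mirror the payloads used in the experiments.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical micro-benchmark: rough AES-GCM encryption throughput of the JDK provider.
final class AesGcmThroughput {
    private static final int TAG_BITS = 128;
    private static final int TAG_BYTES = TAG_BITS / 8;
    private static final int IV_BYTES = 12;

    /** Encrypts one payload with a fresh random IV; the result is payload plus a 16-byte tag. */
    static byte[] encrypt(SecretKeySpec key, byte[] payload, SecureRandom random) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return cipher.doFinal(payload);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns a rough throughput figure in MiB/s for the given payload size. */
    static double mibPerSecond(int payloadBytes, int iterations) {
        SecureRandom random = new SecureRandom();
        byte[] keyBytes = new byte[16];
        random.nextBytes(keyBytes);
        SecretKeySpec key = new SecretKeySpec(keyBytes, "AES");
        byte[] payload = new byte[payloadBytes];
        random.nextBytes(payload);

        long producedBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            producedBytes += encrypt(key, payload, random).length;
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        // Subtract the authentication tags so only plaintext bytes are counted.
        long plaintextBytes = producedBytes - (long) TAG_BYTES * iterations;
        return plaintextBytes / (1024.0 * 1024.0) / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        for (int size : new int[] {4 * 1024, 128 * 1024}) {
            mibPerSecond(size, 2_000); // warm-up so the JIT can compile the hot path
            System.out.printf("%d bytes: %.1f MiB/s%n", size, mibPerSecond(size, 5_000));
        }
    }
}
```

Running it once with default settings and once with `-XX:-UseAES -XX:-UseAESIntrinsics` (which may additionally need `-XX:+UnlockDiagnosticVMOptions`, depending on the JDK version) gives a rough idea of how much the intrinsics contribute on a given machine.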

HTTP1.1 (H1) and HTTP2 (H2) with 4k and 128k payloads

Profiling showed that JSSE demands more CPU. Interestingly, it uses a number of separate intrinsics for different functions, such as:

BoringSSL intrinsics can be found here:

Perf analysis showed some differences in the behavior of these two implementations under the same level of stress:

JSSE

     531059.591354      task-clock (msec)   #  3.660 CPUs utilized          
           383,765      context-switches    #  0.723 K/sec                  
            14,445      cpu-migrations      #  0.027 K/sec                  
             5,444      page-faults         #  0.010 K/sec                  
 1,538,866,827,388      cycles              #  2.898 GHz
 1,533,535,705,304      instructions        #  1.00  insn per cycle           
   238,418,753,708      branches            #  448.949 M/sec                    
     3,528,489,566      branch-misses       #  1.48% of all branches          
   407,662,417,708      L1-dcache-loads     #  767.640 M/sec                    
    38,993,722,335      L1-dcache-load-misses # 9.57% of all L1-dcache hits
     2,668,985,697      LLC-loads           # 5.026 M/sec                    
       139,203,291      LLC-load-misses     # 5.22% of all LL-cache hits        

BoringSSL

     496268.032497      task-clock (msec)   # 3.414 CPUs utilized          
           730,270      context-switches    # 0.001 M/sec                  
            22,751      cpu-migrations      # 0.046 K/sec                  
             1,820      page-faults         # 0.004 K/sec                  
 1,417,815,098,625      cycles              # 2.857 GHz                      
 1,327,769,512,850      instructions        # 0.94  insn per cycle
   223,769,906,085      branches            # 450.905 M/sec                    
     3,603,222,973      branch-misses       # 1.61% of all branches
   400,508,230,047      L1-dcache-loads     # 807.040 M/sec                    
    36,881,390,647      L1-dcache-load-misses # 9.21% of all L1-dcache hits
     2,422,680,228      LLC-loads            # 4.882 M/sec                    
        94,921,348      LLC-load-misses      # 3.92% of all LL-cache hits             

The BoringSSL implementation uses fewer CPU cycles (about 8% fewer in this run) and is slightly more effective in its cache usage: higher load throughput and fewer misses, with an LLC miss rate of 3.92% versus 5.22% for JSSE.
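The derived figures in the perf output can be recomputed from the raw counters, which makes the comparison easier to follow. The snippet below is a small sketch that does this arithmetic with the numbers copied from the two listings above; the class name `PerfRatios` and its method names are invented for illustration.

```java
// Hypothetical helper: recomputes derived perf metrics from the raw counters quoted above.
final class PerfRatios {

    /** Instructions retired per CPU cycle. */
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    /** Misses as a percentage of loads. */
    static double missPercent(long misses, long loads) {
        return 100.0 * misses / loads;
    }

    /** How much smaller 'candidate' is than 'baseline', in percent. */
    static double reductionPercent(long baseline, long candidate) {
        return 100.0 * (baseline - candidate) / baseline;
    }

    public static void main(String[] args) {
        // Counters copied from the perf listings (JSSE first, then BoringSSL).
        long jsseCycles = 1_538_866_827_388L, jsseInstr = 1_533_535_705_304L;
        long jsseLlcLoads = 2_668_985_697L, jsseLlcMisses = 139_203_291L;
        long bsslCycles = 1_417_815_098_625L, bsslInstr = 1_327_769_512_850L;
        long bsslLlcLoads = 2_422_680_228L, bsslLlcMisses = 94_921_348L;

        System.out.printf("IPC: JSSE %.2f, BoringSSL %.2f%n",
                ipc(jsseInstr, jsseCycles), ipc(bsslInstr, bsslCycles));
        System.out.printf("LLC miss rate: JSSE %.2f%%, BoringSSL %.2f%%%n",
                missPercent(jsseLlcMisses, jsseLlcLoads),
                missPercent(bsslLlcMisses, bsslLlcLoads));
        System.out.printf("BoringSSL cycles saved: %.1f%%, instructions saved: %.1f%%%n",
                reductionPercent(jsseCycles, bsslCycles),
                reductionPercent(jsseInstr, bsslInstr));
    }
}
```

Note that BoringSSL has the lower IPC of the two, yet it still finishes the same work in fewer cycles because it retires noticeably fewer instructions overall.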

I also found another interesting anomaly when comparing the throughput of FP instructions:

JSSE

        30,070,913      fp_arith_inst_retired.scalar_single                                   
         3,688,641      fp_arith_inst_retired.scalar_double                                           

BoringSSL

        30,044,099      fp_arith_inst_retired.scalar_single                                   
         2,329,832      fp_arith_inst_retired.scalar_double                                           

So the throughput of the scalar versions of SSE instructions is lower for the BoringSSL implementation, mostly for the double-precision ones. But since the absolute numbers (over 140 s elapsed) are very low, this difference can be pretty much ignored.
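To put "very low" into perspective, the counters can be turned into a per-second rate and into a share of all retired instructions. The sketch below does that for the JSSE numbers; it treats the elapsed time as exactly 140 seconds, which is an approximation of the figure quoted above, and the class name `FpRate` is hypothetical.

```java
// Hypothetical helper: puts the FP counters into perspective.
final class FpRate {

    /** Average number of events per second over the measurement window. */
    static double perSecond(long count, double elapsedSeconds) {
        return count / elapsedSeconds;
    }

    /** FP instructions as a percentage of all retired instructions. */
    static double shareOfInstructionsPercent(long fpCount, long totalInstructions) {
        return 100.0 * fpCount / totalInstructions;
    }

    public static void main(String[] args) {
        double elapsedSeconds = 140.0;           // approximation of the elapsed time
        long scalarSingle = 30_070_913L;         // JSSE, fp_arith_inst_retired.scalar_single
        long scalarDouble = 3_688_641L;          // JSSE, fp_arith_inst_retired.scalar_double
        long instructions = 1_533_535_705_304L;  // JSSE, total instructions retired

        System.out.printf("scalar_double: %.0f per second%n",
                perSecond(scalarDouble, elapsedSeconds));
        System.out.printf("all scalar FP: %.4f%% of instructions%n",
                shareOfInstructionsPercent(scalarSingle + scalarDouble, instructions));
    }
}
```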

We also tried to experiment with JDK 11.0.12 and "-XX:UseAVX=3" in light of that patch, but it requires at least an Ice Lake CPU to test on.

BoringSSL won by a small margin.

Charts: QPS and CPU usage

We found the biggest difference in behavior for larger payloads.

Charts: QPS and CPU usage for larger payloads

Since this research was conducted, a few performance patches have been merged into the JDK code (for example, this one), so the results of this comparison could be outdated.

Thanks to Vivek Deshpande and Tyler Horth for their help with this research.

