How to overflow an integer in a jiffy.

How to overflow an integer in a jiffy.

In the annals of scientific measurement, the concept of a "jiffy" stands as a testament to the rapidity of light. Coined by Gilbert Newton Lewis in the late 18th or early 19th century, a jiffy denotes the astonishingly brief interval it takes for light to traverse a mere centimeter in a vacuum - a mere 33.4 picoseconds, to be precise. In the realm of computing, however, a jiffy can encompass a far more complex and consequential temporal dimension.

Just a few years ago, a development team encountered a problem: recurring Java Virtual Machine (JVM) crashes were causing application availability. JVM's hs_err files looked like this:

#  SIGSEGV (0xb) at pc=0x00007f62416c862c, pid=4786, tid=4797
#
# JRE version: OpenJDK Runtime Environment Microsoft (11.0.8+10) (build 11.0.8+10-20200812)
# Problematic frame:
# V  [libjvm.so+0xbf762c]  compute_tree_cost(SwitchRange*, SwitchRange*, float)+0x21c
...
Current CompileTask:
C2:309668189 123722 %     4       ...SpecificDeserializer_4220230149122134420_6736619255424246870::deserializeGenericFilterTemplate0 @ 1441 (1604 bytes)
...
RDX=0x00007f25eb42e000 is an unknown value
...
Deoptimization events (20 events):
Event: 309668.065 Thread 0x00007f623d015800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623cf8b800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623d0a6000 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623cf9d800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623d020000 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/        

It was crashing in Fast Avro deserializer. There was a bug reported in Oracle tracker, but it was resolved without a fix. My initial recommendation for development team was to upgrade to JDK v11.0.12.

Inspecting JVM binary with objdump:

objdump -d libjvm.so | less        

showed this:

bf762c:       f3 0f 58 52 10          addss  0x10(%rdx),%xmm2        

RDX register contains unknown value (from crash report). We have no debug symbols to map ASM code back to C++. At this point I had to ask MSFT Java group for help. Bernhard Urban-Forster helped to restore the sequence of events:

  • Code for SpecificDeserializer was compiled, however history of deoptimization events points that it was recently deoptimized multiple times
  • % sign in crash report means that JIT was compiling code intended for On-Stack-Replacement (OSR) when SIGSEGV happened

We acquired the source code of the generated Avro deserializer and interesting portion of it looked like this:

int unionIndex11 = (decoder.readIndex());
        switch (unionIndex11) {
            case  0 :
                decoder.readNull();
                break;
            case  1 :
            {
                List<Utf8> languagesOption0 = null;
...
			while (chunkLen14 > 0) {
                    for (int counter14 = 0; (counter14 <chunkLen14); counter14 ++) {        

Crash report points that this switch construct was frequently executed and behavior was constantly changing thus triggering deoptimizations events resulting in recompilation followed by OSR replacement due to tight loop inside this case.

ASM instruction which triggered the crash indicates that this is a SwitchRange._cnt access; cnt is the only floating point var and SwitchRange has 4x ints (=> offset 0x10) before cnt, and there is only one access to _cnt: https://github.com/lewurm/jdk11u/blob/6c31ac2acdc2b2efa63fe92de8368ab964d847e9/src/hotspot/share/opto/parse2.cpp#L668

The Real Root Cause

Root cause was reproduced and identified by as integer overflow in “void Parse::do_lookupswitch” function, which was fixed in a separate independent patch:

https://github.com/openjdk/jdk11u-dev/commit/1fb47a75e05df4873e7277ec39bda39cb336d1d0

and merged to v11.0.10.

You need a negative count to reproduce the issue. This only can happen if the branch overflows, since it's a signed integer. So we’ll need to enter a switch branch 2^31 times while the JVM is profiling the method. That's the hard part, because usually a compilation happens first, and as soon as the compiled method is ready, the JVM uses it and that comes without profiling counters. What helps is, if the method is deoptimized for this SwitchProfile stuff the counters are not reset.

Fast Avro generated code contains a huge amount of switch blocks and these methods used a huge amount of times as application constantly serializes and deserializes data. That in conjunction with low thresholds for optimizing the method caused integer overflow and crash. Overall, the chance for overflow is very low, as it requires a combination of factors like high request per second rates, changing application behavior triggering deoptimization traps, specific auto-generated deserialized code, and, may be, a tiny bit of bad luck.

In the end, we also optimized Fast Avro deserializer code generator to avoid frequent deoptimization traps.



Félix GV

Planet-Scale Systems

10 个月

Good times. So you say this is fixed in the JVM now? And so we could go back to switch statements instead of if/else/else/else/else/etc ?? ?

回复

要查看或添加评论,请登录

Aliaksei Dubrouski的更多文章

  • Thresholds Maze

    Thresholds Maze

    Introduction Deoptimization storms in the JIT compiler are not uncommon, and in large organizations, they might occur…

  • The Secret Life of Caches

    The Secret Life of Caches

    It was a crisp late autumn morning in the San Francisco Bay Area, the kind that makes engineers appreciate a good cup…

    8 条评论
  • Elusive Java Exception

    Elusive Java Exception

    One day, we received an email from the development team asking for help troubleshooting a perplexing exception. The…

    2 条评论
  • Vectorized Quick Sort Part 2

    Vectorized Quick Sort Part 2

    In my previous article, I explored a vectorized Quick Sort algorithm. To simplify things, I used a regular scalar sort…

  • Vectorized Quick Sort In JDK21

    Vectorized Quick Sort In JDK21

    This article explores the potential of the Vector API, introduced in JDK 21, to accelerate the classic QuickSort…

  • Pitfalls Of Code Generation

    Pitfalls Of Code Generation

    Fast Avro framework is the fastest serialization framework available for Java (at least in terms of deserialization…

    8 条评论
  • Diagnosing Performance After Linux Kernel Upgrade

    Diagnosing Performance After Linux Kernel Upgrade

    Development team responsible for large cache-like application reported significant performance regression after…

    2 条评论
  • JSSE vs BoringSSL for Java

    JSSE vs BoringSSL for Java

    A couple of years ago, we conducted an extensive research project comparing various implementations of the SSL stack…

  • Hunting Down Elusive Memory Issues in a Java Applications

    Hunting Down Elusive Memory Issues in a Java Applications

    Sometimes, figuring out what's causing a problem can feel like solving a tough puzzle. I encountered few issues that…

    1 条评论
  • Digging Inside the JVM

    Digging Inside the JVM

    Building upon the insights from our previous discussion, let's dig deeper into the techniques employed by the JIT…

    2 条评论

社区洞察

其他会员也浏览了