How to overflow an integer in a jiffy.
In the annals of scientific measurement, the concept of a "jiffy" stands as a testament to the rapidity of light. Coined by Gilbert Newton Lewis in the late 18th or early 19th century, a jiffy denotes the astonishingly brief interval it takes for light to traverse a mere centimeter in a vacuum - a mere 33.4 picoseconds, to be precise. In the realm of computing, however, a jiffy can encompass a far more complex and consequential temporal dimension.
Just a few years ago, a development team encountered a problem: recurring Java Virtual Machine (JVM) crashes were causing application availability. JVM's hs_err files looked like this:
# SIGSEGV (0xb) at pc=0x00007f62416c862c, pid=4786, tid=4797
#
# JRE version: OpenJDK Runtime Environment Microsoft (11.0.8+10) (build 11.0.8+10-20200812)
# Problematic frame:
# V [libjvm.so+0xbf762c] compute_tree_cost(SwitchRange*, SwitchRange*, float)+0x21c
...
Current CompileTask:
C2:309668189 123722 % 4 ...SpecificDeserializer_4220230149122134420_6736619255424246870::deserializeGenericFilterTemplate0 @ 1441 (1604 bytes)
...
RDX=0x00007f25eb42e000 is an unknown value
...
Deoptimization events (20 events):
Event: 309668.065 Thread 0x00007f623d015800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623cf8b800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623d0a6000 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623cf9d800 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
Event: 309668.065 Thread 0x00007f623d020000 Uncommon trap: reason=unstable_if action=none pc=0x00007f6230b68790 method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
method=SpecificDeserializer_4220230149122134420_6736619255424246870.deserializeGenericFilterTemplate0(Ljava/
It was crashing in Fast Avro deserializer. There was a bug reported in Oracle tracker, but it was resolved without a fix. My initial recommendation for development team was to upgrade to JDK v11.0.12.
Inspecting JVM binary with objdump:
objdump -d libjvm.so | less
showed this:
bf762c: f3 0f 58 52 10 addss 0x10(%rdx),%xmm2
RDX register contains unknown value (from crash report). We have no debug symbols to map ASM code back to C++. At this point I had to ask MSFT Java group for help. Bernhard Urban-Forster helped to restore the sequence of events:
We acquired the source code of the generated Avro deserializer and interesting portion of it looked like this:
int unionIndex11 = (decoder.readIndex());
switch (unionIndex11) {
case 0 :
decoder.readNull();
break;
case 1 :
{
List<Utf8> languagesOption0 = null;
...
while (chunkLen14 > 0) {
for (int counter14 = 0; (counter14 <chunkLen14); counter14 ++) {
Crash report points that this switch construct was frequently executed and behavior was constantly changing thus triggering deoptimizations events resulting in recompilation followed by OSR replacement due to tight loop inside this case.
ASM instruction which triggered the crash indicates that this is a SwitchRange._cnt access; cnt is the only floating point var and SwitchRange has 4x ints (=> offset 0x10) before cnt, and there is only one access to _cnt: https://github.com/lewurm/jdk11u/blob/6c31ac2acdc2b2efa63fe92de8368ab964d847e9/src/hotspot/share/opto/parse2.cpp#L668
The Real Root Cause
Root cause was reproduced and identified by as integer overflow in “void Parse::do_lookupswitch” function, which was fixed in a separate independent patch:
and merged to v11.0.10.
You need a negative count to reproduce the issue. This only can happen if the branch overflows, since it's a signed integer. So we’ll need to enter a switch branch 2^31 times while the JVM is profiling the method. That's the hard part, because usually a compilation happens first, and as soon as the compiled method is ready, the JVM uses it and that comes without profiling counters. What helps is, if the method is deoptimized for this SwitchProfile stuff the counters are not reset.
Fast Avro generated code contains a huge amount of switch blocks and these methods used a huge amount of times as application constantly serializes and deserializes data. That in conjunction with low thresholds for optimizing the method caused integer overflow and crash. Overall, the chance for overflow is very low, as it requires a combination of factors like high request per second rates, changing application behavior triggering deoptimization traps, specific auto-generated deserialized code, and, may be, a tiny bit of bad luck.
In the end, we also optimized Fast Avro deserializer code generator to avoid frequent deoptimization traps.
Planet-Scale Systems
10 个月Good times. So you say this is fixed in the JVM now? And so we could go back to switch statements instead of if/else/else/else/else/etc ?? ?