Pitfalls Of Code Generation
Apache Avro

The Fast Avro framework is the fastest serialization framework available for Java (at least in terms of deserialization speed). Originally developed by RTBHouse (avro-fastserde), it also produces one of the smallest message payloads, second only to Kryo. LinkedIn engineers further enhanced Fast Avro by introducing limited compatibility with various vanilla Avro versions and by implementing several memory allocation optimizations (some of the details were presented in my QCon talk on optimizing Venice DB performance).

The secret to Fast Avro's deserialization speed lies in its ability to dynamically generate and compile specialized SerDe classes for each unique Avro schema encountered. These classes are tailored to the specific data structures defined in the schema, leading to performance gains in several ways:

  • SerDe classes directly leverage primitive Java data types (int, long, String, etc.) for efficient data reading and writing, eliminating unnecessary conversions or object allocations.
  • For complex schema elements like arrays or maps, SerDe classes might utilize optimized Java collection classes or even custom-built data structures specifically designed for fast serialization and deserialization.
  • Memory access patterns are also optimized. If the schema involves fixed-size data structures, SerDe classes might employ efficient VarHandles and memory buffers for bulk data access.
  • Frequently used code snippets within the generated serialization/deserialization routines are inlined, reducing method call overhead and boosting performance.
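As a rough sketch of the first point: a schema-specialized deserializer can read each field with a direct, monomorphic primitive call and no boxing, in contrast to a generic reader that walks the schema and boxes values field by field. The record shape, wire layout, and class names below are hypothetical and purely illustrative; they are not actual Fast Avro output:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type standing in for a schema with three fields.
record User(int id, long visits, String name) {}

public class SpecializedDeserializer {
    // A generic reader would dispatch on schema types and box values into
    // Object slots; a schema-specialized reader like this one knows the
    // field order at codegen time and reads primitives directly.
    static User deserialize(ByteBuffer buf) {
        int id = buf.getInt();        // primitive read, no Integer allocation
        long visits = buf.getLong();  // primitive read, no Long allocation
        int len = buf.getInt();
        byte[] utf8 = new byte[len];
        buf.get(utf8);
        return new User(id, visits, new String(utf8, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(42).putLong(7L);
        byte[] name = "alice".getBytes(StandardCharsets.UTF_8);
        buf.putInt(name.length).put(name).flip();
        User u = deserialize(buf);
        System.out.println(u.id() + " " + u.visits() + " " + u.name());
        // prints: 42 7 alice
    }
}
```

Because every call site here sees exactly one concrete type, the JIT can inline the reads aggressively, which is much harder for a reader that dispatches through a schema object.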

While these optimizations sound impressive on paper, significant effort was invested behind the scenes to develop and integrate these techniques effectively. Even with these advancements, challenges can still arise.

Today I would like to describe one interesting performance issue we faced while using Fast Avro. Since SerDe classes are generated on the fly, they are compiled into bytecode with the javac compiler and then further optimized by the Just-In-Time (JIT) compiler. Although most Java developers never have to dive into the specifics of this process, it hides some interesting pitfalls.

The JIT compiler's primary unit of compilation is the method, and there is a strict limit on the bytecode size of a method it will compile (8000 bytes by default). This size threshold is established for several reasons:

  • Typically, only a small portion of a large method is frequently executed (the "hot path"). Even if the overall method is large, only a subset of lines will be critically performance-sensitive.
  • Smaller methods are generally easier for the JIT compiler to optimize effectively.
  • Splitting large methods into smaller ones can improve instruction cache hit rates, further enhancing performance.

These factors can create a challenge for Fast Avro. If a particularly complex Avro schema necessitates a SerDe class with extensive methods, those methods might exceed the JIT compilation size limit. A method over the limit is never JIT-compiled at all: it keeps running in the interpreter, wiping out the expected performance gains.

Following a recent deployment with a new Avro schema version, one of our development teams encountered a surge in CPU usage and latency. Profiling revealed the culprit: the JVM was spending a significant amount of time in Interpreted-to-Compiled (I2C) and Compiled-to-Interpreted (C2I) adapters.

I2C/C2I adapters

These adapters, generated by the JVM, act as bridges between compiled and interpreted code. Their presence on a frequently executed code path (hot path) is a red flag, as interpreted code is significantly slower than compiled code.

Fast Avro dumps the generated bytecode to disk. This allowed me to take a closer look using javap:

javap -c GenericDeserializer_2322729746181669945_207929154085253998.class | less        

The culprit was the deserialize0 method within the generated GenericDeserializer class, whose bytecode length exceeded 8000 bytes:

public class GenericDeserializer_2322729746181669945_207929154085253998
  public org.apache.avro.generic.IndexedRecord deserialize0(java.lang.Object, org.apache.avro.io.Decoder) throws java.io.IOException;
    Code:
       0: aload_1
....
    8043: aload_3
    8044: areturn        

(the numbers in the Code column are bytecode offsets)

In human-written code, we limit the size of the methods to improve readability and maintainability. However, auto-generated code often prioritizes logic over aesthetics, resulting in methods that mirror the complexity of the underlying schema.
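The effect is easy to see with a toy generator in the same spirit: one statement emitted per schema field, all into a single method. The method's bytecode size then grows linearly with the field count, and a few hundred fields are enough to cross an 8000-byte threshold. This is purely illustrative and is not Fast Avro's actual code generator:

```java
public class ToyCodegen {
    // Emits a flat deserialize0 body with one read per field, mimicking the
    // shape (not the content) of schema-driven generated code. Each field adds
    // a fixed-size chunk of source, and hence of bytecode, so method size
    // scales linearly with the number of schema fields.
    static String generateDeserializeBody(int fieldCount) {
        StringBuilder sb = new StringBuilder("IndexedRecord deserialize0(Decoder in) {\n");
        for (int i = 0; i < fieldCount; i++) {
            sb.append("  record.put(").append(i).append(", in.readInt());\n");
        }
        return sb.append("  return record;\n}\n").toString();
    }

    public static void main(String[] args) {
        // A 3-field schema yields a tiny method; a 500-field schema yields
        // one with 500 field statements in a single compilation unit.
        System.out.println(generateDeserializeBody(3));
    }
}
```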

There are two ways to solve this issue:

  1. Disabling the Huge Method Threshold (Risky):

The -XX:-DontCompileHugeMethods flag instructs the JIT compiler to ignore its size threshold. While this might remove the immediate bottleneck, the flag is global: every oversized method in the JVM becomes eligible for compilation, which can increase compilation time and code cache usage and lead to other unforeseen consequences.

  2. Optimizing Code Generation with Fast Avro:

Fast Avro offers a configuration option, fast.avro.field.limit.per.method, that allows setting a limit on the number of fields processed within a single method. This effectively splits the large deserialization method into smaller, more manageable chunks, ensuring the JIT compiler can optimize each one effectively.
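Conceptually, applying such a limit turns the one oversized method into a chain of small helpers, each comfortably below the threshold. The sketch below is hand-written to show the shape of the result, using a fake int[] wire format; it is not actual Fast Avro output:

```java
public class SplitStyleDeserializer {
    // Instead of one deserialize0 containing reads for all N fields, the work
    // is partitioned into helpers of a few fields each. Every helper stays far
    // under the 8000-byte bytecode limit, so the JIT can compile each one, and
    // hot helpers remain candidates for inlining.
    static int[] deserialize0(int[] wire) {
        int[] record = new int[wire.length];
        deserializeFields0To1(wire, record);
        deserializeFields2To3(wire, record);
        return record;
    }

    private static void deserializeFields0To1(int[] wire, int[] record) {
        record[0] = wire[0];
        record[1] = wire[1];
    }

    private static void deserializeFields2To3(int[] wire, int[] record) {
        record[2] = wire[2];
        record[3] = wire[3];
    }

    public static void main(String[] args) {
        int[] out = deserialize0(new int[]{10, 20, 30, 40});
        System.out.println(java.util.Arrays.toString(out));
        // prints: [10, 20, 30, 40]
    }
}
```

The behavior is identical to the monolithic version; only the method boundaries move, which is exactly why this fix is safer than disabling the JIT's threshold globally.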

Thanks to Gaojie Liu for your help.

