Pitfalls Of Code Generation
Apache Avro

The Fast Avro framework is the fastest serialization framework available for Java (at least in terms of deserialization speed). Originally developed by RTBHouse (avro-fastserde), it also produces one of the smallest message payloads, second only to Kryo. LinkedIn engineers further enhanced Fast Avro by introducing limited compatibility with various vanilla Avro versions and by implementing several memory allocation optimizations (some of the details were presented in my QCon talk on optimizing Venice DB performance).

The secret to Fast Avro's deserialization speed lies in its ability to dynamically generate and compile specialized SerDe classes for each unique Avro schema encountered. These classes are tailored to the specific data structures defined in the schema, leading to performance gains in several ways:

  • SerDe classes directly leverage primitive Java data types (int, long, String, etc.) for efficient data reading and writing, eliminating unnecessary conversions or object allocations.
  • For complex schema elements like arrays or maps, SerDe classes might utilize optimized Java collection classes or even custom-built data structures specifically designed for fast serialization and deserialization.
  • Memory access patterns are also optimized. If the schema involves fixed-size data structures, SerDe classes might employ efficient VarHandles and memory buffers for bulk data access.
  • Frequently used code snippets within the generated serialization/deserialization routines are inlined, reducing method call overhead and boosting performance.
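As a rough sketch of the first point: a schema-specialized deserializer can read each field with a direct, monomorphic primitive call and no boxing, in contrast to a generic reader that walks the schema and boxes values field by field. The record shape, wire layout, and class names below are hypothetical and purely illustrative; they are not actual Fast Avro output:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type standing in for a schema with three fields.
record User(int id, long visits, String name) {}

public class SpecializedDeserializer {
    // A generic reader would dispatch on schema types and box values into
    // Object slots; a schema-specialized reader like this one knows the
    // field order at codegen time and reads primitives directly.
    static User deserialize(ByteBuffer buf) {
        int id = buf.getInt();        // primitive read, no Integer allocation
        long visits = buf.getLong();  // primitive read, no Long allocation
        int len = buf.getInt();
        byte[] utf8 = new byte[len];
        buf.get(utf8);
        return new User(id, visits, new String(utf8, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(42).putLong(7L);
        byte[] name = "alice".getBytes(StandardCharsets.UTF_8);
        buf.putInt(name.length).put(name).flip();
        User u = deserialize(buf);
        System.out.println(u.id() + " " + u.visits() + " " + u.name());
        // prints: 42 7 alice
    }
}
```

Because every call site here sees exactly one concrete type, the JIT can inline the reads aggressively, which is much harder for a reader that dispatches through a schema object.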

While these optimizations sound impressive on paper, significant effort was invested behind the scenes to develop and integrate these techniques effectively. Even with these advancements, challenges can still arise.

Today I would like to describe one interesting performance issue we faced while using Fast Avro. Since SerDe classes are generated on the fly, they are compiled into bytecode with the javac compiler and then further optimized by the Just-In-Time (JIT) compiler. Although most Java developers never have to dive into the specifics of this process, it hides some interesting pitfalls.

The JIT compiler's primary unit of compilation is the method, and there is a strict limit on the bytecode size of a method it will compile (8000 bytes by default). This size threshold is established for several reasons:

  • Typically, only a small portion of a large method is frequently executed (the "hot path"). Even if the overall method is large, only a subset of lines will be critically performance-sensitive.
  • Smaller methods are generally easier for the JIT compiler to optimize effectively.
  • Splitting large methods into smaller ones can improve instruction cache hit rates, further enhancing performance.

These factors can create a challenge for Fast Avro. If a particularly complex Avro schema necessitates a SerDe class with extensive methods, those methods might exceed the JIT compilation size limit. A method over the limit is never JIT-compiled at all: it keeps running in the interpreter, wiping out the expected performance gains.

Following a recent deployment with a new Avro schema version, one of our development teams encountered a surge in CPU usage and latency. Profiling revealed the culprit: the JVM was spending a significant amount of time in Interpreted-to-Compiled (I2C) and Compiled-to-Interpreted (C2I) adapters.

I2C/C2I adapters

These adapters, generated by the JVM, act as bridges between compiled and interpreted code. Their presence on a frequently executed code path (hot path) is a red flag, as interpreted code is significantly slower than compiled code.

Fast Avro dumps the generated bytecode to disk. This allowed me to take a closer look using javap:

javap -c GenericDeserializer_2322729746181669945_207929154085253998.class | less        

The culprit was the deserialize0 method within the generated GenericDeserializer class, whose bytecode length exceeded 8000 bytes:

public class GenericDeserializer_2322729746181669945_207929154085253998
  public org.apache.avro.generic.IndexedRecord deserialize0(java.lang.Object, org.apache.avro.io.Decoder) throws java.io.IOException;
    Code:
       0: aload_1
....
    8043: aload_3
    8044: areturn        

(the numbers in the Code column are bytecode offsets)

In human-written code, we limit the size of the methods to improve readability and maintainability. However, auto-generated code often prioritizes logic over aesthetics, resulting in methods that mirror the complexity of the underlying schema.
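The effect is easy to see with a toy generator in the same spirit: one statement emitted per schema field, all into a single method. The method's bytecode size then grows linearly with the field count, and a few hundred fields are enough to cross an 8000-byte threshold. This is purely illustrative and is not Fast Avro's actual code generator:

```java
public class ToyCodegen {
    // Emits a flat deserialize0 body with one read per field, mimicking the
    // shape (not the content) of schema-driven generated code. Each field adds
    // a fixed-size chunk of source, and hence of bytecode, so method size
    // scales linearly with the number of schema fields.
    static String generateDeserializeBody(int fieldCount) {
        StringBuilder sb = new StringBuilder("IndexedRecord deserialize0(Decoder in) {\n");
        for (int i = 0; i < fieldCount; i++) {
            sb.append("  record.put(").append(i).append(", in.readInt());\n");
        }
        return sb.append("  return record;\n}\n").toString();
    }

    public static void main(String[] args) {
        // A 3-field schema yields a tiny method; a 500-field schema yields
        // one with 500 field statements in a single compilation unit.
        System.out.println(generateDeserializeBody(3));
    }
}
```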

There are two ways to solve this issue:

  1. Disabling the Huge Method Threshold (Risky):

The -XX:-DontCompileHugeMethods flag instructs the JIT compiler to ignore its size threshold. While this might remove the immediate bottleneck, the flag is global: every oversized method in the JVM becomes eligible for compilation, which can increase compilation time and code cache usage and lead to other unforeseen consequences.

  2. Optimizing Code Generation with Fast Avro:

Fast Avro offers a configuration option, fast.avro.field.limit.per.method, that allows setting a limit on the number of fields processed within a single method. This effectively splits the large deserialization method into smaller, more manageable chunks, ensuring the JIT compiler can optimize each one effectively.
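Conceptually, applying such a limit turns the one oversized method into a chain of small helpers, each comfortably below the threshold. The sketch below is hand-written to show the shape of the result, using a fake int[] wire format; it is not actual Fast Avro output:

```java
public class SplitStyleDeserializer {
    // Instead of one deserialize0 containing reads for all N fields, the work
    // is partitioned into helpers of a few fields each. Every helper stays far
    // under the 8000-byte bytecode limit, so the JIT can compile each one, and
    // hot helpers remain candidates for inlining.
    static int[] deserialize0(int[] wire) {
        int[] record = new int[wire.length];
        deserializeFields0To1(wire, record);
        deserializeFields2To3(wire, record);
        return record;
    }

    private static void deserializeFields0To1(int[] wire, int[] record) {
        record[0] = wire[0];
        record[1] = wire[1];
    }

    private static void deserializeFields2To3(int[] wire, int[] record) {
        record[2] = wire[2];
        record[3] = wire[3];
    }

    public static void main(String[] args) {
        int[] out = deserialize0(new int[]{10, 20, 30, 40});
        System.out.println(java.util.Arrays.toString(out));
        // prints: [10, 20, 30, 40]
    }
}
```

The behavior is identical to the monolithic version; only the method boundaries move, which is exactly why this fix is safer than disabling the JIT's threshold globally.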

Thanks to Gaojie Liu for your help.

