Log4J Odyssey: Unraveling the Mysteries of Code, ClassLoaders, and the Fast Lane to the Slow Path
Last year, we observed a decline in the performance of our systems in specific scenarios under heavy load. Despite the automated systems effectively preventing damage, our engineers were tasked with identifying the root cause. This investigation unfolded as one of the most extensive and captivating endeavors of the year.
Considerable time was dedicated to exploring the issue alongside Dave Messink and other engineers. Eventually, we traced the root cause to a version change in the Log4J library, specifically from 2.17.1 to versions beyond. Research revealed that the default behavior of Log4J loggers had shifted, with the "includeLocation" flag becoming true by default, resulting in significant overhead. In this context, including location meant that Log4J loggers would print stack traces with not only class names but also method names and exact line numbers, necessitating extended class information. Since all loggers were expected to include location in versions 2.17.2 and above, this was identified as the core issue.
Dave Messink submitted a patch (LOG4J2-3487) aimed at mitigating the problem, and internally, we implemented additional safeguards. To our surprise, performance bottlenecks persisted with identical symptoms. Subsequently, a workgroup was formed to delve deeper into the problem.
Troubleshooting a performance bottleneck in a real application can be challenging due to the sheer complexity of business logic and the specific conditions required to trigger the problem. In this instance, stress testing the system while logging a high number of exceptions was necessary.
To streamline the investigation process, a demo project was created, minimizing the overhead of setting up, building, and running the actual application. The demo comprised a single Java class and a Log4J configuration file. This class threw an exception and logged it using a logger similar to what we use in our production applications.
Surprisingly, every recent version of Log4J in the demo application exhibited the same problem, despite all configuration changes. The symptom was clear: every time Log4J needed to log a class, it used a slow path in the code, attempting to load a class via classloader and cache it. Older versions were able to immediately match class names and use a fast path, avoiding the expensive class loading operation.
领英推荐
The importance of avoiding class loading became evident due to the synchronized nature of the loadClass method in the classloader's architecture. Multiple threads attempting to load the same class would result in one proceeding while others waited. Additionally, loading classes could involve reading jar files from disk, making the operation exceptionally costly.
The behavior of the demo project was perplexing, given the initial belief that the root cause lay in a complex Log4J configuration and a change in the default behavior of loggers. After thorough investigation, it became evident that recent commits merged to the release branch (CH1 and CH2), coinciding with the default behavior change in version 2.17.2, were responsible. These commits aimed to enhance performance by swapping the fully synchronized Stack data structure with a non-synchronized one, ArrayDeque, and reducing memory allocation. However, the differing logic of these two data structures meant that the same methods of the Deque interface would return different data.
As a result, we consistently entered the slow path, disregarding the current thread's stack and loading classes via the classloader.
We also took a proactive step by developing a Java Microbenchmarking Harness (JMH) benchmark specifically tailored to test the previously mentioned scenario. This benchmark was designed to replicate the conditions leading to the significant drop in throughput observed in our systems under heavy load.
A pull request was submitted and merged, rectifying the logic of accessing elements in the Deque and introducing unit tests to prevent similar issues in the future.
One of the most enlightening takeaways from this experience is the importance of not expecting a single release to have only one major change that might affect performance.
Aliaksei Dubrouski, the team did excellent work tracking down the issue, focused on tru root cause, with the end result of helping all who leverage log4J. Thank you