Hunting Down Elusive Memory Issues in Java Applications
Sometimes, figuring out what's causing a problem can feel like solving a tough puzzle.
I encountered a few issues that were difficult to troubleshoot for several reasons: they were intermittent, hard to detect, and resembled race conditions. Today, I'd like to share the problems I was able to mitigate, but not entirely fix or prevent.
One of the development teams reported occasional crashes in a few instances of a Java-based application within a large cluster. These were JVM crashes, and they needed help investigating the problem.
After gathering some data, I realized the pattern was very puzzling. The JVM crashed while trying to allocate a new native memory chunk, with a message like this:
# Native memory allocation (malloc) failed to allocate 61536 bytes for Chunk::new
However, the allocation site and the amount of memory that triggered the crash differed every time. Checking container metrics made it clear that the process's RSS within the container was only around 65% of the quota. Additionally, if total memory usage had crossed the limit, the OOMKiller would most likely have killed the process. The OOMKiller uses this formula to measure memory pressure:
adj = (long)p->signal->oom_score_adj;
...
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
         mm_pgtables_bytes(p->mm) / PAGE_SIZE;
...
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
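To get an intuition for what this score means for a single process, the rough Java sketch below recomputes the same terms from procfs. It is only an approximation under stated assumptions: VmRSS, VmSwap, and VmPTE from /proc/self/status stand in for the kernel counters, a 4 KiB page size is assumed, and MemTotal + SwapTotal from /proc/meminfo approximate totalpages (inside a memory cgroup the kernel uses the cgroup limit instead); the class name is purely illustrative.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class OomScoreSketch {
    // Reads "Key:   value kB" style lines into a map of kB values,
    // skipping lines whose value is not numeric.
    static Map<String, Long> readKb(Path p) throws Exception {
        Map<String, Long> out = new HashMap<>();
        for (String line : Files.readAllLines(p)) {
            String[] parts = line.split("\\s+");
            if (parts.length >= 2 && parts[0].endsWith(":")) {
                try {
                    out.put(parts[0].substring(0, parts[0].length() - 1),
                            Long.parseLong(parts[1]));
                } catch (NumberFormatException ignored) { }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        long pageKb = 4; // assumption: 4 KiB pages
        Map<String, Long> status = readKb(Path.of("/proc/self/status"));
        Map<String, Long> meminfo = readKb(Path.of("/proc/meminfo"));

        // Same terms as the kernel formula, expressed in pages:
        // RSS + swapped-out pages + page-table pages.
        long points = (status.getOrDefault("VmRSS", 0L)
                + status.getOrDefault("VmSwap", 0L)
                + status.getOrDefault("VmPTE", 0L)) / pageKb;

        // For a global OOM, totalpages is roughly RAM + swap; inside a
        // memory cgroup the kernel uses the cgroup limit instead.
        long totalPages = (meminfo.getOrDefault("MemTotal", 0L)
                + meminfo.getOrDefault("SwapTotal", 0L)) / pageKb;

        // Normalize oom_score_adj to the same units and add it in.
        long adj = Long.parseLong(
                Files.readString(Path.of("/proc/self/oom_score_adj")).trim());
        points += adj * (totalPages / 1000);

        System.out.println("approximate oom badness points: " + points);
    }
}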
However, we were facing a JVM crash instead. Further investigation revealed that it happened only in one specific region, not across the entire cluster. This single clue helped me pinpoint a possible root cause: instances in that region ran a custom configuration and held significantly more established connections than instances elsewhere. The application generally keeps a lot of connections open, and this region had an order of magnitude more, which could plausibly lead to issues.
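For illustration only, established TCP connections can be counted from inside the container by parsing /proc/net/tcp and /proc/net/tcp6, as in the sketch below (the class name is arbitrary, the parsing assumes the standard procfs layout, and only sockets in the container's own network namespace are visible); in practice, ss -s or existing node-level metrics are usually more convenient.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class EstablishedConnections {
    // Counts sockets in the ESTABLISHED state (st column == 01).
    static long countEstablished(Path procNetTcp) throws Exception {
        List<String> lines = Files.readAllLines(procNetTcp);
        return lines.stream()
                .skip(1) // skip the header line
                .map(String::trim)
                .map(l -> l.split("\\s+"))
                .filter(p -> p.length > 3 && p[3].equals("01"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        long v4 = countEstablished(Path.of("/proc/net/tcp"));
        long v6 = countEstablished(Path.of("/proc/net/tcp6"));
        System.out.println("established TCP connections: " + (v4 + v6));
    }
}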
The actual network packet data is stored in kernel memory in sk_buff structures. When the kernel delivers data to the application, it copies it into user-space socket buffers allocated by the application. The kernel-side buffers are not counted towards the process's RSS, which might explain why the OOMKiller did not kill the process.
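One way to observe this kernel-side memory, which never shows up in the process's RSS, is /proc/net/sockstat, which reports TCP socket buffer usage in pages. The minimal sketch below prints that figure; the class name is arbitrary and a 4 KiB page size is assumed.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SocketMemCheck {
    public static void main(String[] args) throws Exception {
        // /proc/net/sockstat reports kernel-side socket memory that does not
        // appear in the process's RSS. The "mem" field is in pages.
        List<String> lines = Files.readAllLines(Path.of("/proc/net/sockstat"));
        long pageSize = 4096; // assumption: 4 KiB pages
        for (String line : lines) {
            if (line.startsWith("TCP:")) {
                // Expected format: TCP: inuse N orphan N tw N alloc N mem N
                String[] parts = line.trim().split("\\s+");
                for (int i = 1; i < parts.length - 1; i += 2) {
                    if (parts[i].equals("mem")) {
                        long pages = Long.parseLong(parts[i + 1]);
                        System.out.printf("TCP socket buffer memory: %d pages (~%d KiB)%n",
                                pages, pages * pageSize / 1024);
                    }
                }
            }
        }
    }
}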
Updating the configuration to reduce the number of established connections allowed me to mitigate the problem. However, the specific reason behind the rejected malloc calls in the Linux kernel remains unknown.
We've seen similar issues on newer Linux distributions and kernel versions. In a few very rare cases, a different application was killed by the OOMKiller even though its RSS was quite low compared to the cgroup quota, only around 70%. In that case, however, the application neither crashed nor had a large number of network connections. The only interesting correlation was the growth of the anon_file metric within the container. The JVM relies on anonymous memory in many places, such as the heap itself, thread stacks, metaspace, and the code cache.
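To see how much of a process's RSS is anonymous rather than file-backed, a minimal sketch like the one below prints the split from /proc/self/status (the RssAnon, RssFile, and RssShmem fields require kernel 4.5 or newer; the class name is arbitrary, and cgroup-level counters in memory.stat report similar numbers for the whole container).

import java.nio.file.Files;
import java.nio.file.Path;

public class AnonVsFileRss {
    public static void main(String[] args) throws Exception {
        // /proc/<pid>/status splits RSS into anonymous, file-backed, and
        // shmem parts. Watching RssAnon alongside the container-level metric
        // helps attribute anonymous memory growth.
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmRSS:") || line.startsWith("RssAnon:")
                    || line.startsWith("RssFile:") || line.startsWith("RssShmem:")) {
                System.out.println(line.trim());
            }
        }
    }
}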
There's an interesting change (commit) in recent versions of the Linux kernel. However, the JVM is not compiled with MADV_FREE (as detailed in JDK-8196820, which remains open).
The biggest challenge with troubleshooting issues like these is their intermittent nature and the inability to reproduce them in a controlled environment. This makes it an interesting detective hunt.