Hunting Down Elusive Memory Issues in Java Applications

Sometimes, figuring out what's causing a problem can feel like solving a tough puzzle.

I've encountered a few issues that were difficult to troubleshoot for several reasons: they were intermittent, hard to detect, and resembled race conditions. Today, I'd like to share problems I was able to mitigate, but not entirely fix or prevent.

One of the development teams reported occasional crashes in a few instances of a Java-based application within a large cluster. These were JVM crashes, and they needed help investigating the problem.

After gathering some data, I realized the pattern was very puzzling. The JVM crashed while trying to allocate a new native memory chunk, with a message like this:

# Native memory allocation (malloc) failed to allocate 61536 bytes for Chunk::new        

However, the method and the amount of memory being allocated differed from crash to crash. Container metrics showed that the process's RSS was only around 65% of the cgroup quota. Moreover, if total memory usage had actually crossed the limit, the OOMKiller would likely have killed the process outright rather than letting malloc fail. The OOMKiller scores memory pressure with this formula:

/* from mm/oom_kill.c, oom_badness() */
adj = (long)p->signal->oom_score_adj;
...
/* RSS + swapped-out pages + page tables */
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
...
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
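As a quick sanity check, the same numbers the OOMKiller looks at can be read from /proc and the cgroup filesystem. The sketch below assumes cgroup v2 mounted at the default path, and uses the shell's own PID as a stand-in for the JVM's:

```shell
# Sketch: compare a process's RSS against its cgroup memory quota.
# Assumes cgroup v2 at /sys/fs/cgroup; $$ is a stand-in for the JVM's PID.
pid=$$
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
limit=$(cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "max")
echo "RSS: ${rss_kb} kB, cgroup limit: ${limit}"
```

On cgroup v1, the limit lives in memory.limit_in_bytes instead; either way, an RSS well under the quota, as we saw here, rules out the straightforward "hit the limit" explanation.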

However, we were facing a JVM crash instead. Further investigation revealed that it only happened in one specific region, not across the entire cluster. This single clue helped me pinpoint a possible root cause: instances in that region ran a custom configuration and had significantly more established connections than other regions. The application generally maintains a lot of connections, and this region had an order of magnitude more, which could plausibly cause trouble.

The actual network data packets are stored in kernel memory as sk_buff structures. When the kernel delivers data to the application, it copies it into user-space socket buffers allocated by the application. The kernel-side buffers are not counted towards the process's RSS, which might explain why the OOMKiller did not kill the process despite the real memory pressure.
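A rough way to observe this kernel-side memory on a Linux host is /proc/net/sockstat, which reports TCP buffer usage in pages, outside any process's RSS:

```shell
# Sketch: kernel-side socket buffer memory is accounted globally, not in RSS.
# The "mem" value on the TCP line is in pages (typically 4 KiB each).
cat /proc/net/sockstat
tcp_pages=$(awk '/^TCP:/ {for (i = 1; i <= NF; i++) if ($i == "mem") print $(i + 1)}' /proc/net/sockstat)
echo "TCP socket buffer memory: ${tcp_pages} pages"
```

For a per-socket view, `ss -tm` prints skmem counters (receive/send buffer allocations) for each established connection.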

Updating the configuration to reduce the number of established connections allowed me to mitigate the problem. However, the specific reason behind the rejected malloc calls in the Linux kernel remains unknown.

We've seen similar issues on newer Linux distributions and kernel versions. In a few very rare cases, a different application was killed by the OOMKiller while its RSS was quite low relative to the cgroup quota, only around 70%. In this case, however, the application did not crash and did not have a large number of network connections. The only interesting correlation was the growth of the anon_file metric within the container. Java allocates anonymous memory for several purposes, such as:

  • Heap (but the heap size was static and limited)
  • JIT compilation (profiling showed some compilation happening, but barely enough to explain the growth of anon_file)
  • Thread stacks (but the number of active threads was stable)
  • Class structures (but the application was fully warmed up and all classes were already loaded)
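One way to attribute this kind of anonymous-memory growth is the JVM's Native Memory Tracking, which breaks usage down by exactly the categories listed above. The flag and jcmd subcommands below are standard JDK tooling; the PID is a placeholder:

```shell
# Sketch: JVM Native Memory Tracking attributes native memory by category
# (Java Heap, Thread, Code/JIT, Class, etc.). Enable at startup:
JAVA_OPTS="-XX:NativeMemoryTracking=summary"
# Then, against the running process (replace <pid>):
#   jcmd <pid> VM.native_memory baseline      # snapshot current usage
#   jcmd <pid> VM.native_memory summary.diff  # per-category growth since baseline
echo "$JAVA_OPTS"
```

The baseline/diff pair is particularly useful for intermittent growth like this, since it isolates which category moved between two points in time.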

There's an interesting change (commit) in recent versions of the Linux kernel. However, the JVM does not use MADV_FREE (as detailed in JDK-8196820, which remains open).
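For completeness, if MADV_FREE accounting were the suspect, one hedged check on a live process is the LazyFree counter in smaps_rollup, which reports pages marked free-on-demand. The PID below is a stand-in:

```shell
# Sketch: look for lazily-freed (MADV_FREE) pages in a process's mappings.
# LazyFree appears in smaps_rollup on kernels >= 4.14; $$ stands in for a real PID.
pid=$$
out=$(grep -E '^(Rss|LazyFree):' "/proc/$pid/smaps_rollup" 2>/dev/null || echo "LazyFree not reported")
echo "$out"
```

Since the JVM doesn't issue MADV_FREE, one would expect LazyFree to stay at zero for these processes, which is consistent with the accounting question remaining open.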

The biggest challenge with troubleshooting issues like these is their intermittent nature and the inability to reproduce them in a controlled environment. This makes it an interesting detective hunt.


