Hunting Down Elusive Memory Issues in Java Applications
Sometimes, figuring out what's causing a problem can feel like solving a tough puzzle.
I encountered a few issues that were difficult to troubleshoot for several reasons: they were intermittent, hard to detect, and resembled race conditions. Today, I'd like to share the problems I was able to mitigate, but not entirely fix or prevent.
One of the development teams reported occasional crashes in a few instances of a Java-based application within a large cluster. These were JVM crashes, and they needed help investigating the problem.
After gathering some data, I realized the pattern was very puzzling. The JVM crashed while trying to allocate a new native memory chunk, with a message like this:
# Native memory allocation (malloc) failed to allocate 61536 bytes for Chunk::new
However, the allocation site and the amount of memory that triggered the crash differed every time. Checking container metrics made it clear that the process's RSS within the container was only around 65% of the quota. Additionally, if total memory usage had crossed the limit, the OOMKiller would most likely have killed the process. The OOMKiller uses this formula to measure memory pressure:
adj = (long)p->signal->oom_score_adj;
...
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
         mm_pgtables_bytes(p->mm) / PAGE_SIZE;
...
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
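To get an intuition for what this score means for a single process, the rough Java sketch below recomputes the same terms from procfs. It is only an approximation under stated assumptions: VmRSS, VmSwap, and VmPTE from /proc/self/status stand in for the kernel counters, a 4 KiB page size is assumed, and MemTotal + SwapTotal from /proc/meminfo approximate totalpages (inside a memory cgroup the kernel uses the cgroup limit instead); the class name is purely illustrative.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class OomScoreSketch {
    // Reads "Key:   value kB" style lines into a map of kB values,
    // skipping lines whose value is not numeric.
    static Map<String, Long> readKb(Path p) throws Exception {
        Map<String, Long> out = new HashMap<>();
        for (String line : Files.readAllLines(p)) {
            String[] parts = line.split("\\s+");
            if (parts.length >= 2 && parts[0].endsWith(":")) {
                try {
                    out.put(parts[0].substring(0, parts[0].length() - 1),
                            Long.parseLong(parts[1]));
                } catch (NumberFormatException ignored) { }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        long pageKb = 4; // assumption: 4 KiB pages
        Map<String, Long> status = readKb(Path.of("/proc/self/status"));
        Map<String, Long> meminfo = readKb(Path.of("/proc/meminfo"));

        // Same terms as the kernel formula, expressed in pages:
        // RSS + swapped-out pages + page-table pages.
        long points = (status.getOrDefault("VmRSS", 0L)
                + status.getOrDefault("VmSwap", 0L)
                + status.getOrDefault("VmPTE", 0L)) / pageKb;

        // For a global OOM, totalpages is roughly RAM + swap; inside a
        // memory cgroup the kernel uses the cgroup limit instead.
        long totalPages = (meminfo.getOrDefault("MemTotal", 0L)
                + meminfo.getOrDefault("SwapTotal", 0L)) / pageKb;

        // Normalize oom_score_adj to the same units and add it in.
        long adj = Long.parseLong(
                Files.readString(Path.of("/proc/self/oom_score_adj")).trim());
        points += adj * (totalPages / 1000);

        System.out.println("approximate oom badness points: " + points);
    }
}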
However, we were facing a JVM crash instead. Further investigation revealed that it happened only in one specific region, not across the entire cluster. This single clue helped me pinpoint a possible root cause: instances in that region ran a custom configuration and held significantly more established connections than instances elsewhere. The application generally keeps a lot of connections open, and this region had an order of magnitude more, which could plausibly lead to issues.
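For illustration only, established TCP connections can be counted from inside the container by parsing /proc/net/tcp and /proc/net/tcp6, as in the sketch below (the class name is arbitrary, the parsing assumes the standard procfs layout, and only sockets in the container's own network namespace are visible); in practice, ss -s or existing node-level metrics are usually more convenient.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class EstablishedConnections {
    // Counts sockets in the ESTABLISHED state (st column == 01).
    static long countEstablished(Path procNetTcp) throws Exception {
        List<String> lines = Files.readAllLines(procNetTcp);
        return lines.stream()
                .skip(1) // skip the header line
                .map(String::trim)
                .map(l -> l.split("\\s+"))
                .filter(p -> p.length > 3 && p[3].equals("01"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        long v4 = countEstablished(Path.of("/proc/net/tcp"));
        long v6 = countEstablished(Path.of("/proc/net/tcp6"));
        System.out.println("established TCP connections: " + (v4 + v6));
    }
}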
The actual network packet data is stored in kernel memory in sk_buff structures. When the kernel delivers data to the application, it copies it into user-space socket buffers allocated by the application. The kernel-side buffers are not counted towards the process's RSS, which might explain why the OOMKiller did not kill the process.
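One way to observe this kernel-side memory, which never shows up in the process's RSS, is /proc/net/sockstat, which reports TCP socket buffer usage in pages. The minimal sketch below prints that figure; the class name is arbitrary and a 4 KiB page size is assumed.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SocketMemCheck {
    public static void main(String[] args) throws Exception {
        // /proc/net/sockstat reports kernel-side socket memory that does not
        // appear in the process's RSS. The "mem" field is in pages.
        List<String> lines = Files.readAllLines(Path.of("/proc/net/sockstat"));
        long pageSize = 4096; // assumption: 4 KiB pages
        for (String line : lines) {
            if (line.startsWith("TCP:")) {
                // Expected format: TCP: inuse N orphan N tw N alloc N mem N
                String[] parts = line.trim().split("\\s+");
                for (int i = 1; i < parts.length - 1; i += 2) {
                    if (parts[i].equals("mem")) {
                        long pages = Long.parseLong(parts[i + 1]);
                        System.out.printf("TCP socket buffer memory: %d pages (~%d KiB)%n",
                                pages, pages * pageSize / 1024);
                    }
                }
            }
        }
    }
}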
Updating the configuration to reduce the number of established connections allowed me to mitigate the problem. However, the specific reason behind the rejected malloc calls in the Linux kernel remains unknown.
We've seen similar issues on newer Linux distributions and kernel versions. In a few very rare cases, a different application was killed by the OOMKiller even though its RSS was quite low compared to the cgroup quota, only around 70%. In that case, however, the application neither crashed nor had a large number of network connections. The only interesting correlation was the growth of the anon_file metric within the container. The JVM relies on anonymous memory in many places, such as the heap itself, thread stacks, metaspace, and the code cache.
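To see how much of a process's RSS is anonymous rather than file-backed, a minimal sketch like the one below prints the split from /proc/self/status (the RssAnon, RssFile, and RssShmem fields require kernel 4.5 or newer; the class name is arbitrary, and cgroup-level counters in memory.stat report similar numbers for the whole container).

import java.nio.file.Files;
import java.nio.file.Path;

public class AnonVsFileRss {
    public static void main(String[] args) throws Exception {
        // /proc/<pid>/status splits RSS into anonymous, file-backed, and
        // shmem parts. Watching RssAnon alongside the container-level metric
        // helps attribute anonymous memory growth.
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmRSS:") || line.startsWith("RssAnon:")
                    || line.startsWith("RssFile:") || line.startsWith("RssShmem:")) {
                System.out.println(line.trim());
            }
        }
    }
}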
There's an interesting change (commit) in recent versions of the Linux kernel. However, the JVM is not compiled with MADV_FREE (as detailed in JDK-8196820, which remains open).
The biggest challenge with troubleshooting issues like these is their intermittent nature and the inability to reproduce them in a controlled environment. This makes it an interesting detective hunt.