9 Micro-Metrics That Forecast Production Outages in Performance Labs
In enterprise-grade applications, performance tests are conducted before a new release goes to production. As part of these tests, the Performance QA team studies various metrics to make sure there is no degradation in application performance. They primarily study metrics such as CPU utilization, memory utilization, and response time.
I would like to categorize these metrics as macro metrics. Macro metrics are great; however, they have a couple of shortcomings:
a. Performance Problems Are Not Caught
Performance problems that we see in production do happen in our performance labs; however, they happen at an acute (much smaller) scale and are not big enough to tip the threshold of macro metrics. Thus these acute degradations go unnoticed and manifest as major performance problems in production.
b. Doesn't Facilitate Troubleshooting
Macro metrics, to a large extent, don't help development teams debug and troubleshoot problems. A macro metric will indicate that CPU consumption is high, but there will be no indication of whether CPU consumption increased because of heavy Garbage Collection activity, a thread looping problem, or some other coding issue. A macro metric will indicate that there is a degradation in response time, but it won't indicate whether the degradation is caused by locks in the application code or a backend connectivity issue.
Micro Metrics to Monitor in Performance Labs
Macro metrics should be complemented with micro metrics to address the above-mentioned shortcomings. In this article, I have listed 9 micro metrics that will help you capture acute performance degradations in your new releases.
Let's review these micro metrics in detail in this post: why they are important and what problems they bring to your visibility.
1. GC Behavior Pattern
Garbage Collection heavily influences an application's performance. Studying the Garbage Collection behavior will help you forecast memory-related bottlenecks.
Fig: Garbage Collection Behavior of a Healthy Application
The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. Notice that when heap usage reaches ~5.8GB, a 'Full GC' event (red triangle) gets triggered. When the 'Full GC' event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. You can notice this dotted black line is running at 0°. It indicates that the application is in a healthy state and not suffering from any sort of memory problems.
Fig: Application Suffering from Acute Memory Leak in Performance Lab
Above is the garbage collection behavior of an application that is suffering from an acute memory leak. When an application exhibits this pattern, heap usage climbs slowly, eventually resulting in an OutOfMemoryError.
In the above figure, you can notice that the 'Full GC' (red triangle) event gets triggered when heap usage reaches around ~8GB. You can also observe that the amount of heap the Full GC events can recover starts to decline over time, i.e.:
a. When the first Full GC event ran, heap usage dropped to 3.9GB
b. When the second Full GC event ran, heap usage dropped only to 4.5GB
c. When the third Full GC event ran, heap usage dropped only to 5GB
d. When the final full GC event ran heap usage dropped only to 6.5GB
Please see the dotted black line in the graph, which connects all the bottom points. You can notice that the black line is climbing at roughly 15°. This indicates that the application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience an OutOfMemoryError. However, in our performance labs, we don't run the application for a long period.
When this application is released into production, you will see the behavior below:
Fig: Application Suffering from OutOfMemoryError in Production
In the above graph, towards the right side, you can notice that Full GC events are running continuously, yet memory usage doesn't drop. It's a clear indication that the application is suffering from a memory leak. By the time this pattern appears, customers have already been impacted and it's too late to catch the problem.
Thus, observing GC behavior in the performance lab will help you catch OutOfMemoryErrors early in the game.
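If you want a quick, tool-agnostic way to watch that post-Full-GC "floor" during a test, the JVM's standard MemoryPoolMXBean exposes the heap usage measured right after the most recent collection. Below is a minimal sketch (class name and sampling interval are illustrative, not from this article); if the printed floor keeps climbing from sample to sample, you are likely looking at the acute memory-leak pattern described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class PostGcHeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long usedAfterGc = 0;
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // getCollectionUsage() reports pool usage measured right after the last GC
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                    usedAfterGc += afterGc.getUsed();
                }
            }
            // If this "floor" value keeps climbing run after run, suspect a memory leak
            System.out.printf("Heap used after last GC: %d MB%n", usedAfterGc / (1024 * 1024));
            Thread.sleep(60_000); // sample once a minute during the test
        }
    }
}
```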
2. Object Creation Rate
In performance labs we measure overall memory utilization. Let's say you have configured your application's max memory to be 6GB; then you will observe peak memory utilization reach close to 6GB. Even if a developer has committed inefficient code in the new release, you will still see overall memory utilization reach only up to 6GB. Thus, measuring overall memory utilization is not as insightful as studying the Object Creation Rate micro-metric.
Object Creation Rate reports the amount of memory allocated for new objects in a unit of time, e.g., 100 MB/sec. Say in your previous release your object creation rate was 50 MB/sec, whereas in the new release it has become 100 MB/sec; this indicates that your application is creating 2x more objects to service the same workload. It clearly shows that some inefficient code has been committed in the new release. This will manifest as high CPU consumption and degraded response time in the production environment. Thus, studying Object Creation Rate will help you catch memory inefficiencies early in the game.
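Object creation rate is usually derived from GC logs, but you can also approximate it inside the JVM. The sketch below is a rough illustration, not this article's tooling: it uses HotSpot's com.sun.management.ThreadMXBean to sum per-thread allocated bytes over a 10-second window and report MB/sec. Note that it undercounts allocations made by threads that terminate between the two samples.

```java
import java.lang.management.ManagementFactory;

public class AllocationRateSampler {
    public static void main(String[] args) throws InterruptedException {
        // com.sun.management.ThreadMXBean exposes per-thread allocated bytes on HotSpot JVMs
        com.sun.management.ThreadMXBean tmx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

        long before = totalAllocatedBytes(tmx);
        Thread.sleep(10_000);                      // sampling window: 10 seconds
        long after = totalAllocatedBytes(tmx);

        double mbPerSec = (after - before) / (10.0 * 1024 * 1024);
        System.out.printf("Object creation rate: %.2f MB/sec%n", mbPerSec);
    }

    private static long totalAllocatedBytes(com.sun.management.ThreadMXBean tmx) {
        long sum = 0;
        for (long id : tmx.getAllThreadIds()) {
            long bytes = tmx.getThreadAllocatedBytes(id); // -1 if the thread has terminated
            if (bytes > 0) {
                sum += bytes;
            }
        }
        return sum;
    }
}
```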
3. GC Throughput
Fig: Garbage Collection related micro-metrics
Whenever a Garbage Collection event runs, it pauses your application. During this time, no customer transactions are processed. GC Throughput is a key micro-metric that indicates the percentage of time the application spends processing customer transactions relative to GC activities.
Say in your previous release GC Throughput was 99%, and in the current release it has become 95%. It indicates that in your previous release your application was spending 99% of its time processing customer transactions and only 1% of its time on GC activities, whereas in the current release it's spending 95% of its time processing customer transactions and the remaining 5% on GC activities. This means the application's GC performance has degraded in the new release. If GC Throughput degrades in the new release, it indicates higher object allocation rates, increased GC frequency, or inefficient heap utilization, all of which can degrade response times in production. Thus, tracking the GC Throughput micro-metric in performance labs helps you catch potential slowdowns before they impact real users.
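GC Throughput is normally reported by GC log analyzers, but the arithmetic is simple: the fraction of wall-clock time not spent in GC. As a rough in-JVM approximation (illustrative only; getCollectionTime() includes concurrent work for some collectors, so it is not a pure pause measure), you can compare the accumulated collection time reported by the GarbageCollectorMXBeans against JVM uptime:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcThroughput {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcMillis += Math.max(0, gc.getCollectionTime()); // -1 if undefined
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();

        // GC Throughput = % of wall-clock time NOT spent in GC
        double throughput = 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
        System.out.printf("Time in GC: %d ms of %d ms uptime -> GC Throughput: %.2f%%%n",
                gcMillis, uptimeMillis, throughput);
    }
}
```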
4. GC Pause Time
GC Pause Time is another key micro-metric that indicates how long your application is paused during GC events. Here we want to track two metrics:
a. Average GC Pause Time: This micro-metric reports the average pause time of all GC events that ran in the application.
b. Max GC Pause Time: This micro-metric reports the maximum pause time of all GC events that ran in the application.
Studying both the average and max GC pause times across releases ensures that the new release doesn't introduce a performance regression. If GC pause times increase in the new release, it often signals increased object churn, larger heap scans, or inefficient GC tuning, which can lead to production outages under load. Thus, monitoring GC Pause Time metrics in performance labs provides an early warning, helping you safeguard application responsiveness before go-live.
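If you want to track average and max GC pause times without parsing GC logs, HotSpot emits a JMX notification for every GC event. The sketch below is illustrative (class and method names are my own, and GcInfo.getDuration() approximates the event duration rather than the exact stop-the-world pause for concurrent collectors); it keeps running average and max figures that you can print at the end of a test run.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.LongAdder;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

public class GcPauseTracker {
    private static final LongAdder totalPauseMs = new LongAdder();
    private static final LongAdder eventCount = new LongAdder();
    private static volatile long maxPauseMs = 0; // not strictly atomic, fine for a lab sketch

    public static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                            .from((CompositeData) notification.getUserData());
                    long durationMs = info.getGcInfo().getDuration();
                    totalPauseMs.add(durationMs);
                    eventCount.increment();
                    if (durationMs > maxPauseMs) {
                        maxPauseMs = durationMs;
                    }
                }
            }, null, null);
        }
    }

    public static void report() {
        long events = eventCount.sum();
        double avg = events == 0 ? 0 : (double) totalPauseMs.sum() / events;
        System.out.printf("GC events: %d, avg pause: %.1f ms, max pause: %d ms%n",
                events, avg, maxPauseMs);
    }
}
```

Call install() at application startup and report() at the end of the performance test window to compare the figures against the previous release.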
5. Thread Patterns
Whenever there is a bottleneck in the application code, multiple threads in the application will navigate to the line of code that is causing the bottleneck. For example, if a developer has committed an inefficient SQL query that takes a long time to complete, then multiple threads will get stuck on the method that makes that SQL call.
Thus, observing thread behavior during performance tests to see whether threads are accumulating on a particular line of code can help identify the poorly performing components of the application.
Fig: Thread Behavior reported as Flame Graph
One effective way to spot these patterns is through a flame graph visualization of the threads. The above flame graph represents an application experiencing a backend slowdown. You can see multiple threads originating from different code paths but ultimately getting stuck on the Oracle database call. For more details about this specific problem, see this real-world Oracle slowdown case study.
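Flame graphs are typically produced by profilers, but you can spot the same "threads piling up on one line of code" signal with a plain thread dump. Here's a minimal sketch (names are illustrative, not from this article) that groups threads by their top stack frame; a frame shared by a large number of threads is a good candidate for the bottleneck.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

public class HotFrameFinder {
    public static void main(String[] args) {
        // Take a thread dump and count how many threads share the same top stack frame
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);

        Map<String, Integer> framesToThreads = new TreeMap<>();
        for (ThreadInfo info : dump) {
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                framesToThreads.merge(top, 1, Integer::sum);
            }
        }

        // Frames shared by many threads are candidates for the bottleneck line of code
        framesToThreads.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + " threads at " + e.getKey()));
    }
}
```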
6. Thread States
Fig: Thread Count Summary
Studying the number of threads and their states can reveal early signs of performance issues. Threads in a JVM can be in one of the following states: NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, TERMINATED. Learn more about Java thread states from this post.
Key thread metrics to track in the performance lab include the total thread count and the number of threads in each of the above states; a jump in BLOCKED threads or a steadily growing total thread count in the new release is an early sign of critical issues. A minimal sketch for capturing these counts is shown below.
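A quick way to capture these counts without external tooling is to take a thread dump through ThreadMXBean and tally the states (a rough sketch; in practice you would compare these tallies between the old and new releases at the same load):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCounter {
    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);

        // dumpAllThreads(false, false) skips lock details; we only need the states here
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }

        counts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```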
7. Threads Behavior within Thread Pool
Fig: Threads Behavior within Thread Pool
Each application has multiple internal thread pools (for example, HTTP server thread pools, JMS listener thread pools, and application-specific executor pools).
Studying threads’ behavior & utilization within each thread pool is essential for the following reasons:
a. Under-Provisioned or Saturated Thread Pools: If a thread pool shows most threads in the RUNNABLE or BLOCKED state during performance tests, it often means the pool is under-provisioned, or there’s a bottleneck in task execution. It signals that threads are constantly busy or waiting for locks. Under production load, this could result in request queuing, timeouts, or even application unresponsiveness.
b. Over-Provisioned Thread Pools: If a thread pool consistently has a large number of threads in WAITING or TIMED_WAITING state, it usually means the pool is over-allocated. It signals that threads are sitting idle, waiting for tasks that rarely arrive. While it may seem harmless, idle threads consume memory and CPU context-switching resources.
Even if your application appears to have an adequate overall thread count, individual thread pool saturation or under-utilization can lead to performance bottlenecks or resource wastage. Either one can surface as a production outage under peak load.
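For thread pools your application creates itself (e.g., a ThreadPoolExecutor), the JDK already exposes the figures you need to spot saturation or over-provisioning. The sketch below is illustrative, assumes you hold a reference to the pool, and simply logs active versus idle capacity every 30 seconds; container-managed pools are usually observed through JMX or a monitoring tool instead.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolMonitor {
    /** Periodically prints utilization figures for an application-owned pool. */
    public static void monitor(String poolName, ThreadPoolExecutor pool) {
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(() -> System.out.printf(
                "[%s] active=%d, poolSize=%d, max=%d, queued=%d, completed=%d%n",
                poolName,
                pool.getActiveCount(),        // threads currently running tasks
                pool.getPoolSize(),           // threads currently in the pool
                pool.getMaximumPoolSize(),
                pool.getQueue().size(),       // tasks waiting for a free thread
                pool.getCompletedTaskCount()),
                0, 30, TimeUnit.SECONDS);     // remember to shut the reporter down after the test
    }
}
```

A queue that keeps growing while active count sits at the maximum indicates a saturated pool; a large pool size with near-zero active count indicates over-provisioning.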
8. TCP/IP Connection Count & States
Fig: TCP/IP connection States originating from the application
Modern enterprise applications communicate over various protocols (HTTP, HTTPS, JMS, Kafka, REST, SOAP, gRPC, WebSockets, FTP, SFTP, MQTT, AMQP, and proprietary TCP-based protocols) with a wide range of external systems: databases, payment gateways, AI platforms, and more.
Let's say in the new release a developer accidentally introduces too many calls (i.e., chatty calls) to external systems, or fails to close network connections properly. This has the potential to overwhelm the external systems and lead to outages, especially under peak production load. Thus, monitoring the number of TCP/IP connections and their states (e.g., ESTABLISHED, TIME_WAIT, CLOSE_WAIT) will help us detect connection leaks, chatty integrations, and connections stuck in abnormal states.
These issues often surface only under load, so tracking TCP/IP connection behavior in performance labs can expose risks before they escalate into production outages.
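Connection counts and states are usually gathered with OS utilities such as netstat or ss. If you want to fold that into a Java-based lab harness, a small wrapper like the sketch below (Linux-only, assumes the ss utility is on the PATH) tallies connections by TCP state; a CLOSE-WAIT count that keeps growing between releases is a classic sign of connections not being closed.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

public class TcpStateCounter {
    public static void main(String[] args) throws Exception {
        // Shell out to `ss -tan` (Linux); the first column of each row is the TCP state
        Process process = new ProcessBuilder("ss", "-tan").start();

        Map<String, Integer> stateCounts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                String state = line.trim().split("\\s+")[0]; // ESTAB, TIME-WAIT, CLOSE-WAIT, ...
                stateCounts.merge(state, 1, Integer::sum);
            }
        }
        stateCounts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```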
9. Error Trends in Application Log
Fig: Error Occurrences in the Application Log
Application logs are typically not studied in performance labs. They are investigated only when something breaks in production. This reactive approach often leaves critical issues undiscovered in the performance lab.
However, application logs contain a wealth of information. When they are closely examined, they can provide proactive signals of problems introduced in the new release. One of the key metrics to track in the application log is the number and types of errors. When you see a surge in particular types of errors in the new release, it can reveal hidden inefficiencies, integration flaws, or new vulnerabilities that weren't present in the earlier release.
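As a simple illustration of tracking error types across releases (the log path, regex, and class names here are my own assumptions, not from this article), the sketch below scans a log file, counts occurrences of each exception type, and prints a per-type tally you can diff against the previous release's baseline.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class LogErrorTrend {
    // Matches fully qualified exception class names such as java.sql.SQLTimeoutException
    private static final Pattern EXCEPTION = Pattern.compile("\\b([\\w.]+(?:Exception|Error))\\b");

    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        try (Stream<String> lines = Files.lines(Path.of(args[0]))) { // e.g. app.log
            lines.forEach(line -> {
                Matcher m = EXCEPTION.matcher(line);
                while (m.find()) {
                    counts.merge(m.group(1), 1, Integer::sum);
                }
            });
        }
        // Compare these counts against the previous release's baseline
        counts.forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```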
How to Capture these Micro Metrics?
All 9 of these critical micro metrics, and several more, can be captured with one tool: yCrash. You can follow the instructions given here to capture these micro metrics. In fact, all the metric screenshots used in this post were generated by the yCrash tool.
Best Practices to Capture these Micro Metrics
Say you are running the performance test for a 1-hour window. Then run the yCrash tool two times during your test window: once during the peak traffic window and once towards the end of the test, after all the traffic has been processed.
In this way, we can analyze what's happening in the application during the peak traffic window and also once all the traffic has been processed. This will give a complete 360-degree view of the application.
Conclusion
Hopefully these 9 micro-metrics will help you catch performance bottlenecks before they surface in production environments. Keep us posted if other micro-metrics have helped you catch performance problems during testing.