9 Micro-Metrics That Forecast Production Outages in Performance Labs
In enterprise-grade applications, performance tests are conducted before a new release goes to production. As part of these tests, the Performance QA team studies various metrics to make sure there is no degradation in application performance. They primarily study metrics such as CPU utilization, memory utilization, and response time.
I would like to categorize these metrics as macro metrics. Macro metrics are great; however, they have a couple of shortcomings:
a. Performance Problems Are Not Caught
Performance problems that we see in production do happen in our performance labs; however, they happen at an acute (much smaller) scale and are not big enough to tip the threshold of macro metrics. Thus these acute degradations go unnoticed and manifest as major performance problems in production.
b. Doesn't Facilitate Troubleshooting
Macro metrics, to a large extent, don't help development teams debug and troubleshoot problems. A macro metric will indicate that CPU consumption is high, but there will be no indication of whether CPU consumption increased because of heavy Garbage Collection activity, a thread looping problem, or some other coding issue. A macro metric will indicate that there is a degradation in response time, but it won't indicate whether the degradation is caused by locks in the application code or a backend connectivity issue.
Micro Metrics to Monitor in Performance Labs
Macro metrics should be complemented with micro metrics to address the above-mentioned shortcomings. In this article, I have listed 9 micro metrics that will help you capture acute performance degradations in your new releases.
Let's review these micro metrics in detail in this post: why they are important and what problems they bring to your visibility.
1. GC Behavior Pattern
Garbage Collection heavily influences an application's performance. Studying the Garbage Collection behavior will help you forecast memory-related bottlenecks.
Fig: Garbage Collection Behavior of a Healthy Application
The above graph shows the GC behavior of a healthy application. You can see a beautiful saw-tooth GC pattern. Notice that when heap usage reaches ~5.8GB, a 'Full GC' event (red triangle) gets triggered. When the 'Full GC' event runs, memory utilization drops all the way to the bottom, i.e., ~200MB. Please see the dotted black line in the graph, which connects all the bottom points. You can notice this dotted black line is running at 0°. It indicates that the application is in a healthy state and not suffering from any sort of memory problems.
Fig: Application Suffering from Acute Memory Leak in Performance Lab
Above is the garbage collection behavior of an application that is suffering from an acute memory leak. When an application exhibits this pattern, heap usage climbs slowly, eventually resulting in an OutOfMemoryError.
In the above figure, you can notice that the 'Full GC' (red triangle) event gets triggered when heap usage reaches around ~8GB. You can also observe that the amount of heap the Full GC events can recover starts to decline over time, i.e.:
a. When the first Full GC event ran, heap usage dropped to 3.9GB
b. When the second Full GC event ran, heap usage dropped only to 4.5GB
c. When the third Full GC event ran, heap usage dropped only to 5GB
d. When the final full GC event ran heap usage dropped only to 6.5GB
Please see the dotted black line in the graph, which connects all the bottom points. You can notice that the black line is climbing at roughly 15°. This indicates that the application is suffering from an acute memory leak. If this application runs for a prolonged period, it will experience an OutOfMemoryError. However, in our performance labs, we don't run the application for a long period.
When this application is released into production, you will see the behavior below:
Fig: Application Suffering from OutOfMemoryError in Production
In the above graph, towards the right side, you can notice that Full GC events are running continuously, yet memory usage doesn't drop. It's a clear indication that the application is suffering from a memory leak. By the time this pattern appears, customers have already been impacted and it's too late to catch the problem.
Thus, observing GC behavior in the performance lab will help you catch OutOfMemoryErrors early in the game.
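If you want a quick, tool-agnostic way to watch that post-Full-GC "floor" during a test, the JVM's standard MemoryPoolMXBean exposes the heap usage measured right after the most recent collection. Below is a minimal sketch (class name and sampling interval are illustrative, not from this article); if the printed floor keeps climbing from sample to sample, you are likely looking at the acute memory-leak pattern described above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class PostGcHeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long usedAfterGc = 0;
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // getCollectionUsage() reports pool usage measured right after the last GC
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (pool.getType() == MemoryType.HEAP && afterGc != null) {
                    usedAfterGc += afterGc.getUsed();
                }
            }
            // If this "floor" value keeps climbing run after run, suspect a memory leak
            System.out.printf("Heap used after last GC: %d MB%n", usedAfterGc / (1024 * 1024));
            Thread.sleep(60_000); // sample once a minute during the test
        }
    }
}
```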
2. Object Creation Rate
In performance labs we measure overall memory utilization. Let's say you have configured your application's max memory to be 6GB; then you will observe peak memory utilization reach close to 6GB. Even if a developer has committed inefficient code in the new release, you will still see overall memory utilization reach only up to 6GB. Thus, measuring overall memory utilization is not as insightful as studying the Object Creation Rate micro-metric.
Object Creation Rate reports the amount of memory allocated for new objects in a unit of time, e.g., 100 MB/sec. Say in your previous release your object creation rate was 50 MB/sec, whereas in the new release it has become 100 MB/sec; this indicates that your application is creating 2x more objects to service the same workload. It clearly shows that some inefficient code has been committed in the new release. This will manifest as high CPU consumption and degraded response time in the production environment. Thus, studying Object Creation Rate will help you catch memory inefficiencies early in the game.
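Object creation rate is usually derived from GC logs, but you can also approximate it inside the JVM. The sketch below is a rough illustration, not this article's tooling: it uses HotSpot's com.sun.management.ThreadMXBean to sum per-thread allocated bytes over a 10-second window and report MB/sec. Note that it undercounts allocations made by threads that terminate between the two samples.

```java
import java.lang.management.ManagementFactory;

public class AllocationRateSampler {
    public static void main(String[] args) throws InterruptedException {
        // com.sun.management.ThreadMXBean exposes per-thread allocated bytes on HotSpot JVMs
        com.sun.management.ThreadMXBean tmx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

        long before = totalAllocatedBytes(tmx);
        Thread.sleep(10_000);                      // sampling window: 10 seconds
        long after = totalAllocatedBytes(tmx);

        double mbPerSec = (after - before) / (10.0 * 1024 * 1024);
        System.out.printf("Object creation rate: %.2f MB/sec%n", mbPerSec);
    }

    private static long totalAllocatedBytes(com.sun.management.ThreadMXBean tmx) {
        long sum = 0;
        for (long id : tmx.getAllThreadIds()) {
            long bytes = tmx.getThreadAllocatedBytes(id); // -1 if the thread has terminated
            if (bytes > 0) {
                sum += bytes;
            }
        }
        return sum;
    }
}
```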
3. GC Throughput
Fig: Garbage Collection related micro-metrics
Whenever a Garbage Collection event runs, it pauses your application. During this time, no customer transactions are processed. GC Throughput is a key micro-metric that indicates the percentage of time the application spends processing customer transactions relative to GC activities.
Say in your previous release GC Throughput was 99%, and in the current release it has become 95%. It indicates that in your previous release your application was spending 99% of its time processing customer transactions and only 1% of its time on GC activities, whereas in the current release it's spending 95% of its time processing customer transactions and the remaining 5% on GC activities. This means the application's GC performance has degraded in the new release. If GC Throughput degrades in the new release, it indicates higher object allocation rates, increased GC frequency, or inefficient heap utilization, all of which can degrade response times in production. Thus, tracking the GC Throughput micro-metric in performance labs helps you catch potential slowdowns before they impact real users.
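GC Throughput is normally reported by GC log analyzers, but the arithmetic is simple: the fraction of wall-clock time not spent in GC. As a rough in-JVM approximation (illustrative only; getCollectionTime() includes concurrent work for some collectors, so it is not a pure pause measure), you can compare the accumulated collection time reported by the GarbageCollectorMXBeans against JVM uptime:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcThroughput {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcMillis += Math.max(0, gc.getCollectionTime()); // -1 if undefined
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();

        // GC Throughput = % of wall-clock time NOT spent in GC
        double throughput = 100.0 * (uptimeMillis - gcMillis) / uptimeMillis;
        System.out.printf("Time in GC: %d ms of %d ms uptime -> GC Throughput: %.2f%%%n",
                gcMillis, uptimeMillis, throughput);
    }
}
```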
4. GC Pause Time
GC Pause Time is another key micro-metric that indicates how long your application is paused during GC events. Here we want to track two metrics:
a. Average GC Pause Time: This micro-metric reports the average pause time of all GC events that ran in the application.
b. Max GC Pause Time: This micro-metric reports the maximum pause time of all GC events that ran in the application.
Studying both the average and max GC pause times across releases ensures that the new release doesn't introduce a performance regression. If GC pause times increase in the new release, it often signals increased object churn, larger heap scans, or inefficient GC tuning, which can lead to production outages under load. Thus, monitoring GC Pause Time metrics in performance labs provides an early warning, helping you safeguard application responsiveness before go-live.
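If you want to track average and max GC pause times without parsing GC logs, HotSpot emits a JMX notification for every GC event. The sketch below is illustrative (class and method names are my own, and GcInfo.getDuration() approximates the event duration rather than the exact stop-the-world pause for concurrent collectors); it keeps running average and max figures that you can print at the end of a test run.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.LongAdder;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

public class GcPauseTracker {
    private static final LongAdder totalPauseMs = new LongAdder();
    private static final LongAdder eventCount = new LongAdder();
    private static volatile long maxPauseMs = 0; // not strictly atomic, fine for a lab sketch

    public static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener((notification, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(notification.getType())) {
                    GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                            .from((CompositeData) notification.getUserData());
                    long durationMs = info.getGcInfo().getDuration();
                    totalPauseMs.add(durationMs);
                    eventCount.increment();
                    if (durationMs > maxPauseMs) {
                        maxPauseMs = durationMs;
                    }
                }
            }, null, null);
        }
    }

    public static void report() {
        long events = eventCount.sum();
        double avg = events == 0 ? 0 : (double) totalPauseMs.sum() / events;
        System.out.printf("GC events: %d, avg pause: %.1f ms, max pause: %d ms%n",
                events, avg, maxPauseMs);
    }
}
```

Call install() at application startup and report() at the end of the performance test window to compare the figures against the previous release.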
5. Thread Patterns
Whenever there is a bottleneck in the application code, multiple threads in the application will navigate to the line of code that is causing the bottleneck. For example, if a developer has committed an inefficient SQL query that takes a long time to complete, then multiple threads will get stuck on the method that makes that SQL call.
Thus, observing thread behavior during performance tests to see whether threads are accumulating on a particular line of code can help identify the poorly performing components of the application.
Fig: Thread Behavior reported as Flame Graph
One effective way to spot these patterns is through a flame graph visualization of the threads. The above flame graph represents an application experiencing a backend slowdown. You can see multiple threads originating from different code paths but ultimately getting stuck on the Oracle database call. For more details about this specific problem, see this real-world Oracle slowdown case study.
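Flame graphs are typically produced by profilers, but you can spot the same "threads piling up on one line of code" signal with a plain thread dump. Here's a minimal sketch (names are illustrative, not from this article) that groups threads by their top stack frame; a frame shared by a large number of threads is a good candidate for the bottleneck.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

public class HotFrameFinder {
    public static void main(String[] args) {
        // Take a thread dump and count how many threads share the same top stack frame
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);

        Map<String, Integer> framesToThreads = new TreeMap<>();
        for (ThreadInfo info : dump) {
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                framesToThreads.merge(top, 1, Integer::sum);
            }
        }

        // Frames shared by many threads are candidates for the bottleneck line of code
        framesToThreads.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + " threads at " + e.getKey()));
    }
}
```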
6. Thread States
Fig: Thread Count Summary
Studying the number of threads and their states can reveal early signs of performance issues. Threads in a JVM can be in one of the following states: NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, TERMINATED. Learn more about Java thread states from this post.
Key thread metrics to track in the performance lab include the total thread count and the number of threads in each of the above states; a jump in BLOCKED threads or a steadily growing total thread count in the new release is an early sign of critical issues. A minimal sketch for capturing these counts is shown below.
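A quick way to capture these counts without external tooling is to take a thread dump through ThreadMXBean and tally the states (a rough sketch; in practice you would compare these tallies between the old and new releases at the same load):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCounter {
    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);

        // dumpAllThreads(false, false) skips lock details; we only need the states here
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }

        counts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```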
7. Threads Behavior within Thread Pool
Fig: Threads Behavior within Thread Pool
Each application has multiple internal thread pools (for example, HTTP server thread pools, JMS listener thread pools, and application-specific executor pools).
Studying threads’ behavior & utilization within each thread pool is essential for the following reasons:
a. Under-Provisioned or Saturated Thread Pools: If a thread pool shows most threads in the RUNNABLE or BLOCKED state during performance tests, it often means the pool is under-provisioned, or there’s a bottleneck in task execution. It signals that threads are constantly busy or waiting for locks. Under production load, this could result in request queuing, timeouts, or even application unresponsiveness.
b. Over-Provisioned Thread Pools: If a thread pool consistently has a large number of threads in WAITING or TIMED_WAITING state, it usually means the pool is over-allocated. It signals that threads are sitting idle, waiting for tasks that rarely arrive. While it may seem harmless, idle threads consume memory and CPU context-switching resources.
Even if your application appears to have an adequate overall thread count, individual thread pool saturation or under-utilization can lead to performance bottlenecks or resource wastage. Either one can surface as a production outage under peak load.
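For thread pools your application creates itself (e.g., a ThreadPoolExecutor), the JDK already exposes the figures you need to spot saturation or over-provisioning. The sketch below is illustrative, assumes you hold a reference to the pool, and simply logs active versus idle capacity every 30 seconds; container-managed pools are usually observed through JMX or a monitoring tool instead.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolMonitor {
    /** Periodically prints utilization figures for an application-owned pool. */
    public static void monitor(String poolName, ThreadPoolExecutor pool) {
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(() -> System.out.printf(
                "[%s] active=%d, poolSize=%d, max=%d, queued=%d, completed=%d%n",
                poolName,
                pool.getActiveCount(),        // threads currently running tasks
                pool.getPoolSize(),           // threads currently in the pool
                pool.getMaximumPoolSize(),
                pool.getQueue().size(),       // tasks waiting for a free thread
                pool.getCompletedTaskCount()),
                0, 30, TimeUnit.SECONDS);     // remember to shut the reporter down after the test
    }
}
```

A queue that keeps growing while active count sits at the maximum indicates a saturated pool; a large pool size with near-zero active count indicates over-provisioning.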
8. TCP/IP Connection Count & States
Fig: TCP/IP connection States originating from the application
Modern enterprise applications communicate over various protocols (HTTP, HTTPS, JMS, Kafka, REST, SOAP, gRPC, WebSockets, FTP, SFTP, MQTT, AMQP, and proprietary TCP-based protocols) with a wide range of external systems: databases, payment gateways, AI platforms, and more.
Let's say in the new release a developer accidentally introduces too many calls (i.e., chatty calls) to external systems, or fails to close network connections properly. This has the potential to overwhelm the external systems and lead to outages, especially under peak production load. Thus, monitoring the number of TCP/IP connections and their states (e.g., ESTABLISHED, TIME_WAIT, CLOSE_WAIT) will help us detect connection leaks, chatty integrations, and connections stuck in abnormal states.
These issues often surface only under load, so tracking TCP/IP connection behavior in performance labs can expose risks before they escalate into production outages.
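Connection counts and states are usually gathered with OS utilities such as netstat or ss. If you want to fold that into a Java-based lab harness, a small wrapper like the sketch below (Linux-only, assumes the ss utility is on the PATH) tallies connections by TCP state; a CLOSE-WAIT count that keeps growing between releases is a classic sign of connections not being closed.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

public class TcpStateCounter {
    public static void main(String[] args) throws Exception {
        // Shell out to `ss -tan` (Linux); the first column of each row is the TCP state
        Process process = new ProcessBuilder("ss", "-tan").start();

        Map<String, Integer> stateCounts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                String state = line.trim().split("\\s+")[0]; // ESTAB, TIME-WAIT, CLOSE-WAIT, ...
                stateCounts.merge(state, 1, Integer::sum);
            }
        }
        stateCounts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```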
9. Error Trends in Application Log
Fig: Error Occurrences in the Application Log
Application logs are typically not studied in performance labs. They are investigated only when something breaks in production. This reactive approach often leaves critical issues undiscovered in the performance lab.
However, application logs contain a wealth of information. When they are closely examined, they can provide proactive signals of problems introduced in the new release. One of the key metrics to track in the application log is the number and types of errors. When you see a surge in particular types of errors in the new release, it can reveal hidden inefficiencies, integration flaws, or new vulnerabilities that weren't present in the earlier release.
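As a simple illustration of tracking error types across releases (the log path, regex, and class names here are my own assumptions, not from this article), the sketch below scans a log file, counts occurrences of each exception type, and prints a per-type tally you can diff against the previous release's baseline.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class LogErrorTrend {
    // Matches fully qualified exception class names such as java.sql.SQLTimeoutException
    private static final Pattern EXCEPTION = Pattern.compile("\\b([\\w.]+(?:Exception|Error))\\b");

    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        try (Stream<String> lines = Files.lines(Path.of(args[0]))) { // e.g. app.log
            lines.forEach(line -> {
                Matcher m = EXCEPTION.matcher(line);
                while (m.find()) {
                    counts.merge(m.group(1), 1, Integer::sum);
                }
            });
        }
        // Compare these counts against the previous release's baseline
        counts.forEach((type, count) -> System.out.println(type + ": " + count));
    }
}
```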
How to Capture these Micro Metrics?
All 9 of these critical micro metrics, and several more, can be captured with one tool: yCrash. You can follow the instructions given here to capture these micro metrics. In fact, all the metric screenshots used in this post were generated by the yCrash tool.
Best Practices to Capture these Micro Metrics
Say you are running the performance test for a 1-hour window. Then run the yCrash tool two times during your test window: once during the peak traffic window and once towards the end of the test, after all the traffic has been processed.
In this way, we can analyze what's happening in the application during the peak traffic window and also once all the traffic has been processed. This will give a complete 360-degree view of the application.
Conclusion
Hopefully these 9 micro-metrics will help you catch performance bottlenecks before they surface in production environments. Keep us posted if other micro-metrics have helped you catch performance problems during testing.