How New Relic Fits into Performance Engineering and Load Testing
Rebecca Clinard
Performance Engineering | Observability | Data Science | OpenTelemetry | Technical Evangelist
I am Rebecca Clinard, a Solutions Consultant at New Relic. My expertise is in performance engineering. What makes me an expert, you say? I’ve made a ton of mistakes and learned from them ;) I have been load testing and tuning web and mobile application deployments for most of my career and have successfully tuned deployments to reach upwards of 8x their initial scalability - sometimes without even having to change any of the underlying hardware. Tuning a deployment is both an art and a science: it’s an iterative process where you change code, server configurations, and the deployment architecture to achieve maximum throughput and maintain response-time SLAs while not flooding the underlying hardware resources. Basically, a juggling act. It’s a challenge, but it’s also FUN and REWARDING.
Here I’m going to share how New Relic is instrumental in your load testing practices. Once performance engineers have the opportunity to integrate the New Relic platform into their load testing cycles, they will become much more efficient and methodical.
Here is the typical performance engineering story:
As the load increases (virtual users are added to the application, executing automated workflows) → the throughput increases → a bottleneck is approached → the throughput plateaus → the response times increase → the throughput decreases → errors (timeouts, 500s) occur.
Prescriptive performance engineering processes for methodical load testing and root cause analysis:
1. Baseline the application using a low user load. For example, start with just 5 concurrent users. Determine whether all business transactions meet their SLA response time requirements.
TIP: If the results of your baseline load test show that transactions are already exceeding their SLAs, stop right there. There’s no reason to load test for scalability; bring the application back to development. If the baseline test passes, proceed to the next step.
From the baseline load test results, set a custom acceptable Apdex score for each of the instrumented applications. For example, the average response time from APM could be 0.68 seconds while the average response time from Browser is 1.4 seconds.
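To make the custom threshold concrete, here is a minimal sketch of the standard Apdex formula: (satisfied + tolerating/2) / total, where "satisfied" means at or under the threshold T and "tolerating" means up to 4T. The sample timings below are illustrative, not from a real test:

```python
def apdex(response_times, t):
    """Apdex score for a list of response times (seconds) against threshold T.
    Satisfied: <= T. Tolerating: <= 4T. Frustrated: everything slower."""
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# Illustrative baseline samples around the 0.68 s APM average from above
samples = [0.5, 0.6, 0.7, 0.9, 1.5, 3.2]
print(round(apdex(samples, t=0.7), 2))  # → 0.67
```

A score near 1.0 on the baseline run gives you a meaningful yardstick: once the ramping test drives the score down, you know users have started tolerating or abandoning.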
2. Design and execute a slow ramping load test. For example, start the load test with 5 users and add 5 users every 15 seconds. A staircase load test will slowly approach the point of performance degradation which will in turn make the analysis of the results easier.
TIP: If your application has a goal of reaching 5000 concurrent users, first design a load test to reach half that target. If the application scales successfully to the halved target load, then proceed and design the next test to double the load. Always be methodical in designing your load tests - never throw the full target workload onto an application, because this will give chaotic results which are difficult to interpret.
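The staircase ramp can be sketched as a simple schedule function. The start, step, and interval values mirror the example in step 2, and the cap mirrors the halved 5000-user goal from the tip; in practice your load tool’s ramp configuration replaces this:

```python
def staircase_users(elapsed_s, start=5, step=5, interval_s=15, cap=2500):
    """Concurrent users at a given elapsed time for a staircase ramp:
    begin with `start` users, add `step` more every `interval_s` seconds,
    and never exceed `cap` (here, half of a 5000-user target)."""
    users = start + (elapsed_s // interval_s) * step
    return min(users, cap)

# 0 s → 5 users, 60 s → 25 users, and eventually the ramp holds at the cap
print(staircase_users(0), staircase_users(60), staircase_users(8000))  # → 5 25 2500
```

Because each step is small, the point where throughput plateaus lines up cleanly with a specific user count, which is exactly what makes the analysis easier.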
3. If the load testing goal is throughput based, rather than “user” or session based, use the same approach to softly reach the target TPS. For example, if the API throughput goal is 200 TPS, start with a ramping load test to reach 100 TPS.
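For throughput-based goals, Little’s Law (L = λ·W) is a handy back-of-the-envelope way to size the user count needed for a target TPS. The think time below is an assumed example value, not from the text:

```python
def users_for_tps(target_tps, avg_response_s, think_time_s=0.0):
    """Little's Law (L = λ·W): concurrent users needed to sustain target_tps
    when each scripted iteration takes response time plus think time."""
    return target_tps * (avg_response_s + think_time_s)

# First-phase goal of 100 TPS, 0.68 s responses, assumed 2 s think time
print(round(users_for_tps(100, 0.68, 2.0)))  # → 268
```

This is only a steady-state estimate, but it keeps the virtual-user count and the TPS goal honest with each other when you design the ramp.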
4. Using the New Relic APM solution, on the Overview page, change the view to see percentiles and concentrate on the 95th percentile line (instead of concentrating on averages in the default view - the 95th percentile is more sensitive and granular). Highlight and zoom into the timeframe from just before the load test began to the time where response time began to degrade. Keep this timeframe in context for the rest of your analysis across all products (New Relic does this automatically - cool!)
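To see why the 95th percentile line is more sensitive than the average, here is a small nearest-rank percentile sketch with made-up timings: a couple of slow outliers barely move the mean but dominate the p95:

```python
def p95(values):
    """95th percentile via nearest rank: the value that 95% of samples
    fall at or below."""
    ordered = sorted(values)
    rank = max(1, round(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# 18 fast requests plus 2 slow ones: the mean stays low, the p95 does not
times = [0.5] * 18 + [4.0] * 2
print(sum(times) / len(times), p95(times))  # → 0.85 4.0
```

An average of 0.85 s looks healthy; the 4.0 s p95 tells you a slice of your users is already suffering, which is the early-warning signal you want during a ramp.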
TIP: The bottleneck could be a hard limitation, a soft limitation, or code. The key to the performance engineering process is to identify the first occurring bottleneck. In your zoomed timeframe, do not include the aftermath after the buckle point - these are all cascading symptoms, and symptoms should be differentiated from root causes. There is no value in analyzing symptoms.
The next 4 steps can be done in the order which makes the most sense to you. You can start with Browser and progressively analyze down to the backend (Top Down approach), or you can start with Infrastructure and analyze up to the Browser (Bottom Up approach). The approach the performance engineer takes is completely dependent on their own theories. Again, performance engineering is both an art and a science.
5. Determine which application transactions are degrading and therefore causing the overall increase in response times. This includes identifying which services (internal or external) are causing the degradation.
TIP: If multiple transactions are degrading with the exact same trend, it's usually a shared hard or soft resource which is approaching saturation.
6. Use New Relic APM to progressively isolate the code inefficiencies or error conditions. Use the transaction traces to isolate the exact code which is either degrading or throwing an error.
7. Use the New Relic Infrastructure product to identify whether any hard resources are becoming saturated (CPU, memory, network, etc.) on each host/server which is part of the deployment.
TIP: A hard resource doesn’t have to be completely saturated in order for response times to degrade - even 70% utilization can be enough. Correlation is required to understand the cause and effect.
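A textbook M/M/1 queuing model (an approximation for illustration, not a New Relic feature) shows why response times degrade well before 100% saturation: average response time grows as S / (1 − ρ), where S is the service time and ρ the utilization:

```python
def mm1_response_time(service_time_s, utilization):
    """M/M/1 approximation: average response time = S / (1 - ρ).
    Response time blows up as utilization ρ approaches 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1 - utilization)

# A 0.1 s service time stretches ~3.3x at 70% utilization, 10x at 90%
for rho in (0.5, 0.7, 0.9):
    print(rho, round(mm1_response_time(0.1, rho), 2))
```

Real systems are not ideal M/M/1 queues, but the shape of the curve is why that 70% CPU host deserves a second look when response times start climbing.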
If the bottleneck is not a hardware resource, check for saturation of the servers’ soft limitations, including connection pools, datasource connections, the TCP stack, etc.
TIP: Saturations of soft limitations will often show up as queuing in New Relic.
8. Use your Browser product to identify whether increased response times are originating from the front end. Identify opportunities to make your pages more lightweight, for example in the rendering of assets. Identify any Ajax requests to internal or 3rd-party services which are causing slowdowns.
Let’s Begin the Tuning!
9. Tune or make a change in your deployment and send a New Relic deployment marker to mark the change. Tag this deployment marker with the details of your change (e.g., added 2 CPUs to a VM, or a new build number).
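A deployment marker can be recorded through New Relic’s REST API v2 deployments endpoint. The sketch below only builds the JSON body; the field names follow the v2 deployments docs as I recall them, so verify the exact endpoint, headers, and fields against the current New Relic documentation before relying on this:

```python
import json

def deployment_marker(revision, description, user="perf-team"):
    """Build the JSON body for a New Relic (REST v2 style) deployment marker.
    The revision and description should capture the ONE variable you changed."""
    return {
        "deployment": {
            "revision": revision,        # e.g., build number or git SHA
            "description": description,  # e.g., "added 2 CPUs to app VM"
            "user": user,                # who made the change
        }
    }

body = deployment_marker("build-142", "Added 2 vCPUs to the app VM")
print(json.dumps(body, indent=2))
```

The body is then POSTed to the deployments endpoint for your application, authenticated with an API key. Sending it at the moment of the change lines the marker up with the load-test timeline, so every before/after comparison is anchored to a known change.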
TIP: Tune 1 variable at a time. If you change 2 things at once (add more hard resources and double the JVM heap size), then you will not have a clear picture as to how each variable impacted overall scalability.
10. Repeat your ramping load test and view the results. Determine whether the results are the same (no difference, so you still need to identify the correct bottleneck), better, or worse. Keep or revert your change.
TIP: Often, when you tune a variable and results get worse - congratulate yourself, because you have just learned something very valuable about that resource as it pertains to the scalability of the deployment. It’s often the “failed” tunings that lead to identifying the limitations - by exaggerating a previously unknown limitation. Finding the bottleneck is the hard part; alleviating it is the easy part!
11. Iterate: ramp to the next occurring bottleneck.
TIP: Performance engineering is never “done”, as every component of a deployment, from the workload to the features to the architecture, is in constant change.
New Relic features for methodical load testing:
- Service Maps to identify connectivity and upstream and downstream dependencies. Often used during the discovery phase for creating a performance test harness.
- On the Overview page, use the Percentiles view while analyzing load test results. Concentrate on the 95th percentile line to determine when overall response times begin trending upward.
- Deployment markers - mark each deployment change and each build, and track exactly what has changed on the application side. Once you have a realistic performance test, don’t change the performance test harness (configurations such as run-time settings). The performance test harness should remain the constant and the application should be the variable(s). Treat your approach like a science experiment.
- Create Key Transactions for any business transactions which are expected (and accepted) to exceed the generic SLA of the application. For example, a long-running report might have an Apdex threshold of 5 seconds.
- Enable Distributed Tracing to have a clear visual of where the time is being spent across services.
- After each load test, click on a few key Transaction Traces which your team would like to keep for later analysis (unclicked transaction traces are retained for 7 days).
- Create a custom Insights dashboard with the KPIs you are always interested in watching during load tests. This way, analysis of load tests is made incredibly efficient and swift.
Hope you have learned a lot! Stay curious!