Wrapping up 13 years of performance engineering
Lessons learned along the way

Thirteen years ago, I fired off my CV to a few dozen organisations looking for my first job in IT. Months later, after continual rejection or not hearing anything at all, I got an interview for a "junior performance engineer" role at a consultancy called Qual IT. My job interview with Stijn Schepers involved reviewing his DVD collection, and somehow a week later I showed up for my first day. The rest, as they say, is history.

Having recently transitioned into the world of SRE, I thought this would be a good juncture to stop and take a moment to reflect. What have I learned? What would I do differently? What helped me along the way? In this blog I will attempt to condense the key things I learned over the years as a performance engineer.

Look at the raw data

Stijn Schepers and I have said it again, and again, and again... and we kept saying it for a reason. If you don't look at raw data, you don't truly understand how the system is behaving. I've written about it here, touched on it here, and here.

I remember, years ago, load testing a set of REST APIs used by a mobile application. Looking at the raw data on a scatterplot I could see a clear pattern: most responses took ~2 seconds, some took ~4 seconds, a small number took ~6 seconds, and a tiny number took ~8 seconds.

Seeing bands at two-second intervals made the behaviour easy to communicate, and we found the issue by identifying which components had two-second timeout/retry cycles configured.

Most performance engineers would have looked at a percentile response time over time chart, or possibly an average over time, and missed the pattern entirely. This is also what most load testing tools show you out of the box. I've heard arguments over the years that you don't need raw data - that you could see the same thing by looking at percentile distributions, or using the standard deviation, and so on. At the end of the day, the raw data lays it all out in front of you. Any kind of aggregation either relies on knowing what the pattern is in advance or leaves gaps that important information can slip through.
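As an illustration, here is a minimal Node.js sketch of the kind of quick-and-dirty raw data analysis I mean. It assumes a JMeter-style results CSV with timeStamp and elapsed columns (the file name results.jtl is hypothetical, and the naive comma split ignores quoted fields), and it buckets raw samples into two-second bands so the timeout/retry pattern above would stand out even without a plotting tool:

```javascript
// band-count.js - bucket raw JMeter results into 2-second bands (sketch)
const fs = require('fs');

const lines = fs.readFileSync('results.jtl', 'utf8').trim().split('\n');
const header = lines[0].split(',');
const elapsedIdx = header.indexOf('elapsed'); // response time in milliseconds

const bands = new Map();
for (const line of lines.slice(1)) {
  // Naive CSV split - good enough for a quick look at clean result files.
  const elapsedMs = Number(line.split(',')[elapsedIdx]);
  if (Number.isNaN(elapsedMs)) continue;          // skip malformed rows
  const band = Math.floor(elapsedMs / 2000) * 2;  // 0-2s, 2-4s, 4-6s, ...
  bands.set(band, (bands.get(band) || 0) + 1);
}

for (const [band, count] of [...bands].sort((a, b) => a[0] - b[0])) {
  console.log(`${band}-${band + 2}s: ${count} samples`);
}
```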

Divide and conquer to investigate issues

I started investigating performance bottlenecks under the mentorship of Neil Davies. His analogy was that software solutions are made up of lots of little "factories" which can produce goods at a certain rate. The factory which produces at the slowest rate is going to become the first bottleneck in the solution. If you like mathematics (I'm not good at it) then we're talking about queuing theory.
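For what it's worth, the factory analogy is easy to make concrete in code. This is a minimal sketch with made-up component names and service rates: the stage with the lowest rate caps the throughput of the whole solution, and it is the first one whose utilisation reaches 100% as load grows.

```javascript
// bottleneck.js - find the first "factory" to saturate (sketch, made-up numbers)
const stages = [
  { name: 'load balancer', serviceRatePerSec: 900 },
  { name: 'web tier',      serviceRatePerSec: 400 },
  { name: 'app tier',      serviceRatePerSec: 250 },
  { name: 'database',      serviceRatePerSec: 300 },
];

// The slowest stage caps end-to-end throughput.
const bottleneck = stages.reduce((a, b) =>
  a.serviceRatePerSec <= b.serviceRatePerSec ? a : b);
console.log(`Max sustainable throughput: ${bottleneck.serviceRatePerSec}/s (limited by ${bottleneck.name})`);

// Utilisation of each stage at a given arrival rate (requests/sec).
const arrivalRate = 200;
for (const s of stages) {
  const utilisation = arrivalRate / s.serviceRatePerSec;
  console.log(`${s.name}: ${(utilisation * 100).toFixed(0)}% busy`);
}
```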

It's more than mathematics though. In order to think about a solution you have to understand all the components which make up the whole. Asking questions and drawing the solution for myself has been one of the most valuable things I've learned in my career. This "reverse architecting" approach is helpful for any sort of IT work, not just performance engineering.

If you have a performance issue, map out all the components in the solution and understand the flow. Then pick a point in the middle of the solution and grab timings there. For example, in one simplified solution we picked a load balancer in the middle of the flow and grabbed the access logs from it.

If the issue (e.g. slowness) is visible at this load balancer layer, then we know it lies either here or further downstream. If it is not occurring here, then it must lie in the upstream web or load balancer layers.

This "divide and conquer" approach is key to investigating bottlenecks and issues. In the real world systems are more complex than my example, but the approach is the same.

Write your own tools

As a performance engineer you will touch on an enormous variety of technologies, tools, and processes. It is inevitable that you will find yourself in situations where you don't have the right tool for the job. Or, the tools you have available are clumsy and not an ideal fit.

Yeah, ok, I didn't build tools *quite* like the Space Shuttle...

Part of being a performance engineer is having the ability to write your own tools when the need arises. Some situations in which I've written my own code include:

  • When I needed to transform and store data. This was the most common reason I had for writing my own tools - for example, a script to parse web server access logs into a table of data I could analyse.
  • When I needed to monitor a component which didn't have existing monitoring set up. For example, a script which talks to a bunch of Linux servers via SSH and runs the vmstat command to collect CPU usage data (a minimal sketch of this follows after the list). This was extremely useful when I came in as an external consultant because it was agentless - I didn't need to install anything on the servers I was monitoring.
  • If, like me, you use a lot of open source tooling, then there are often gaps you need to fill. One example for me was the lack of a complete and effective test data management solution in JMeter (one that supports consumable data, random access, etc.). I wrote my own test data management platform similar to the Simple Table Server (STS) plugin, but as a standalone Node.js app that ran 24/7 and could be shared by others in the company.
  • For automating repetitive tasks. For example, I wrote PowerShell scripts for querying system databases for fresh test data and then updating my datapools prior to a test.
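Here is a minimal sketch of what the agentless collector in the second bullet might look like. It shells out to the ssh binary already on your machine (so nothing is installed on the targets), runs vmstat on a list of hosts, and prints CPU figures as CSV. The host names are hypothetical and key-based SSH access is assumed:

```javascript
// vmstat-collector.js - agentless CPU sampling over SSH (sketch; hosts are hypothetical)
const { execFile } = require('child_process');
const { promisify } = require('util');
const run = promisify(execFile);

const hosts = ['app01.example.internal', 'app02.example.internal'];

async function sample(host) {
  // "vmstat 1 3": one line per second; the first data row is an average since boot.
  const { stdout } = await run('ssh', ['-o', 'BatchMode=yes', host, 'vmstat 1 3']);
  const lines = stdout.trim().split('\n');

  // Map column names (us, sy, id, ...) to their positions using vmstat's header row.
  const headerIdx = lines.findIndex((l) => /\bus\b.*\bsy\b.*\bid\b/.test(l));
  const cols = Object.fromEntries(
    lines[headerIdx].trim().split(/\s+/).map((name, i) => [name, i]));

  // Skip the average-since-boot row and report the live samples.
  for (const line of lines.slice(headerIdx + 2)) {
    const f = line.trim().split(/\s+/);
    console.log(`${host},${new Date().toISOString()},${f[cols.us]},${f[cols.sy]},${f[cols.id]}`);
  }
}

(async () => {
  console.log('host,timestamp,cpu_user,cpu_system,cpu_idle');
  for (const host of hosts) await sample(host);
})();
```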

Find a generalist scripting language that has a wide variety of libraries. In recent years I've been using JavaScript (Node.js). In the past I used PowerShell, Python, and Perl. A scripting language is an important weapon in your arsenal to build the "glue" that holds your work together. There is also a lot of creativity and innovation that can occur when you build your own utilities.

Get it right manually before you try and automate

There has been a lot of focus in recent years on automating work - doing anything and everything to "fit into the pipeline". Trying to automate before you have a clear and simple manual approach is like trying to run before you can walk.

As I've said publicly in my blog post The Myth of Continuous Performance Testing, there are many parts of performance engineering which can't be automated easily. My advice is to not compromise on the integrity of the work just to fit into someone else's idea of CI/CD.

Record what you learn

As with any role in technology, the knowledge you gain is invaluable. You will do too much to remember all the details - so make sure you write it down. Knowledge bases are an extraordinary tool for both yourself and your team.

Find a tool that works for you and your organisation. I use Confluence, but there are better tools out there that allow you to enter knowledge in a chaotic manner, yet make it easily searchable and accessible. Some examples include Obsidian.md, Joplin, and MediaWiki (the technology that Wikipedia is built on).

Make sure the knowledge you record is easy to add to, easy to maintain, accessible, and something the teams you work with will actually use. Not every team needs to use the same technology either. I recently posted about an idea some of my colleagues had for an internal Q&A site (like Stack Overflow) which could be used to index documentation and written knowledge from a variety of sources.

Putting your knowledge down in writing also enables others. If you have long-running load test suites in your organisation, put together a test execution guide for each one - including all the preparation, execution, and post-test steps (even if you automate a lot of this). Write it so that someone without any context could run a test in your absence.

I've spoken about documentation and knowledge retention fairly frequently including this recent Performance Time podcast episode titled Oh No! Not Documentation!

Always think about the customer

Don't forget the reason you are testing and monitoring performance. It's easy to lose sight of our purpose, which is to give the customer (whether internal or external) the performance we want for them. Anything that does not work toward that goal is unnecessary.

Regularly review what you are doing and assess whether it's actually providing customer value. If it's not, stop doing it. Some examples of things to watch out for:

  • Testing functionality which is barely used by customers. I heard of a team performance testing a function that was called three times an hour at peak load. Load testing is expensive; this isn't a good use of time.
  • Monitoring a lot of technical metrics but not being able to answer simple questions about the customer experience or behaviour. This problem is pervasive in our industry, and it's something I'm trying to tackle as an SRE by introducing SLOs. Start by tracking the key customer metrics - this kind of black box monitoring should be the first step (a minimal sketch of such a check follows after this list). Only once you are able to see the customer experience should you expand your observability to help you investigate and diagnose issues faster.
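To make the black box idea concrete, here is a minimal sketch of a synthetic check. It needs Node 18+ for the built-in fetch, and the URL and targets are hypothetical. It probes one customer-facing endpoint and reports success rate and high-percentile latency - the two numbers I'd want before adding any deeper instrumentation:

```javascript
// blackbox-probe.js - simple synthetic check against a customer-facing endpoint (sketch)
const TARGET_URL = 'https://shop.example.com/api/health'; // hypothetical endpoint
const PROBES = 30;
const SLO = { successRate: 0.99, p95LatencyMs: 500 };     // illustrative targets

async function probe() {
  const start = Date.now();
  try {
    const res = await fetch(TARGET_URL);
    return { ok: res.ok, latencyMs: Date.now() - start };
  } catch {
    return { ok: false, latencyMs: Date.now() - start };
  }
}

(async () => {
  const results = [];
  for (let i = 0; i < PROBES; i++) results.push(await probe());

  const successRate = results.filter((r) => r.ok).length / results.length;
  const latencies = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95 = latencies[Math.floor(0.95 * (latencies.length - 1))];

  console.log(`success rate: ${(successRate * 100).toFixed(1)}% (target ${SLO.successRate * 100}%)`);
  console.log(`p95 latency: ${p95}ms (target ${SLO.p95LatencyMs}ms)`);
})();
```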

The other thing to think about is how much effort you're putting into testing in pre-production environments and how closely that matches what happens in the real world. All testing we do is an approximation, and it's important to verify our findings in production (whether it be through monitoring, or production testing, or both).

Use a risk-based approach (and *actually* do it)

I hear the term "risk-based" bandied about in the industry all the time, but it's often just an empty buzzword. Performance engineering (more than just about anything else) is often about avoiding terrible things (i.e. risk), which makes a risk-based approach to the work ideal.

Before you build, test, monitor, or investigate anything - assess the risk. Think about the components of the solution, the different customers who will use it, the anticipated workload for each service, the business criticality of each service, and the technical complexity involved in testing or monitoring each component or integration. Until you have done this analysis, how can you meaningfully decide on your approach? Far too often I see performance testing strategies pushed out from a rigid template that does nothing to address the specific risks of the situation at hand.

Be an advocate

You will be pressured throughout your career to complete performance testing just to check a box before a release. There is nothing more demoralizing than a death march of complicated performance testing work which isn't providing any value.

The general awareness of performance engineering and maturity level in many organisations is low. It's your duty to raise the profile of performance and to promote good practice. No-one else is going to do it for you.

Learn how to speak to different stakeholders including leadership, business, and different groups of engineers. Work on your written and verbal presentation skills. Do your best to inspire others and get them interested in performance, and demonstrate what great performance engineering looks like.

Don't hoard knowledge about performance engineering in your head - the more you share it, the better off everyone is, including you.

Don't just automate, simplify

As with any part of technology, performance engineers are continually looking for ways to be more efficient. There aren't enough of us to go around, and as system complexity rises, the problem becomes ever more challenging.

Automation is often seen as a silver bullet to solve the problem. Got too much work on for your performance team to handle? Just automate their test execution and you can throw more work their way! This, of course, is fiction. Automation is one tool at our disposal, but there are so many others that often aren't considered.

I have found the best gains in efficiency have come not from automation, but from simplifying my work. What do I mean by simplifying? Here are some examples:

  • Review the scope of your testing. Do you need to keep including all the tests you have run in the past? Are they still providing value? Make sure you're covering the happy path scenarios (unless there is a specific reason to target alternative scenarios).
  • Would there be value in replacing downstream integrations with stubs / mocked endpoints? These significantly reduce the complexity of managing test data and break your dependence on other teams and systems being available when you run your tests (a minimal stub sketch follows after this list).
  • Can you change your test approach to make it simpler to build and maintain? For example, utilising UI automation for particularly complex web applications, or calling APIs directly instead of driving a UI front end.
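On the stubbing point above, a stub doesn't need to be sophisticated to be useful. Here's a minimal sketch using only Node's built-in http module: it stands in for a hypothetical downstream endpoint, returns a canned response, and adds a configurable delay so you can still exercise realistic timings without the real dependency:

```javascript
// stub-downstream.js - canned response for a downstream dependency (sketch)
const http = require('http');

const PORT = 8080;
const DELAY_MS = 150; // simulate the downstream service's typical latency

http.createServer((req, res) => {
  // Hypothetical endpoint the system under test would normally call.
  if (req.method === 'GET' && req.url.startsWith('/customers/')) {
    setTimeout(() => {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ id: req.url.split('/')[2], status: 'ACTIVE' }));
    }, DELAY_MS);
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(PORT, () => console.log(`stub listening on :${PORT}`));
```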

As I mentioned earlier, bring it all back to the customer. Is the test or activity you are doing working to protect and improve the customer experience? If not, consider trimming it.

Find mentors and peers to help you grow

In the beginning it's challenging to know where to go next to improve yourself without someone to guide you. Try and find a mentor, someone to point you in the right direction. In time you'll have enough of a foundation to be self-led in your learning, but not right away.

When you are more independent in your work and learning find peers in the industry who you can bounce ideas off. There is nothing better than having someone to call when you find yourself in a situation you can't figure out. It's invaluable, and has saved my bacon several times over the years.

If you're a senior engineer, please share your knowledge with the next generation. If you can bring just two new capable performance engineers into the industry during your career then you're leaving it significantly better than when you started.

Think beyond load testing

Load testing is becoming a less and less important part of the performance engineer's work. It is time consuming, and the complexity of doing it is rising along with system complexity.

There are many other techniques available that can help you answer questions about performance risk, and I strongly recommend spending more time on activities such as:

  • Utilising functional test automation to capture single-user timings. It's far from perfect - the sample size is small and the results tell us nothing about capacity or stability - but it does tell us something about response times, leverages work that is already being done, and gives feedback early and often.
  • Single-user profiling. Capture the HTTP traffic using your browser's developer tools, or a web proxy like Fiddler. There is a lot you can learn from this. Are there large page resources slowing things down? Is client-side caching enabled? Compression? Which requests are taking a long time? Are requests being triggered in parallel or sequentially? (A sketch of capturing these timings programmatically follows after this list.)
  • Utilising APM, tracing, or logging tools in a pre-prod (or prod) environment. Even without substantial load you can learn about the behaviour of the system within the context of a single user.
  • Do more in production. APM, synthetic transactions, browser monitoring (RUM), production performance testing, even think about risk reducing techniques like canary releases.

If you are still mostly doing load testing, make sure you expand your toolkit. Don't get left behind. Being good at writing scripts in JMeter or LoadRunner isn't going to be enough in the future.

Care, but not too much

This one is pretty personal to me, but I know many people experience anxiety to varying degrees in our field. Look after yourself and keep perspective. No-one ever lay on their death bed wishing they'd worked a little bit more.

If you are someone who has a tendency toward anxiety, this is something you'll need to work at continually.

Acknowledgements

I've been extremely fortunate to have worked alongside and interacted with some extraordinary performance engineers in my career. I wanted to mention a few in particular:

  • Stijn Schepers for giving me the start I needed in my career, and for mentoring me in those early years. It's been a great joy to be able to catch up with you in Europe during the Neotys PAC conferences over the years. I know you have also transitioned into a new field, and I wish you all the best.
  • Neil Davies for pushing me to be a true performance engineer. You showed me how to investigate and identify bottlenecks, gave me a love for data and visualizations, showed me I could build my own tools and utilities, and taught me how to reverse engineer solutions. You pushed me further than I ever could have gone alone and showed me the breadth and depth of the field of software performance.
  • Richard Leeke for introducing Tableau to the performance engineering community, and for building the extraordinary community of engineers that we are lucky to have in little old New Zealand. You have a rare combination of incredible knowledge and ability coupled with a down to earth humility that I admire.
  • Srivalli Aparna and Ben Rowan for being sounding boards over the years. I think the three of us have different but complementary skills and experiences, and working alongside you both has made me a better engineer and person.
  • Andi Grabner, James Pulley, Alexander Podelko, Mark Tomlinson, Leandro Melendez, Henrik Rexed, Scott Moore, and Nicole van der Hoeven (among many others) for the guidance and content you create for the community. You make the community a thriving and exciting thing to be part of. The value of what you do to raise the profile of performance engineering across the globe cannot be overstated.
  • Henrik Rexed, Stephane Brunet, and all the others at Neotys who brought the Neotys PAC events to life. Travelling to Europe to speak at a Scottish Castle, the French Alps, and Santorini was truly extraordinary.
  • Paul Zhang, Cynthia Tan, Owen Hu, Raina Chand, Gwen De Leon, and the others I've trained over the years. Your fresh perspectives and energy have kept me young (on the inside).

In closing

I'm not closing the door on performance engineering. I moved to SRE to pick up more generalist engineering skills to open more doors in the future, and to hopefully have a bigger impact with the work I do. You never know though, I could be back trawling through logs and pinpointing issues in the future.

Peter Booth

System Engineer

2y

This is interesting. I've been working in performance for a similar time and very rarely use JMeter or LoadRunner - not because they're bad tools, but because load testing isn't the most useful activity for understanding how systems perform. Many of the performance issues I work on only appear in production environments, because UAT/test environments are hosted on different hardware, with different workloads, different network topologies, and a whole bunch of monitoring, production load balancer, firewall, CDN configurations, etc.

Gururaj R.

Performance | Automation | DevOps | PMP

2y

Best of luck... 13 years of experience well summed up. Learning is not a destination, it's a journey.

Jeevan Deep Mankar

Principal Performance Engineer | AWS | Azure Certified

2y

Great summary. I see a lot of people move from performance engineering to SRE - what's the main reason? Is it just interest, or are there fewer opportunities for growth in performance engineering after some years?

Edoardo Varani

Principal Data Performance Engineer @ Emirates

2y

Pure gold Stephen! I'm sure your SRE journey will be smoother with such a vast performance engineering background behind you.

Madhuraj Rajendran

Senior Manager - Products

2y

Could relate, and wonderful picturization too, Sensei-ni-Rei.
