Wrapping up 13 years of performance engineering
Thirteen years ago, I fired off my CV to a few dozen organisations looking for my first job in IT. Months later, after continual rejection or not hearing anything at all, I got an interview for a "junior performance engineer" role at a consultancy called Qual IT. My job interview with Stijn Schepers involved reviewing his DVD collection, and somehow a week later I showed up for my first day. The rest, as they say, is history.
Having recently transitioned into the world of SRE, I thought this would be a good juncture to stop and take a moment to reflect. What have I learned? What would I do differently? What helped me along the way? In this post I'll attempt to condense the key things I learned over the years as a performance engineer.
Look at the raw data
Stijn Schepers and I have said it again, and again, and again... and we kept saying it for a reason. If you don't look at raw data, you don't truly understand how the system is behaving. I've written about it here, touched on it here, and here.
I remember, years ago, load testing a set of REST APIs used by a mobile application. Looking at the raw data on a scatterplot I could see a clear pattern: most responses took ~2 seconds, some took ~4 seconds, a smaller number took ~6 seconds, and a handful took ~8 seconds.
Seeing bands at two-second intervals made the behaviour easy to communicate, and we tracked the issue down to the components that had two-second timeout/retry cycles configured.
Most performance engineers would have looked at a percentile response time over time chart, or possibly an average over time, and missed the pattern entirely. That is also what most load testing tools show you out of the box. I've heard arguments over the years that you don't need raw data - that you could see the same thing by looking at percentile distributions, or the standard deviation, and so on. At the end of the day, the raw data lays it all out in front of you. Any kind of aggregation either relies on knowing what the pattern is in advance or leaves gaps that important information can slip through.
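To make this concrete, here is a minimal sketch of the kind of plot I mean, assuming a hypothetical CSV export of raw per-request results with timeStamp and elapsed columns (the file name and column names are assumptions - most load testing tools can produce something similar):

```python
# Minimal sketch: plot every individual response time as a scatterplot and
# overlay a rolling average, to show how aggregation hides banding.
# Assumes a hypothetical CSV of raw results with "timeStamp" (epoch millis)
# and "elapsed" (response time in millis) columns.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("raw_results.csv")                      # hypothetical file
df["time"] = pd.to_datetime(df["timeStamp"], unit="ms")
df = df.sort_values("time")

fig, ax = plt.subplots(figsize=(12, 6))

# Every single sample: banding at ~2 second intervals shows up immediately here.
ax.scatter(df["time"], df["elapsed"] / 1000.0, s=4, alpha=0.3, label="raw samples")

# The same data aggregated: a rolling mean smooths the bands away entirely.
rolling = df.set_index("time")["elapsed"].rolling("60s").mean() / 1000.0
ax.plot(rolling.index, rolling.values, color="red", label="60s rolling average")

ax.set_xlabel("time")
ax.set_ylabel("response time (s)")
ax.legend()
plt.tight_layout()
plt.savefig("scatter_vs_average.png")
```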
Divide and conquer to investigate issues
I started investigating performance bottlenecks under the mentorship of Neil Davies. His analogy was that software solutions are made up of lots of little "factories" which can produce goods at a certain rate. The factory which produces at the slowest rate is going to become the first bottleneck in the solution. If you like mathematics (I'm not good at it) then we're talking about queuing theory.
It's more than mathematics though. In order to think about a solution you have to understand all the components which make up the whole. Asking questions and drawing the solution for myself has been one of the most valuable things I've learned in my career. This "reverse architecting" approach is helpful for any sort of IT work, not just performance engineering.
If you have a performance issue, map out all the components in the solution and understand the flow. Then pick a point in the middle of the solution and grab timings there. In a simplified example, that point might be a load balancer, where you can grab the access logs.
If the issue (e.g. slowness) is visible at this load balancer layer, then it is either being introduced here or somewhere downstream. If it is not visible here, the issue must be in the upstream web or load balancer layers.
This "divide and conquer" approach is key to investigating bottlenecks and issues. In the real world systems are more complex than my example, but the approach is the same.
Write your own tools
As a performance engineer you will touch on an enormous variety of technologies, tools, and processes. It is inevitable that you will find yourself in situations where you don't have the right tool for the job. Or, the tools you have available are clumsy and not an ideal fit.
Part of being a performance engineer is having the ability to write your own tools when the need arises - over the years I've written my own code for all sorts of situations.
Find a generalist scripting language that has a wide variety of libraries. In recent years I've been using JavaScript (Node.js). In the past I used PowerShell, Python, and Perl. A scripting language is an important weapon in your arsenal to build the "glue" that holds your work together. There is also a lot of creativity and innovation that can occur when you build your own utilities.
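As a flavour of what I mean, here is a sketch of the kind of small, disposable utility that pays for itself - a per-label summary of a JMeter results (JTL/CSV) file, assuming the default JMeter CSV output with label, elapsed, and success columns:

```python
# Example of a small "glue" utility: summarise a JMeter results file (CSV/JTL)
# per transaction label. Assumes the default JMeter CSV output with at least
# the "label", "elapsed" and "success" columns - tweak to suit your own files.
import csv
import sys
from collections import defaultdict

def summarise(jtl_path):
    samples = defaultdict(list)   # label -> list of (elapsed_ms, success)
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            samples[row["label"]].append(
                (int(row["elapsed"]), row["success"].lower() == "true"))

    print(f"{'label':40s} {'count':>7s} {'errors':>7s} {'avg ms':>8s} {'p90 ms':>8s}")
    for label, rows in sorted(samples.items()):
        elapsed = sorted(ms for ms, _ in rows)
        errors = sum(1 for _, ok in rows if not ok)
        p90 = elapsed[int(len(elapsed) * 0.9) - 1] if len(elapsed) >= 10 else elapsed[-1]
        print(f"{label:40s} {len(rows):7d} {errors:7d} "
              f"{sum(elapsed) / len(elapsed):8.0f} {p90:8d}")

if __name__ == "__main__":
    summarise(sys.argv[1])
```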
Get it right manually before you try and automate
There has been a lot of focus in recent years on automating work - doing anything and everything to "fit into the pipeline". Trying to automate before you have a clear and simple manual approach is like trying to run before you can walk.
As I've said publicly in my blog post The Myth of Continuous Performance Testing, there are many parts of performance engineering which can't be automated easily. My advice is to not compromise on the integrity of the work just to fit into someone else's idea of CI/CD.
Record what you learn
As with any role in technology, the knowledge you gain is invaluable. You will do too much to remember all the details - so make sure you write it down. Knowledge bases are an extraordinary tool for both yourself and your team.
Find a tool that works for you and your organisation. I use Confluence, but there are better tools out there that allow you to enter knowledge in a chaotic manner, yet make it easily searchable and accessible. Some examples include Obsidian.md, Joplin, and MediaWiki (the technology that Wikipedia is built on).
Make sure the knowledge you record is easy to add to, easy to maintain, accessible, and is something the teams you work with will actually use. Not every team needs to use the same technology either. I recently posted about an idea some of my colleagues had about an internal Q&A site (like Stack Overflow) which could be used to index documentation and written knowledge from a variety of sources.
Putting your knowledge down in writing also enables others. If you have long-running load test suites in your organisation, put together a test execution guide for each one - including all the preparation, execution, and post-test steps (even if you automate a lot of this). Write it so that someone with no context could run a test in your absence.
I've spoken about documentation and knowledge retention fairly frequently, including in a recent Performance Time podcast episode titled Oh No! Not Documentation!
Always think about the customer
Don't forget the reason you are testing and monitoring performance. It's easy to lose sight of our purpose, which is to give customers (whether internal or external) the performance they need. Anything that does not work toward that goal is unnecessary.
Regularly review what you are doing and assess whether it's actually providing customer value. If it's not, stop doing it.
The other thing to think about is how much effort you're putting into testing in pre-production environments, and how closely that matches what happens in the real world. All testing we do is an approximation, and it's important to verify our findings in production - whether through monitoring, production testing, or both.
Use a risk-based approach (and *actually* do it)
I hear the term "risk-based" bandied about in the industry all the time, but it's often just an empty buzzword. Performance engineering (more than just about anything else) is about avoiding terrible things (i.e. risk), which makes a risk-based approach to the work ideal.
Before you build, test, monitor, or investigate anything, assess the risk. Think about the components of the solution, the different customers who will use it, the anticipated workload for each service, the business criticality of each service, and the technical complexity involved in testing or monitoring each component or integration. Until you've done this analysis, how can you meaningfully decide on an approach? Far too often I see performance testing strategies pushed out from a rigid template that does nothing to address the specific risks of the situation at hand.
Be an advocate
You will be pressured throughout your career to complete performance testing just to check a box before a release. There is nothing more demoralising than a death march of complicated performance testing work that isn't providing any value.
The general awareness of performance engineering and maturity level in many organisations is low. It's your duty to raise the profile of performance and to promote good practice. No-one else is going to do it for you.
Learn how to speak to different stakeholders including leadership, business, and different groups of engineers. Work on your written and verbal presentation skills. Do your best to inspire others and get them interested in performance, and demonstrate what great performance engineering looks like.
Don't hoard knowledge about performance engineering in your head - the more you share it, the better off everyone is, including you.
Don't just automate, simplify
As with any part of technology, performance engineers are continually looking for ways to be more efficient. There aren't enough of us to go around, and as system complexity rises the problem becomes ever more challenging.
Automation is often seen as a silver bullet to solve the problem. Got too much work on for your performance team to handle? Just automate their test execution and you can throw more work their way! This, of course, is fiction. Automation is one tool at our disposal, but there are so many others that often aren't considered.
I have found that the best gains in efficiency have come not from automation, but from simplifying my work.
As I mentioned earlier, bring it all back to the customer. Is the test or activity you are doing working to protect and improve the customer experience? If not, consider trimming it.
Find mentors and peers to help you grow
In the beginning it's challenging to know where to go next to improve yourself without someone to guide you. Try and find a mentor, someone to point you in the right direction. In time you'll have enough of a foundation to be self-led in your learning, but not right away.
When you are more independent in your work and learning, find peers in the industry who you can bounce ideas off. There is nothing better than having someone to call when you find yourself in a situation you can't figure out. It's invaluable, and it has saved my bacon several times over the years.
If you're a senior engineer, please share your knowledge with the next generation. If you can bring just two new capable performance engineers into the industry during your career then you're leaving it significantly better than when you started.
Think beyond load testing
Load testing is becoming a less and less important part of the performance engineer's work. It is time consuming, and the complexity of doing it is rising along with system complexity.
There are many other techniques available that can help you answer questions about performance risk, and I strongly recommend spending more of your time on them.
If you are still mostly doing load testing, make sure you expand your toolkit. Don't get left behind. Being good at writing scripts in JMeter or LoadRunner isn't going to be enough in the future.
Care, but not too much
This one is pretty personal to me, but I know many people experience anxiety to varying degrees in our field. Look after yourself and keep perspective. No-one ever lay on their deathbed wishing they'd worked a little bit more.
If you are someone who has a tendency toward anxiety, this is something you'll need to work at continually.
Acknowledgements
I've been extremely fortunate to have worked alongside and interacted with some extraordinary performance engineers in my career.
In closing
I'm not closing the door on performance engineering. I moved to SRE to pick up more generalist engineering skills to open more doors in the future, and to hopefully have a bigger impact with the work I do. You never know though, I could be back trawling through logs and pinpointing issues in the future.