Types of Performance Data
Stephen Townshend
Tech Manager and Reliability, Observability and Resilience Engineer and Advocate
There's more to performance than response time.
In this blog we are going to look at the kinds of data that can tell us about system performance, and where we can find it.
What performance data is there?
There are lots of different measurements that can tell us about software performance. The diagram below summarises some of the key ones:
Here's a brief explanation of each:
Response time is any measurement of how long something takes. We often associate response time with user (or customer) experience. There are different scales of response time to consider (see the sketch after this list):
- User response time, for example, submitting a search on a website.
- Component response time, for example, how long a particular API, line of code, or database query is taking.
- Processing time for longer events such as batch jobs.
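To make that concrete, here's a minimal sketch in Python of measuring component response time by timing a call directly (the wrapped query function in the commented example is hypothetical):

```python
import time

def measure(label, func, *args, **kwargs):
    """Time a single call and print the elapsed wall-clock time."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result

# Hypothetical usage: timing a database query function.
# customer = measure("customer lookup", run_query, "SELECT * FROM customers WHERE id = ?", 42)
```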
Throughput is the volume of load our system is under within a certain time-frame. Examples of throughput include business transactions per hour, pages per hour, API requests per second, or Mbps of data transferred over a network.
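As a rough illustration, throughput over time can be derived by bucketing request timestamps - the timestamps below are invented purely for the example:

```python
from collections import Counter
from datetime import datetime

# Count how many requests arrived in each one-second bucket.
timestamps = [
    "2024-01-15 09:00:00", "2024-01-15 09:00:00", "2024-01-15 09:00:01",
    "2024-01-15 09:00:01", "2024-01-15 09:00:01", "2024-01-15 09:00:02",
]
per_second = Counter(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
for second, count in sorted(per_second.items()):
    print(f"{second}: {count} requests/second")
```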
Workload is more than just throughput. Concurrency is not about the volume of load but about how it is applied. There are different kinds of concurrency to consider:
- Concurrent user sessions on a system and the memory footprint of each.
- The rate of arrival and whether requests are coming in concurrently or not.
The third aspect of workload, which I haven't included in the diagram, is the nature of the load. What are the specific business activities, APIs, or transactions that our users (or consumers) are completing, and what is the proportion of each?
Errors tell us about the stability of our system. The error messages themselves are often very useful. If you have an error in a log stating the application has run out of heap space, that's useful information.
There's also the rate of errors and when they occur. Is the error rate proportional to the load being applied? Does it change at different times of the day? Do certain errors occur at certain times more than others?
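One way to answer those questions is to count errors per time bucket straight from the logs. A minimal sketch, assuming a log format like "2024-01-15 14:03:22 ERROR OutOfMemoryError ..." (both the format and the file name are assumptions to adjust for your own system):

```python
import re
from collections import Counter

pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2} ERROR (\w+)")
errors_per_hour = Counter()

with open("application.log") as log:
    for line in log:
        match = pattern.match(line)
        if match:
            hour, error_type = match.groups()
            errors_per_hour[(hour, error_type)] += 1

for (hour, error_type), count in sorted(errors_per_hour.items()):
    print(f"{hour}:00  {error_type}: {count}")
```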
Server resources are the tip of the iceberg. There are countless resources we can monitor at an operating system and application level, but I suggest starting with the basics. The four key hardware resources are processor, memory, network, and disk.
Lastly, I have mentioned queue length. Think of a software system as a giant sausage machine made up of lots of little sausage machines. Each of these is capable of producing sausages at a certain rate - maybe one sausage per second. If we try to shove meat into the machine faster than it can process it, we'll get a backlog of meat hanging out the back - a queue.
There are intentional queues such as ActiveMQ or the Azure Service Bus, but there are also unintentional queues. For example, say our application is trying to process thousands of transactions but the CPU is 100% saturated. These transactions get queued up waiting for the CPU to be free. If we can monitor either the intentional or unintentional queues it can help us understand the bottlenecks and behaviours of our system.
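Here's a back-of-the-envelope sketch of that idea: when the arrival rate exceeds the service rate, the backlog grows for every second the overload continues (the rates are invented for illustration):

```python
# If work arrives faster than the "machine" can process it, a queue builds up.
arrival_rate = 120   # items arriving per second
service_rate = 100   # items processed per second

backlog = 0
for second in range(1, 6):
    backlog += arrival_rate - service_rate  # net growth each second
    print(f"after {second}s: queue length = {backlog}")
# The queue grows by 20 items every second the overload continues.
```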
Where can we find this data?
Performance data can be found in a lot of different places. Here are some of the most common:
Load testing tools are the most obvious place to look for performance data. Here we get response time and system behaviour metrics, and some of these tools also capture server or application resources. But what if we want to drill down deeper? Or what if we want to look at a production system without running a test?
Server logs are a great source of performance data, which I spoke about at length in my Neotys PAC talk. You'll either need a log analytics tool, or you'll need to get your hands dirty and write some code to parse these logs in order to make sense of them.
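As a rough example of the "write some code" option, here's a sketch that pulls response times out of an access log, assuming each line ends with a timing such as "GET /api/search 200 342ms" (the format and file name are assumptions):

```python
import re
import statistics

pattern = re.compile(r"(\S+) (\S+) (\d{3}) (\d+)ms$")
timings = []

with open("access.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            timings.append(int(match.group(4)))

if len(timings) >= 2:
    print(f"requests: {len(timings)}")
    print(f"mean:     {statistics.mean(timings):.0f} ms")
    print(f"95th pct: {statistics.quantiles(timings, n=20)[-1]:.0f} ms")
```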
Server resource monitoring tools are how we capture those basic hardware resources I mentioned earlier (processor, memory, disk, and network). There are hundreds of options, depending on your platform.
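If you just want a quick sample from a script, something like the following works as a sketch, assuming the third-party psutil package is installed (it is not part of the Python standard library):

```python
import psutil  # third-party package: pip install psutil

# Sample the four basic hardware resources once.
# Dedicated monitoring tools do this continuously and keep the history.
cpu = psutil.cpu_percent(interval=1)      # % CPU used over a 1-second sample
mem = psutil.virtual_memory().percent     # % physical memory in use
disk = psutil.disk_io_counters()          # cumulative disk read/write counters
net = psutil.net_io_counters()            # cumulative bytes sent/received

print(f"cpu: {cpu}%  memory: {mem}%")
print(f"disk reads/writes: {disk.read_count}/{disk.write_count}")
print(f"network sent/received: {net.bytes_sent}/{net.bytes_recv} bytes")
```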
I often find a lot of value in querying the application database of the system under test. In most cases I can at least find workload information to help build a more accurate model, but often applications log performance metrics directly to a database - including timings.
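A minimal sketch of that kind of query, using SQLite for illustration - the database file, table, and column names here are entirely hypothetical; in practice you would query whatever tables record business transactions:

```python
import sqlite3

conn = sqlite3.connect("application.db")
rows = conn.execute(
    """
    SELECT strftime('%Y-%m-%d %H:00', created_at) AS hour,
           transaction_type,
           COUNT(*) AS volume
    FROM transactions
    GROUP BY hour, transaction_type
    ORDER BY hour
    """
).fetchall()

# Hourly transaction volumes by type help build a workload model.
for hour, transaction_type, volume in rows:
    print(f"{hour}  {transaction_type}: {volume}")
conn.close()
```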
If you have access to an application performance monitoring (APM) tool you'll probably be able to get very fine-grained and detailed performance metrics about your system. These tools (generally speaking) have agents which listen in to every line of code or database query run in production or during a performance test.
In Summary
I've deliberately kept this blog simple. In my next blog I'll be talking about some of the considerations we need to make when interpreting this kind of performance data. It's one thing to have the data, quite another to understand how the system is actually behaving.