Investigating Performance Issues: Stick to the Basics
Incidents and problems occur in every organization. To address them effectively, it is crucial to identify and resolve their underlying causes, which is what delivers both short-term and long-term improvements.
Several root cause analysis (RCA) techniques can be applied to problems ranging from simple to complex. When a problem spans multiple components such as compute, storage, and network, it is especially important to follow the basic rules, because doing so produces reliable and comprehensive results.
Investigating system performance issues requires a structured approach to identify and address underlying causes. One of the critical steps is analyzing metrics to find bottlenecks: looking for patterns, anomalies, spikes, dips, or trends that could indicate where the system is constrained.
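As a rough illustration of what "looking for spikes and dips" can mean in practice, the sketch below flags samples that sit far from the typical level of a series. It uses the median and the median absolute deviation, which is one common choice rather than a method prescribed here. The function name flag_anomalies, the multiplier k, and the sample values are all hypothetical.

```python
from statistics import median

def flag_anomalies(samples, k=3.0):
    """Return indexes of samples that deviate from the median by more
    than k times the median absolute deviation (MAD).

    k=3.0 is an assumed starting point, not a recommended setting.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # Every sample is (almost) identical, so nothing stands out.
        return []
    return [i for i, x in enumerate(samples) if abs(x - med) > k * mad]

# Hypothetical utilization samples (%), containing one spike and one dip.
samples = [40, 42, 41, 39, 43, 95, 41, 40, 12, 42]
anomalies = flag_anomalies(samples)
print(anomalies)
```

Flagged indexes are only candidates for a closer look. Whether a spike or dip matters still depends on what the system was doing at that time.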
When investigating performance issues that result from capacity bottlenecks, relying solely on long-interval average utilization reports, particularly those generated daily or weekly, can be misleading. Averages mask the peaks that occur at specific times of the day, which can lead to inaccurate assessments and misdiagnoses.
Granular or real-time reports are therefore essential. They show how utilization actually varies over time, which lets you identify and address capacity issues more effectively.
Here's why granular or real-time reports are valuable:
The hourly report (chart-1) shows a peak in resource demand from 9 AM to 11 AM, which likely corresponds to the business's peak hours. The overall daily average utilization, however, comes to only 43%, a figure that is highly likely to mislead the investigation. Examining the daily reports for the past two weeks does not provide clarity either: the average daily utilization stays consistently below 50%, which hides the same morning peak.
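The effect is easy to reproduce with a small sketch. The hourly values below are hypothetical and are not the data behind chart-1. They are chosen only to show how a day with a saturated morning can still report a daily average below 50%. The summarize helper and the 90% threshold are likewise assumptions made for illustration.

```python
from statistics import mean

# Hypothetical hourly utilization (%) for one day; list index = hour of day (0-23).
# Illustrative values only, not taken from chart-1.
hourly_util = [20, 18, 17, 16, 18, 22, 30, 45, 70, 95, 97, 92,
               60, 55, 50, 48, 45, 40, 35, 30, 28, 25, 22, 20]

def summarize(samples, threshold=90):
    """Return the daily average, the peak value, and the hours at or above threshold."""
    daily_avg = mean(samples)
    peak = max(samples)
    hot_hours = [hour for hour, value in enumerate(samples) if value >= threshold]
    return daily_avg, peak, hot_hours

daily_avg, peak, hot_hours = summarize(hourly_util)
print(f"daily average: {daily_avg:.1f}%")
print(f"peak hour value: {peak}%")
print(f"hours at or above 90%: {hot_hours}")
```

With these assumed values the daily average is below 50%, while three consecutive morning hours sit above 90%. That is the pattern a daily report cannot show.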
As a result, one must always rely on real-time reports, paying close attention to the times when users are more likely to complain about slow response.