AIOps for IBM Z – A look into the Inspect Capability Area
In an earlier blog, I described our framework for accelerating a client’s journey to AIOps for IBM Z, and how this is essential for companies’ digital transformation. I explained that the journey has four stages: Firefighting, Reactive, Proactive, and Intelligent. I also covered the three capability areas of AIOps: Inspect, Evaluate, and Act (see the figure below). In this blog, let’s explore best practices for the Inspect capability area.
The goal of the Inspect capability area is to identify potential issues before they disrupt your business. To accomplish this goal, we need to excel in three areas: monitoring our complete infrastructure and end-to-end application performance, generating alerts for incidents, and applying analytics for early detection of anomalies. Let’s dive into each of the stages of Inspect.
Firefighting
Customers in firefighting mode focus on service restoration, resolving problems as they happen. Many problems are found later than would be ideal, in some cases only when end users raise concerns, so the business impact of each issue is largely unknown and there is little prioritization of the issues being worked. IBM Z is likely also managed as a silo, with little in common with how operations are run in other parts of the company. Anti-patterns for this stage include:
- Monitoring solution does not cover the full stack of technologies at hand, leaving blind spots. The organization and the tools they use are also siloed, leading to a lack of ownership across subsystems.
- Thresholds are not used, or are set once and never changed, which leaves many thresholds useless and creates either too few or too many alerts.
- Operators spend a lot of time watching monitors for problems, rather than relying on being alerted when they need to pay attention to an issue.
To get out of this mode, companies should look to adopt practices covered in the next section.
Reactive
Organizations moving from firefighting to reactive are investing in practices, skills, and tools that allow them to identify problems in a more structured way, which also means finding some problems earlier. This investment in process improvement should pay off through increased operational efficiency and improved SLAs as a result of better resiliency. Practices for reactive include:
- Ensure your monitoring solutions have appropriate coverage. Holes in coverage allow an issue in one area to go undetected until it spreads and has a more systemic impact on core business applications. This entails full-stack monitoring, including middleware, APIs, JVMs, operating systems, hardware, storage, and networks.
- Improve your use of thresholds and rule-based alerts in your monitoring solution, so operators no longer need to watch the monitors continuously and you are notified of problems earlier, when addressing them has a smaller impact on end users and SLAs.
- With the massive volume of workloads running on IBM Z, the number of events can be overwhelming. We therefore recommend that you use notifications for non-critical events and alerts for critical events. Operators and SMEs can subscribe to the right level of events. As an example, subject matter experts may subscribe to notifications and use them for root cause analysis and to identify new opportunities for automation that keeps thresholds from being breached.
- To effectively manage all critical alerts, incident tickets are manually created in an enterprise-wide support system with the necessary information. This helps you to ensure that all critical alerts are addressed.
- To avoid known defects, which can have a major impact on the resilience of your system, you should do preventive maintenance. We recommend that preventive maintenance be installed at least two to four times a year. In addition, we recommend that potentially high-impact fixes, such as HIPER, PE Fix, Security/Integrity, and Pervasive PTFs, be installed more frequently (see this article for more details).
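To make the threshold and notification practices above more concrete, here is a minimal Python sketch of rule-based alerting in which critical breaches become alerts and non-critical breaches become notifications. The metric names, threshold values, and the `Rule`, `classify`, and `route` helpers are all hypothetical and only illustrate the idea; they are not taken from any IBM monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical two-level threshold rule for one metric."""
    metric: str
    warning: float   # at or above this value: notification
    critical: float  # at or above this value: alert


def classify(rule: Rule, value: float) -> str:
    """Return 'alert', 'notification', or 'ok' for a single sample."""
    if value >= rule.critical:
        return "alert"
    if value >= rule.warning:
        return "notification"
    return "ok"


def route(rules, samples):
    """Split samples into alerts and notifications based on the rules."""
    alerts, notifications = [], []
    for metric, value in samples:
        rule = rules.get(metric)
        if rule is None:
            continue  # no rule for this metric: a monitoring blind spot
        level = classify(rule, value)
        if level == "alert":
            alerts.append((metric, value))
        elif level == "notification":
            notifications.append((metric, value))
    return alerts, notifications


# Illustrative rules and samples (assumed values, not recommendations).
RULES = {
    "cpu_busy_pct": Rule("cpu_busy_pct", warning=80.0, critical=95.0),
    "txn_latency_ms": Rule("txn_latency_ms", warning=200.0, critical=500.0),
}
SAMPLES = [
    ("cpu_busy_pct", 97.0),
    ("txn_latency_ms", 250.0),
    ("cpu_busy_pct", 40.0),
]
ALERTS, NOTIFICATIONS = route(RULES, SAMPLES)
```

In this sketch, operators would be paged only for the entries in `ALERTS`, while SMEs could subscribe to `NOTIFICATIONS` for root cause analysis.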
Proactive
Organizations moving from reactive to proactive are adopting practices that help them detect incidents earlier, before they have a negative business impact. They are also maturing their best practices to handle the complexity of hybrid applications. Let’s have a look at some of the best practices we find in organizations in this stage of the journey to AIOps.
- Incident ticket creation is automated. This provides consistency, for example in the level of information included in each ticket, and further ensures that all critical alerts are addressed and evaluated quickly.
- New thresholds and rule-based alerts are created on an ongoing basis, so that incidents that were previously missed by monitoring and detected manually are caught automatically in the future.
- Intensified monitoring is performed at regular intervals, e.g. using z/OS health checks, and the automation is set up to report findings through Incident Management. If unhealthy conditions are detected, appropriate remediations are taken, such as adding automation to reduce the risk of any adverse impact on the business.
- Business applications are monitored end-to-end across your hybrid cloud using Application Performance Management software which tracks a transaction as it goes from mobile through all platforms and subsystems. This radically reduces time in war rooms, as you can immediately understand where the source of slowdown in an application is, so you can contact the right SME for root cause analysis and problem resolution.
- Monitoring solutions across your hybrid cloud infrastructure are now feeding into a single pane of glass. This provides you with a consistent and shared understanding of the state of your entire hybrid cloud infrastructure. As your infrastructure is only as strong as your weakest link, this helps you to rapidly address issues no matter where they occur.
- Key Performance Indicators (KPIs) such as Traffic, Latency, Saturation, and Errors are used to monitor the health of systems and applications and to quickly identify issues. This provides clarity for operators and ensures a level of consistency with other platforms, as these KPIs are increasingly becoming an industry standard.
- Any change to a KPI, whether informational, warning, or critical, results in an event that is generated automatically and delivered to a central event management system where statistical analysis is possible.
- Monitoring tools are viewed as critical to the business, are never turned off, and are set up with redundancy so they stay available during an outage.
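The KPI practice above, where any change in a KPI's state generates an event for a central event management system, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `KpiTracker` class, the KPI names, and the limits are hypothetical, and a plain list stands in for the central event management system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event emitted when a KPI changes state."""
    kpi: str
    old: str
    new: str
    value: float


def severity(value, warning, critical):
    """Map a KPI value to informational, warning, or critical."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "informational"


class KpiTracker:
    def __init__(self, limits):
        self.limits = limits  # kpi name -> (warning, critical)
        self.state = {}       # kpi name -> last known severity
        self.events = []      # stand-in for a central event system

    def observe(self, kpi, value):
        """Record a sample; emit an event only when the state changes."""
        warning, critical = self.limits[kpi]
        new = severity(value, warning, critical)
        old = self.state.get(kpi, "informational")
        self.state[kpi] = new
        if new == old:
            return None
        event = Event(kpi, old, new, value)
        self.events.append(event)
        return event


# Illustrative limits and samples (assumed values).
tracker = KpiTracker({"latency_ms": (200.0, 500.0),
                      "error_rate_pct": (1.0, 5.0)})
for kpi, value in [("latency_ms", 120.0), ("latency_ms", 260.0),
                   ("latency_ms", 270.0), ("latency_ms", 90.0)]:
    tracker.observe(kpi, value)
```

Emitting events only on state changes, including the return to informational, keeps the event volume manageable while still giving the central system a complete history for statistical analysis.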
Intelligent
In the Intelligent stage, the focus is on continuous improvement. You also continue to integrate the practices and management environments for Inspect, Evaluate and Act into one integrated solution. While intelligence and Machine Learning may have been present in a previous stage for specific narrow applications, we now find a more pervasive adoption of Machine Learning. You understand what is normal for your systems by establishing a baseline, look for anomalies, find trends, and forecast problems so you can remediate them before they become a service disruption. The combination of all the above provides you with the ability to rapidly respond to more and more issues before they impact your business.
- Dynamic and intelligent thresholds are set automatically by AI agents, with awareness of periodicity and the importance of applications, and are used to identify issues and anomalies. This can radically improve the quality of alerts: it reduces alert noise by avoiding unnecessary alerts, while also reducing the number of cases in which no alert is raised even though there is reason for concern.
- Responsiveness to alerts is tracked, through mechanisms such as responsiveness to paging. This allows you to refine the processes used, including on-call duty, to ensure responsiveness is continuously improved and stays within SLAs.
- Machine Learning is leveraged to understand what normal looks like for your organization. This enables real-time scoring of KPIs and analysis of logs so you can detect anomalies before they disrupt your business. These anomalies have associated alerts, allowing operators and SMEs to be alerted to any anomalous behavior of logs or metrics.
- Problem signatures, enabling early identification of specific critical issues, are identified on an ongoing basis, and machine learning algorithms are trained for new problem signatures to map anomalous behavior to the corresponding problem signature. This enables operators to not only be alerted to an anomaly, but also forecast when a threshold may be breached, and understand what the likely root cause is, with guidance on how to fix the problem.
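As a rough illustration of the baseline, anomaly, and forecast ideas above, the Python sketch below flags a KPI value that deviates strongly from its historical baseline and estimates when a rising trend would reach a threshold. It deliberately uses simple statistics (a z-score and a least-squares line) as a stand-in for the Machine Learning models discussed here; the function names, the `z_limit` default, and the sample values are assumptions for illustration only. A periodicity-aware version could, for example, keep a separate history per hour of the day.

```python
from statistics import mean, stdev


def is_anomaly(history, value, z_limit=3.0):
    """Flag a value more than z_limit standard deviations from baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


def steps_until_breach(series, threshold):
    """Estimate sampling intervals until a linear trend hits threshold.

    Fits a least-squares line to equally spaced samples. Returns None
    when there are too few samples or the trend is flat or falling.
    """
    n = len(series)
    if n < 2:
        return None
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(series))
    den = sum((x - x_bar) ** 2 for x in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    x_breach = (threshold - intercept) / slope
    return max(0.0, x_breach - (n - 1))


# Illustrative data (assumed values).
BASELINE = [50, 52, 49, 51, 50, 48, 51, 49]
NORMAL = is_anomaly(BASELINE, 51)
SPIKE = is_anomaly(BASELINE, 90)
ETA = steps_until_breach([60, 62, 64, 66, 68], 80)
```

Here `ETA` is expressed in sampling intervals after the last observation, so it needs to be multiplied by the collection interval to give a time estimate.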
Conclusion
The journey to AIOps is exciting. In our assessments, we find that most companies are in the Reactive stage, with some in the Proactive stage. We are also seeing some leaders moving aggressively towards Intelligent. The practices for each area are well defined. The journey still takes some time, but companies making the investments are advancing fast. As a next step, I suggest you check out the blog from Nathan Brice on the Evaluate Capability Area and enjoy your journey to AIOps!