Impact of GenAI on Site Reliability Engineering (SRE)
According to Gartner, Inc., by 2027, 75% of enterprises will use site reliability engineering (SRE) practices across their organizations to optimize product design, cost and operations to meet customer expectations, and by 2025, 40% of organizations will implement chaos engineering practices as part of SRE initiatives, improving mean time to repair (MTTR) by an average of 90%*. Evidently, reliability is becoming a strategic imperative for most enterprises. At the same time, in recent months companies have placed Generative AI (GenAI) very high on their list of strategic priorities. According to Accenture’s 2024 Technology Vision report**, 97% of global executives agree that foundation models will enable connections across data types, revolutionizing where and how AI is used. To operate in tomorrow’s market, businesses will need to lean on the full capabilities that generative AI provides. Is GenAI relevant for reliability? Having already started exploring the benefits of GenAI for the business, we thought it worth analyzing the impact and benefits of GenAI in strengthening the reliability posture of an organization.
Site reliability engineering principles support product centricity, enabling teams to design, build, deploy and operate software and systems across the life cycle of the product while keeping pace with the demands of continuous delivery. SRE principles also set the foundation for service quality and enable organizations to lower their cost of operations while maintaining operational excellence and stability.
How can GenAI become a catalyst for SRE transformations, supporting operational stability while fostering market potential?
Generative AI goes beyond analyzing and classifying data towards creating something entirely new, including text, images, synthetic data and more. Given GenAI’s ability to identify patterns and correlations in large datasets, we see it as a potential enabler for several reliability use cases. The following sections outline potential applications of GenAI in the context of Site Reliability Engineering. We have selected a few strong and relevant use cases to analyze.
Disaster Recovery plans and tests
Disaster recovery (DR) plans are typically developed based on the experience of individual subject matter experts and a limited number of major incidents. Companies often struggle to keep their DR documentation up to date and to test DR procedures at regular intervals while also ensuring that diverse and increasingly complex disaster scenarios are covered. It would be of significant value to automate as many parts of this process as possible. What if we could use the documentation of past incidents and problems logged in the IT Service Management tool when defining disaster scenarios and test scenarios for disaster recovery procedures? Generative AI can draw on the incident and problem management information available in the service management tool to create disaster scenarios and assist in crafting disaster recovery plans and test cases. In this way, the organization becomes more fluent not only in executing complex and diverse test cases but also in testing data-driven scenarios built on historical incidents. The organization thereby advances its disaster response capabilities by training on near-real-life disaster scenarios.
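To make the idea of data-driven DR scenarios more concrete, here is a minimal sketch in Python. It assumes incident records have already been exported from the service management tool; the `Incident` fields, the sample data and the prompt wording are all hypothetical, and no real GenAI API is called. The sketch only ranks historical failure patterns by outage time and prepares the text that could be handed to a model of your choice.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields; a real ITSM export will use its own schema.
    service: str
    cause: str
    outage_minutes: int


def rank_scenarios(incidents, top_n=3):
    """Rank (service, cause) pairs by total outage minutes, highest first."""
    impact = Counter()
    for inc in incidents:
        impact[(inc.service, inc.cause)] += inc.outage_minutes
    return impact.most_common(top_n)


def build_prompt(service, cause, total_minutes):
    """Build the text that would be sent to a GenAI model (assumed wording)."""
    return (
        f"Draft a disaster recovery test scenario for the service '{service}'. "
        f"Historical incidents caused by '{cause}' led to {total_minutes} "
        "minutes of outage. Include preconditions, failure injection steps, "
        "expected recovery steps and success criteria."
    )


# Illustrative sample data, not taken from any real system.
incidents = [
    Incident("payments-api", "database failover", 95),
    Incident("payments-api", "database failover", 40),
    Incident("web-frontend", "expired certificate", 30),
    Incident("payments-api", "network partition", 60),
]

ranked = rank_scenarios(incidents)
prompts = [build_prompt(svc, cause, minutes) for (svc, cause), minutes in ranked]
```

Ranking by accumulated outage time is only one possible choice; frequency or business impact could serve equally well. The generated scenarios would still need review by the subject matter experts.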
Incident Management and Response
Incident Management and Response is a critical capability of an SRE organization, aimed at meeting the agreed SLAs, minimizing service disruption and increasing service quality. In that context, GenAI can support IT operations in different ways: by drawing on the large data sets of prior incidents and problems in the IT Service Management tool, or through chatbots that speed up updates and responses. GenAI can help identify patterns and correlate events with resolution approaches. An organization can then not only identify vulnerabilities early and define preventive measures, but also shorten the incident response time and the mean time to restore by reducing manual intervention. Beyond supporting the resolution approach, GenAI can also enable faster client communication throughout the incident response process and faster status updates in reply to inquiries. In this use case, GenAI can increase customer satisfaction through reduced time to respond (and even time to resolve), more efficient communication and fewer manual errors. At the same time, the internal organization can benefit from automated updates to its knowledge base, leading to better documentation of incident resolution notes.
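The correlation of new events with past resolution approaches can be illustrated with a small Python sketch. Note the substitution: instead of a GenAI or embedding model, it uses plain word overlap (Jaccard similarity) as a stand-in, so that the example runs with the standard library only. The `PastIncident` record, the sample history and the `min_score` threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastIncident:
    # Hypothetical record shape for illustration only.
    summary: str
    resolution: str


def tokens(text):
    """Lower-case word set used as a crude text fingerprint."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of distinct words two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest_resolution(new_summary, history, min_score=0.3):
    """Return (resolution, score) of the closest past incident, or None."""
    best = max(
        history,
        key=lambda inc: jaccard(new_summary, inc.summary),
        default=None,
    )
    if best is None:
        return None
    score = jaccard(new_summary, best.summary)
    return (best.resolution, score) if score >= min_score else None


# Illustrative sample data, not taken from any real system.
history = [
    PastIncident("checkout service latency high after deploy", "roll back last release"),
    PastIncident("database connection pool exhausted", "raise pool size and restart workers"),
]

match = suggest_resolution("checkout latency high after new deploy", history)
```

In practice a language model would capture meaning rather than exact wording, but the flow stays the same: compare the new incident with history, surface the most similar case, and fall back to a human when nothing is close enough.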
Documentation creation
To create the baseline for SRE, it is very important to have accurate and up-to-date documentation of the services. Generative AI can be used to analyze documentation from various data sources and create an initial service model. Preparing the documentation for service modelling is usually a highly manual process that relies heavily on the knowledge of individual subject matter experts and on dispersed architecture documentation. During the subsequent maintenance of the service catalogue, GenAI can be used to synthesize information from different sources, map it to the right part of the template and keep the documentation current. Such an application of GenAI not only frees up the experts' time for other tasks but also helps ensure that the documentation meets specific quality standards. Of course, it remains essential that quality assurance is performed by a human.
Conclusion and Summary
The extensive capabilities of GenAI in analyzing data and patterns can also prove beneficial in other areas of resilience. Take network anomaly detection: GenAI can analyze large network-related data sets and define certain patterns as “normal network behavior”, which then allows it to flag deviating patterns as anomalies, trigger alerts and propose corrective measures. Another interesting use case is self-healing, where GenAI addresses common issues in an automated manner. There are many more use cases in which GenAI can support operational stability, and we will explore them in the future.
What should not be overlooked in the application of GenAI are the other critical capabilities an organization needs to develop in order to seize its benefits. Stakeholder enablement is critical for quality assurance of the results that GenAI produces, and also for developing the GenAI capabilities further by identifying and setting up additional use cases where Generative AI can act as an enabler. In summary, GenAI supports multiple use cases in general and for Site Reliability Engineering in particular. Given the effort required to develop comprehensive DR plans, keep accurate documentation available at all times and enable automated self-healing capabilities, GenAI can bridge a gap that many enterprises have.
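The notion of learning “normal network behavior” and flagging deviations can be sketched as follows. This is a deliberately simple statistical baseline (mean and standard deviation with a z-score threshold), not a GenAI model; the throughput figures and the threshold of 3.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Summarize 'normal' behavior as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def find_anomalies(baseline, observations, threshold=3.0):
    """Return (index, value) pairs more than `threshold` std devs from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(observations) if v != mu]
    return [
        (i, v)
        for i, v in enumerate(observations)
        if abs(v - mu) / sigma > threshold
    ]


# Illustrative throughput samples in Mbit/s, not real measurements.
normal_mbps = [100, 104, 98, 101, 97, 103, 99, 102]
baseline = learn_baseline(normal_mbps)
anomalies = find_anomalies(baseline, [101, 99, 180, 100, 20])
```

With these sample values, the observations 180 and 20 are flagged while the others fall within the learned range. A production setup would account for seasonality and many more signals, which is where richer models come in.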
We are looking forward to your inputs and perspectives. Feel free to reach out!
Sources:
** Technology Trends 2024 | Tech Vision | Accenture
*** AI for everyone
IT Operations Architect
9 months ago: Thank you for this insightful article, Kyra. The recent development of a variety of data solutions, such as vector databases, time series databases, and VictoriaMetrics, along with the advent of cheaper storage options, has revolutionized the way we handle and analyze data. Additionally, the widespread adoption of APIs across services and applications has significantly enhanced the interoperability and accessibility of data. These technological advancements have made it more feasible than ever for SRE teams to leverage GenAI and other advanced models.
Software Engineer | SRE
11 months ago: Thanks for sharing.
Cloud Container | Big Data | Telco/5G | Analytics | GenAI Engineering Leader, Global Professional Services, Dell Technologies
12 months ago: Great thought, thanks for sharing. #genai can certainly add a lot of value to Site Reliability Engineering. One of the key jobs of an SRE is to write code and develop automation to ensure that either a problem does not repeat or, if it does, no human intervention is needed to fix it. GenAI can be a force multiplier in writing that code at an accelerated pace. There are many GenAI tools, from GitHub Copilot to the latest StarCoder2, which can help.
Experienced Managing Director | Member of Accenture's Global Leadership Council | Pride Champion | Executive MBA | ETH
12 months ago: I'm fascinated by the potential of GenAI to revolutionize SRE practices. The ability to automate routine tasks and identify patterns in complex systems could significantly improve efficiency and reliability.