Modernize your operations with Site Reliability Engineering
Most enterprises today are looking for turnkey solutions to cope with IT operations and the restructuring demands brought about by digital transformation. Digital transformation has also changed how IT services are designed, developed, delivered, and operated.
For a couple of decades, the Information Technology Infrastructure Library (ITIL) has been the leading IT Service Management framework adopted by enterprises across the globe. However, alongside frameworks like ITIL, methodologies such as Dev(Sec)Ops are gaining importance every day as organizations all over the world explore emerging technologies and agile ways of working.
ITIL 4, the latest update released in 2019, remains a relatively complex governance model with four dimensions, seven guiding principles, a service value system, and 34 practices (known as processes in earlier versions). While ITIL 4 pays some attention to Agile and DevOps, the framework covers more or less every aspect of software delivery and operations and seemingly tries to be the single source of truth in both principles and practices for IT management.
That said, agile development alone is not enough for the successful delivery of cloud-native or cloud-ready applications. As methods of developing, testing, and releasing new functions become more agile, service management must also transform to support this paradigm shift. IT operations for cloud-native applications differ considerably from traditional approaches.
Running cloud applications also brings new challenges. For example, the operations focus of traditional IT is primarily on the infrastructure level. Cloud-native applications, on the other hand, run on commodity infrastructure and use hybrid cloud-based services that are ready and easy to consume via APIs. Accordingly, IT operations require a different approach to managing reliability and application scaling. The focus of issue resolution has shifted to the application level, so traditional operations are no longer viable in this context. It is also worth noting that the required transformation of service management will have implications for various areas such as the organization itself, processes, tools, and culture.
You Build It, You Run It
Site Reliability Engineering (SRE) brings a fresh perspective to all of those areas.
"SRE is an approach to operations that ensures that continuously delivered applications run efficiently and reliably by using software engineering and automation solutions." This includes a data-driven approach to operations, a culture of automation to drive efficiency and reduce risk, and a hypothesis-driven methodology for incident, performance, and capacity tasks.
Breaking this down, the following guiding principles are key to any implementation of SRE:
- Use automation to perform operations to scale with load.
- Have an SLA or SLO for the service and measure against it.
- Practice observability, including the four golden signals: latency, traffic, errors, and saturation.
- Use actionable, symptom-based alerts. To govern actions, use automated runbooks.
- Hold a blameless postmortem for every event.
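To make the "measure against an SLO" and "symptom-based alerts" principles more concrete, here is a minimal Python sketch of an error-budget check. It is an illustration under stated assumptions, not a reference implementation: the `SloWindow` shape, the function names, and the 25% alerting threshold are all hypothetical, and in practice the request counts would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability(window: SloWindow) -> float:
    """Fraction of requests that succeeded; an empty window counts as available."""
    if window.total_requests == 0:
        return 1.0
    return 1.0 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, <= 0.0 = exhausted)."""
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    if allowed_failures == 0:
        # Either no traffic or a 100% target: any failure exhausts the budget.
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1.0 - window.failed_requests / allowed_failures


def should_alert(window: SloWindow, slo_target: float, min_budget: float = 0.25) -> bool:
    """Symptom-based alert: fire when less than `min_budget` of the budget is left.

    The 25% default is an assumed threshold chosen only for illustration.
    """
    return error_budget_remaining(window, slo_target) < min_budget


if __name__ == "__main__":
    window = SloWindow(total_requests=1_000_000, failed_requests=400)
    print(f"availability:     {availability(window):.4%}")
    print(f"budget remaining: {error_budget_remaining(window, 0.999):.1%}")
    print(f"alert:            {should_alert(window, 0.999)}")
```

With a 99.9% target, one million requests leave room for roughly 1,000 failures, so 400 failed requests consume about 40% of the budget. The alert is tied to what users experience (failed requests) rather than to an infrastructure cause, which is the point of symptom-based alerting.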
Making the Shift
For many decades, enterprises relied on monitoring, service desk, and event management tools to manage traditional IT infrastructures. At IBM, we have had the opportunity to work with many clients across the globe and have gained valuable insights as we continue to modernize our well-established IT Service Management portfolio.
The success of SREs depends on the tools an organization chooses to Plan, Create, Verify, Package, Release, Configure, and Monitor the software it builds.
By leveraging the Red Hat OpenShift Container Platform, IBM Cloud Pak for MultiCloud Management, and Watson AIOps, as well as ecosystem offerings such as Sysdig, IBM is able to provide all the capabilities needed by SREs. Because organization, processes, and culture matter as well, IBM Garage for Client Solution Acceleration offerings play a pivotal role in aligning the organization to meet its overall IT operations and business objectives.
I highly encourage you to visit the links below to learn more about SRE, ITIL 4, and IBM's future-proofing capabilities in these domains:
https://www.ibm.com/cloud/cloud-pak-for-management
https://www.ibm.com/products/watson-aiops
https://www.axelos.com/case-studies-and-white-papers/itil-4-and-digital-transformation