Gen-AI & SRE in focus on the C-suite agenda: learnings and recommendations.

The SRE Report 2024 by Catchpoint describes the state of Site Reliability Engineering:

  • 24% of organizations have breached a contractual service level agreement (SLA) in the last 12 months: this underlines the need to relate IT objectives to business outcomes.
  • 64% agree that monitoring productivity or experience-disruption endpoints is required: this shows a paradigm shift in how reliability practitioners now think about visibility.
  • 81% of organizations have two or more types of telemetry feeding their observability frameworks, and 43% have four or more: this confronts the fallacy that a single tool can provide visibility across all stacks.

Here are a few insights from the report and how they correlate with my own observations.


Having worked with a few (financial) corporations over the years, I see that the C-suite is focusing more on deriving benefits from Gen AI and applying site reliability engineering practices, but holds a few fallacies that need to be corrected first.


Recommendation #1 (SRE as a role and practice across the SDLC): SRE practices go beyond just "optimizing operations"; they focus on optimizing system design. A common fallacy is to focus on the "myopic" objective of post-production operations. To get SRE right (as a science), its focus must germinate right from requirements. This is part of the mindset shift.


Embed SREs in the product team to troubleshoot issues, supported by product management and a shared platform engineering function.

An SRE COE / chapter can be a horizontal function, but each feature team needs a dedicated SRE role.


Recommendation #2 (Identify system flow and customer journeys): System design is also about the structure of teams. Teams should embed SRE and be organized around the flow of value. Hence, adopt product centricity with empathy mapping and customer journeys to understand the existing flow: what the impediments and blockers are for customers at each touchpoint, and where efficiency is lost along the way due to systemic issues such as UI/UX design or performance irritants.

Recommendation #3 (Focus on business-critical journeys + loads of instrumentation): Talking of flow, identify business-critical journeys and add more instrumentation along the way, behind the scenes. Telemetry from these closely connected tools yields insights for detecting system anomalies, which can then be isolated quickly and therefore consume less of the error budget. The platform engineering team helps set up these tools so that SREs can localize issues.
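To make "add instrumentation along business-critical journeys" concrete, here is a minimal Python sketch, assuming an in-memory store in place of a real telemetry backend. The `instrument` context manager, the `telemetry` dict, `success_rate`, and the "payments" journey with its "validate" and "authorize" steps are all hypothetical names for illustration; a real setup would export these measurements to the platform team's observability tooling.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Optional

# Hypothetical in-memory telemetry store: "journey.step" -> list of (latency_s, succeeded)
telemetry: dict = defaultdict(list)


@contextmanager
def instrument(journey: str, step: str):
    """Record latency and success/failure for one step of a customer journey."""
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise  # re-raise so the caller still sees the failure
    finally:
        telemetry[f"{journey}.{step}"].append((time.perf_counter() - start, ok))


def success_rate(key: str) -> Optional[float]:
    """Fraction of successful calls for a step (a simple availability SLI); None if no data."""
    samples = telemetry.get(key, [])
    if not samples:
        return None
    return sum(1 for _, ok in samples if ok) / len(samples)


# Usage: a hypothetical "payments" journey with two steps.
for amount in (100, 250, -5):
    try:
        with instrument("payments", "validate"):
            if amount <= 0:
                raise ValueError("invalid amount")
        with instrument("payments", "authorize"):
            pass  # a call to an authorization service would go here
    except ValueError:
        pass  # the failure is already recorded in telemetry

print(round(success_rate("payments.validate"), 3))  # 0.667
print(success_rate("payments.authorize"))  # 1.0
```

Per-step success rates like these are what let an SRE localize a problem to one touchpoint of the journey rather than to "the application" as a whole.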


Recommendation #4 (Mindset): Increasingly, the best designs come when SREs are involved from the beginning (design, architecture) and when NFRs (security, performance, scalability, and infrastructure capacity planning) are thought through well before code is written and tested.

Organizations need to inculcate the SRE mindset through the shifts below.


Difference between the traditional Ops mindset and the SRE mindset


Recommendation #5 (Bring in AI)

SRE focuses on delivery and the stability of the production environment, while DevOps focuses on the end-to-end application lifecycle. If DevOps is thought of as an interface, SRE is a class that implements it. SRE is now going beyond classic AIOps by tapping the potential of machine learning and alert-correlation technologies.

With a growing digital footprint, we have large volumes of data in events, logs, and alerts. SRE teams should apply AI to more use cases to create intelligent IT operations, since ML models can detect patterns and build insight from past experience. AI and automation applied to operations (AIOps) help teams manage these vast volumes of data and achieve proactive incident resolution.
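As a rough illustration of pattern detection on operational data, the sketch below flags spikes in an alert-volume series with a rolling z-score. This is a simple statistical baseline standing in for the ML models an AIOps platform would use, not an ML model itself; the function name `detect_anomalies`, the window of 5, the threshold of 3 standard deviations, and the sample series are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def detect_anomalies(counts, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard deviations
    from the mean of the preceding `window` values (a rolling z-score)."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as anomalous.
            if counts[i] != mu:
                anomalies.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical alerts-per-minute series with a spike at index 8.
alerts_per_minute = [12, 14, 11, 13, 12, 15, 13, 12, 60, 14]
print(detect_anomalies(alerts_per_minute))  # [8]
```

A fixed trailing window is easy to reason about but ignores seasonality (for example, daily traffic cycles), which is one reason production AIOps tooling tends to rely on learned baselines instead.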

AI is meant to relieve the manual toil associated with the SRE role and free teams to focus on high-value work.

Recommendation #6 (Connected platforms binding instrumentation data)

Companies use a variety of tools for the same purpose. The resulting data has all three Vs (variety, velocity, and volume) and needs to be ingested and understood as one related view.

Thus, data isolated within each tool's own dashboards may not yield conclusive evidence for problem and incident management.

What do we need? We need to tie together data from all tools and feed it into one platform (one that ingests, processes, and parses both structured and unstructured data) and create views that surface key KPIs and metrics live. This can act as a single pane of glass for various personas: SRE engineering, CxO, testing lead, development lead, operations head, etc.

  • Collect the ever-increasing volumes of operations data.
  • Diagnose incident causes and intelligently correlate significant events and patterns based on real-time analysis, tying them to problems.
  • Ingest data feeds from a variety of tools, aggregate the flow of data, and present it on one common platform that offers AI and ML insights.
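The steps above can be sketched in Python, assuming two hypothetical tools whose output formats differ: records are first normalized into one common `Event` schema, then correlated per service within a time window. The record formats, field names, the `correlate` function, and the 5-minute window are all illustrative assumptions; a real platform would use richer keys (trace IDs, hosts, deployments) and a proper event store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Event:
    """Common schema that every tool's output is normalized into."""
    source: str
    service: str
    timestamp: datetime
    message: str


def normalize_log(record: dict) -> Event:
    # Hypothetical log-tool format: {"svc": str, "ts": ISO-8601 string with offset, "msg": str}
    return Event("logs", record["svc"], datetime.fromisoformat(record["ts"]), record["msg"])


def normalize_alert(record: dict) -> Event:
    # Hypothetical alerting-tool format: {"service": str, "epoch": Unix seconds, "title": str}
    ts = datetime.fromtimestamp(record["epoch"], tz=timezone.utc)
    return Event("alerts", record["service"], ts, record["title"])


def correlate(events, window=timedelta(minutes=5)):
    """Group events per service: an event joins that service's current group
    if it arrives within `window` of the group's latest event."""
    groups = []
    open_group = {}  # service -> its most recent group
    for ev in sorted(events, key=lambda e: e.timestamp):
        group = open_group.get(ev.service)
        if group and ev.timestamp - group[-1].timestamp <= window:
            group.append(ev)
        else:
            group = [ev]
            groups.append(group)
            open_group[ev.service] = group
    return groups


logs = [
    {"svc": "checkout", "ts": "2024-05-01T10:00:00+00:00", "msg": "DB connection timeout"},
    {"svc": "search", "ts": "2024-05-01T11:00:00+00:00", "msg": "slow query"},
]
alerts = [
    {"service": "checkout",
     "epoch": datetime(2024, 5, 1, 10, 2, tzinfo=timezone.utc).timestamp(),
     "title": "p99 latency high"},
]
events = [normalize_log(r) for r in logs] + [normalize_alert(r) for r in alerts]
for group in correlate(events):
    print(group[0].service, [e.source for e in group])
# checkout ['logs', 'alerts']
# search ['logs']
```

The point of the common schema is that the log line and the alert for "checkout" end up in one group, which neither tool's own dashboard would show on its own.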

Recommendation #7 (Apply SLOs, SLIs, and error budgets)

Define and measure SLOs, SLIs, and an error budget for your organization.

We can only improve if we set a business objective that is aspirational, while also defining an SLA that is realistic. The zone in between is our tolerance, which depends on how quickly we can respond. The error budget (measured as the time the service may be unavailable) defines that zone and keeps us prepared. Apply learnings from incidents and sound problem management through a live observability platform.
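For an availability SLO, the error budget is simply the share of the period the service is allowed to be down: (1 - SLO) x period. The Python sketch below assumes a time-based availability SLO; the 99.9% target, the 30-day window, and the 30 minutes of downtime are illustrative numbers, and the function names are hypothetical.

```python
from datetime import timedelta


def error_budget(slo: float, period: timedelta) -> timedelta:
    """Allowed unavailability for an availability SLO over a period: (1 - SLO) x period."""
    if not 0 < slo < 1:
        raise ValueError("SLO must be a fraction strictly between 0 and 1, e.g. 0.999")
    return period * (1 - slo)


def remaining_budget(slo: float, period: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after subtracting observed downtime (negative means it is exhausted)."""
    return error_budget(slo, period) - downtime


# Illustrative numbers: a 99.9% availability SLO over a 30-day window.
print(error_budget(0.999, timedelta(days=30)))  # 0:43:12, i.e. 43.2 minutes
print(remaining_budget(0.999, timedelta(days=30), timedelta(minutes=30)))  # 0:13:12
```

Each extra "nine" shrinks the budget tenfold (99.99% over 30 days leaves about 4.3 minutes), which is why the SLO should be set with the team's realistic response capability in mind.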

We can thus balance the speed of feature delivery (a business need) against stability, by accounting for the risk of being down when errors occur and for our confidence in resolving them. Being good at SRE practice therefore means striking a balance between being observable and being reliable (with high availability, scalability, performance, flexible capacity, and security).
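One common way to operationalize this balance is an error-budget policy: keep shipping features while enough budget remains, and shift effort to reliability work once it is nearly spent. The sketch below assumes a request-based SLI (good requests / total requests) over the SLO window; the `BudgetStatus` class, the `release_decision` function, and the 20% reserve threshold are hypothetical choices for illustration, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    slo: float          # availability target, e.g. 0.999
    good_events: int    # successful requests observed in the SLO window
    total_events: int   # all requests observed in the SLO window

    @property
    def sli(self) -> float:
        """Measured availability: good events / total events."""
        return self.good_events / self.total_events if self.total_events else 1.0

    @property
    def budget_consumed(self) -> float:
        """Share of the error budget used: observed error rate / allowed error rate."""
        return (1 - self.sli) / (1 - self.slo)


def release_decision(status: BudgetStatus, reserve: float = 0.2) -> str:
    # Hypothetical policy: keep shipping features while at least `reserve`
    # of the error budget remains; otherwise shift effort to reliability work.
    if status.budget_consumed <= 1 - reserve:
        return "ship features"
    return "freeze features, focus on reliability"


print(release_decision(BudgetStatus(0.999, 999_500, 1_000_000)))  # ship features
print(release_decision(BudgetStatus(0.999, 998_500, 1_000_000)))  # freeze features, focus on reliability
```

Keeping a reserve rather than waiting for the budget to hit zero leaves room to absorb one more incident without breaching the SLA.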



More articles by Vikram Abrol

  • 7 steps to get started towards - Digital transformation journey

    7 steps to get started towards - Digital transformation journey

    Digital transformation is the process by which companies embed technologies across their businesses to drive market and…

    9 条评论
  • Agile Architecture

    Agile Architecture

    I got a question posted to me - What is the role of Agile in decision making when it comes to design and Architecture ?…

  • Balancing autonomy with accountability

    Balancing autonomy with accountability

    This is topic that we had today in #ani ( Agile network India ) online event in Jaipur chapter As a panelist , I am…

    10 条评论
  • Cumulative flow diagrams

    Cumulative flow diagrams

    How can the cause of delays in the value stream be identified? Where are user stories being blocked? Are non-value…

  • #agiletransformation for Product based Digital Enterprises

    #agiletransformation for Product based Digital Enterprises

    #agiletransformation #businessagility #productteam To move towards next generation digital business, Technology and…

    1 条评论
  • Product development with customer - Persona journey maps

    Product development with customer - Persona journey maps

    Product creation is all about discovering latent needs of end customer . For which it is required to engage and…

    4 条评论
  • Consulting tips for Enterprises transforming (scaling ) in Agile

    Consulting tips for Enterprises transforming (scaling ) in Agile

    Establishing Agile Center of Excellence (COE) at portfolio level ( IT and including business) that is supported by the…

  • Transformations & evolution of a true DevOps team

    Transformations & evolution of a true DevOps team

    As teams shift from waterfall to agile , they have to unlearn and learn the new ways of working , role and…

  • Enterprise Agile Transformations - "Essentials"

    Enterprise Agile Transformations - "Essentials"

    Alignment with purpose / pursuit for Agile movement across units in enterprise , top -> down , sponsor -> teams must be…

  • Anti-Agile ways in SAP

    Anti-Agile ways in SAP

    SAP is a big mammal ( a White Elephant ) and Agile looks for fox - quick and spontaneous . Agile and DevOps are still…
