Gen-AI & SRE in focus on the C-suite agenda - learnings and recommendations
Vikram Abrol
Enterprise Transformation Leader at Infosys, IT Process Consulting | Lifelong learner | AI enthusiast | Blogger | Helping organizations and systems become "simpler, better, faster"
The SRE Report 2024 describes the state of Site Reliability Engineering.
Here are a few insights from the report, correlated with my own observations.
Having worked with a few corporations (in the financial sector) over the years, I see the C-suite focusing more on deriving benefits from Gen AI and on applying site reliability engineering practices, but holding a few fallacies that first need to be corrected.
Recommendation # 1 : ( SRE as a role and practice across SDLC ) SRE practices go beyond just "optimizing operations"; they focus on optimizing system design. A common fallacy is to focus on the "myopic" objective of post-production operations. Get it right: SRE (as a science) and its focus must germinate right from requirements. This is part of the mindset shift.
Embed SREs inside the product team to troubleshoot issues, supported by product management and platform engineering as shared functions.
An SRE CoE / chapter can be a horizontal function, but a dedicated SRE role is needed in the feature teams.
Recommendation # 2 ( Identify system flow and customer journeys ) : System design is also about the structure of teams. Teams should embed SRE and be set up around the flow of value. Hence, adopt product centricity, with empathy mapping and customer journeys, to understand the existing flow, what impediments or blockers the customer faces at the touchpoints, and where efficiency is lost along the way due to systemic issues such as UI/UX design, performance irritants, etc.
Recommendation # 3 ( focus on business critical journey + loads of instrumentation ) Talking of flow: identify the business-critical journeys and add more instrumentation along them, behind the scenes. Telemetry from closely connected tools yields insights that help detect system anomalies, which can then be isolated quickly and hence consume less of the error budget. The platform engineering team helps set up these tools so that SREs can localize issues.
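To make "instrumentation along a journey" concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory store (`step_metrics`) and a hypothetical helper (`instrument`); a real setup would export these measurements to whatever telemetry tooling the platform team provides.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store keyed by (journey, step).
# A real deployment would ship these numbers to a telemetry backend.
step_metrics = defaultdict(
    lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0}
)


@contextmanager
def instrument(journey, step):
    """Record call count, error count and elapsed time for one journey step."""
    key = (journey, step)
    start = time.perf_counter()
    try:
        yield
    except Exception:
        # Count the failure, then let the caller see the original error.
        step_metrics[key]["errors"] += 1
        raise
    finally:
        step_metrics[key]["calls"] += 1
        step_metrics[key]["total_seconds"] += time.perf_counter() - start


def error_rate(journey, step):
    """Fraction of calls to this step that raised an exception."""
    m = step_metrics[(journey, step)]
    return m["errors"] / m["calls"] if m["calls"] else 0.0
```

Wrapping each step of a critical journey, for example `with instrument("payment", "authorize"): ...`, gives per-step error rates and timings, which is what lets an SRE localize an issue to one step rather than the whole journey.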
Recommendation # 4 ( mindset ) Increasingly, it is felt that the best design emerges when SREs are involved right from the beginning (design, architecture). NFRs such as security, performance, scalability and infrastructure capacity planning, along with all related requirements and activities, need thinking through well before code is written and tested.
Organizations need to inculcate this SRE mindset.
Recommendation # 5 ( Bring in AI )
SRE focuses on delivery and the stability of the production environment, while DevOps focuses on the end-to-end application lifecycle. If DevOps is considered an interface, SRE is a class that implements it. SRE is now going a step further with AIOps, tapping the potential of machine learning and alert-correlation technologies.
With a growing digital footprint, we have large volumes of data in events, logs and alerts. SRE teams should apply more AI use cases to create intelligent IT operations, as ML models can detect patterns and build insight from past experience. AI and automation applied to operations (AIOps) help teams manage these vast volumes of data and move towards proactive incident resolution.
AI is there to relieve the manual toil associated with the SRE role and free teams to focus on high-value work.
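As a deliberately simple stand-in for "detecting patterns from past experience", the Python sketch below flags a data point when it deviates sharply from the values just before it. This is a plain statistical baseline (a rolling z-score), not a trained ML model; the function name `find_anomalies` and its parameters are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(counts, baseline_size=5, threshold=3.0):
    """Return indices whose value is far from the preceding baseline.

    counts        -- e.g. alerts or errors per minute (illustrative input)
    baseline_size -- how many preceding points form the "past experience"
    threshold     -- how many standard deviations count as anomalous
    """
    anomalies = []
    for i in range(baseline_size, len(counts)):
        baseline = counts[i - baseline_size:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is unusual.
            if counts[i] != mu:
                anomalies.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative data: a steady error count with one spike at index 6.
errors_per_minute = [10, 11, 9, 10, 12, 10, 95, 11]
spikes = find_anomalies(errors_per_minute)
```

Real AIOps tooling goes much further (seasonality, correlation across alerts), but the principle of comparing the present against a learned baseline is the same.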
Recommendation # 6 ( Connected Platforms binding instrumentation data )
Companies often use a variety of tools for the same purpose. There is variety, velocity and volume (all 3 Vs) of data that needs to be ingested and understood as one related view.
Thus, isolated data within the dashboards of the respective tools may not lead to conclusive evidence on a problem or in incident management.
What do we need? We need to tie together data from all tools and feed it into one platform (that ingests, processes and parses the data, both structured and unstructured) and create views from it that help track key KPIs and metrics live. This can act as a single pane of glass for various personas: SRE engineering, CXO, testing lead, development lead, operations head, etc.
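The idea of one platform over many tools can be sketched in a few lines of Python: each tool's records are normalized into one common shape, and every view is built from that shape. The two tools, their payload fields (`svc`, `level`, `priority`, and so on) and the severity mapping are all hypothetical, invented here only to illustrate the pattern.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Common shape that every tool's record is normalized into."""
    source: str    # which tool emitted the record
    service: str   # which application or service it concerns
    severity: str  # normalized to "info", "warning" or "critical"
    message: str


def from_monitoring_tool(raw):
    # Hypothetical payload: {"svc": ..., "level": ..., "msg": ...}
    return Event("monitoring", raw["svc"], raw["level"].lower(), raw["msg"])


def from_log_tool(raw):
    # Hypothetical payload with a numeric priority, lower = more severe.
    if raw["priority"] <= 2:
        severity = "critical"
    elif raw["priority"] <= 4:
        severity = "warning"
    else:
        severity = "info"
    return Event("logs", raw["application"], severity, raw["text"])


def critical_by_service(events):
    """One possible view: count of critical events per service."""
    return Counter(e.service for e in events if e.severity == "critical")
```

Once everything is an `Event`, different personas can get different views (per service, per severity, per source) from the same data, instead of each reading a separate tool's dashboard.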
Recommendation # 7 ( apply SLO-SLI-Error budget )
Define and measure SLOs, SLIs and an error budget for your organization.
We can only improve if we set a business objective that is aspirational, while at the same time having a defined SLA that is realistic. The zone in between is our tolerance, governed by "how quickly can we respond". The error budget (measured as time of non-availability) gives us that zone of preparedness. We then apply learning from incidents and do the right problem management via a live observability platform.
We can thus balance speed of feature delivery (the business need) against stability, by accounting for the risk of being down when errors occur and for our confidence in resolving them. Being good at SRE practice therefore means striking a balance between being observable and being reliable (with high availability, higher scalability, higher performance, flexible capacity and high security).
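The error-budget idea in this recommendation becomes concrete with a little arithmetic. The Python sketch below assumes the common convention that the budget is the unavailability an SLO permits, that is (1 - SLO) multiplied by the measurement window; the function names and the 30-day window are illustrative choices, not something prescribed by the report.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO.

    slo is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining_minutes(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already incurred in the window.

    A negative result means the budget is overspent.
    """
    return error_budget_minutes(slo, window_days) - downtime_minutes


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# After a 30-minute incident, roughly 13.2 minutes remain.
remaining = budget_remaining_minutes(0.999, downtime_minutes=30)
```

A team with budget remaining can afford to ship features faster; a team that has spent its budget would typically prioritize stability work until the window resets. That is the balance between speed and stability described above.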