Powering Reliability through Observability

Anjan Mazumdar

Technology Director at Tesco

Published: 24 November 2022

The picture does not relate to the article, but it does promise plenty of flavour and should brighten your day :-) I hope this article complements that taste!

"I have been part of many system outages in my career, across different business tiers, and I am proud of the teams that have progressed incredibly since then. Many of the lessons seem relevant for engineering teams working in complex, distributed landscapes. One thing that stands out is how we have continually improved our resiliency and reduced the impact of those outages. But there is more to do: whilst we have matured as an industry, we need to keep working on this subject and develop a culture of continuous improvement."

The software industry is growing. Over the last few years we have seen a massive increase in cloud adoption and various other modernisation exercises, and this will continue to be a focus area. Digital transformation is seen as a necessity, and because microservices, serverless computing and distributed architectures and deployments are becoming prevalent, the technical complexity of designing and building a reliable system has increased manyfold. Hence, the topic of observability and instrumentation comes up far more often in our discussions.

Technology is at the heart of our business; resilient and reliable systems facilitate growth and drive customer happiness. Customer and stakeholder expectations have risen, and zero tolerance towards outages or disruptions is now the de facto principle. There is therefore a constant focus on resilient, fault-tolerant architecture built on a supported tech stack.

With the advent of distributed architecture, it has become harder to spot an issue accurately, and therefore to troubleshoot and improve. Take, for example, a container-based architecture: it introduces challenges such as interdependent microservices which, when scattered across multiple hosts on infrastructure that scales, make it difficult for developers to know the current performance. According to a recent study, an hour of downtime costs anywhere between 250K and 1M (GBP), depending on the business we operate in. But not everything can be measured as direct costs; there are also indirect costs, such as interruptions at work that delay what would otherwise be productive work.

Observability through meaningful SLIs and SLOs (service level indicators and objectives) gives valuable insight into the state of the systems we operate, helps bring visibility and drives customer happiness. Instrumentation should be built into all the code that we build and operate.
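To make "instrumentation built into the code" a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular team or tool does it: the names RequestStats, instrumented and checkout are hypothetical, and the counters live in memory, whereas a real service would export them to a metrics backend.

```python
import time
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class RequestStats:
    """In-memory counters (assumption: a real system would export these)."""
    total: int = 0
    failures: int = 0
    latencies_s: list = field(default_factory=list)

    def availability_sli(self) -> float:
        """Fraction of calls that succeeded; 1.0 when nothing was recorded."""
        if self.total == 0:
            return 1.0
        return (self.total - self.failures) / self.total


def instrumented(stats: RequestStats):
    """Decorator that records call count, failures and latency of a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            stats.total += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                # Count the failure, then let the caller see the original error.
                stats.failures += 1
                raise
            finally:
                # Latency is recorded for successful and failed calls alike.
                stats.latencies_s.append(time.perf_counter() - start)
        return wrapper
    return decorator


checkout_stats = RequestStats()


@instrumented(checkout_stats)
def checkout(basket_total: float) -> float:
    # Hypothetical business function, used only for illustration.
    if basket_total < 0:
        raise ValueError("basket total cannot be negative")
    return round(basket_total * 1.2, 2)
```

After a few calls to checkout, checkout_stats.availability_sli() gives the share of successful calls, which is one simple candidate for an SLI. Which events count as "good" is exactly the kind of definition that needs agreeing with stakeholders.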

Defining an SLA, an error budget and an SLO is not just an engineering decision. It is a joint Product and Engineering decision that requires input from stakeholders across the organization, and it therefore remains a valuable collaboration exercise. It needs to be carefully thought through and established, and its output needs to be part of the team cadence, to continuously remind us to make our systems reliable. The resulting backlog of work needs to be prioritised according to what the error budgets and SLOs convey.
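As a rough illustration of what an error budget "conveys", the sketch below derives the budget from an SLO target and reports how much of it has been consumed. The function and field names are hypothetical, and the rule that an exhausted budget should shift priority to reliability work is an assumed example policy, not something prescribed here; the actual policy is for Product and Engineering to agree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    allowed_bad_events: float  # bad events the SLO tolerates in the window
    consumed_fraction: float   # 0.0 = untouched, 1.0 = fully spent
    exhausted: bool            # True once the budget is fully spent


def error_budget_report(slo_target: float, total_events: int,
                        bad_events: int) -> BudgetReport:
    """Compare observed bad events with the budget implied by the SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= bad_events <= total")

    # The error budget is simply the share of events the SLO allows to fail.
    allowed = (1.0 - slo_target) * total_events
    consumed = bad_events / allowed
    return BudgetReport(allowed, consumed, consumed >= 1.0)


if __name__ == "__main__":
    report = error_budget_report(slo_target=0.999,
                                 total_events=1_000_000,
                                 bad_events=250)
    print(f"budget consumed: {report.consumed_fraction:.0%}")
    if report.exhausted:
        print("budget spent: prioritise the reliability backlog")
    else:
        print("budget remaining: feature work can continue")
```

Reviewing a report like this as part of the regular team cadence is one way to keep the reliability backlog visible alongside feature work.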

We must also understand that the more reliable a service is, the more expensive it is to operate. A good starting point is to establish the lowest level of measure that we can manage, underpinning the SLA. A very high availability target, if not thought through properly, could become an issue, so the intent must be balanced and defined in a way that ensures teams are looking at the right things without getting fatigued. SLAs, SLOs and SLIs are powerful tools, and whilst it can be tricky to get the balance right, the suggestion is to constantly course-correct with a culture of continuous improvement, thus bringing in engineering effectiveness.
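A quick calculation shows why a very high availability target deserves careful thought. The sketch below converts an availability percentage into the downtime it permits over a period. The 30-day window and the function name are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target permits per period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    period_minutes = period_days * 24 * 60
    return (100.0 - availability_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    # Each extra "nine" cuts the permitted downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} min of downtime")
```

Over 30 days, 99.9% leaves roughly 43 minutes of downtime, while 99.99% leaves a little over 4 minutes. A tighter target leaves very little room for incidents, deployments or maintenance, which is where the extra operating cost and the risk of team fatigue come from.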

Tools can only help up to a point; it is the cultural change that we need to drive: understand what matters, define it and then measure it.

Ashish Banerjee

Director of Engineering at OpenText

2 years ago

You need to leverage ITOM AIOps to bring peace of mind
