Powering Reliability through Observability

The picture does not relate to the article, but it does promise a lot of taste and should make your day feel good :-) I hope this article complements that taste!

"I have been part of many system outages in my career, across different business tiers, and I am proud of the teams that have progressed incredibly from there. Many things seem relevant for engineering teams working in a complex and distributed landscape. One thing that stands out is how we have continually improved our resiliency and reduced the impact of those outages. But there is more to do: whilst we have matured as an industry, we need to keep working on this subject and develop a culture of continuous improvement."

The software industry is growing. We have seen a massive increase in cloud adoption and various other modernisation exercises over the last few years, and this will continue to be a focus area. Digital transformation is seen as a necessity, and as microservices, serverless computing and distributed architectures and deployments become prevalent, the technical complexity of designing and building a reliable system has increased manyfold. Hence the topic of observability and instrumentation comes up far more often in our discussions.

Technology is at the heart of our business; resilient and reliable systems facilitate growth and drive customer happiness. Customer and stakeholder expectations have risen, and zero tolerance towards outages or disruptions is now the de facto principle. There is therefore a constant focus on resilient, fault-tolerant architecture built on a supported tech stack.

With the advent of distributed architecture, it has become harder to spot an issue accurately, and therefore to troubleshoot and improve. Take, for example, our container-based architecture: interdependent microservices scattered across multiple hosts, on infrastructure that scales up and down, make it difficult for developers to know the current performance. According to a recent study, an hour of downtime costs anywhere between £250K and £1M, depending on the business we operate. But not everything can be measured as direct costs; there are also indirect costs, such as interruptions at work that delay otherwise productive work.

Observability through meaningful SLIs and SLOs gives valuable insight into the state of the systems we operate, brings visibility and helps drive customer happiness. Instrumentation should be built into all the code that we build and operate.
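To make the idea concrete, here is a minimal sketch of an availability SLI checked against an SLO. The `RequestWindow` type, the request counts and the 99.9% target are all assumptions chosen for illustration; a real system would read these counts from its own instrumentation.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Hypothetical request counts for one measurement window."""
    total: int  # all requests served in the window
    good: int   # requests that met the success criterion


def availability_sli(window: RequestWindow) -> float:
    """Return the fraction of good requests (the SLI) for the window."""
    if window.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return window.good / window.total


def meets_slo(window: RequestWindow, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_target


# Illustrative numbers only, not taken from any real service.
window = RequestWindow(total=1_000_000, good=999_200)
print(f"SLI: {availability_sli(window):.4%}")
print("SLO met:", meets_slo(window, slo_target=0.999))
```

What counts as a "good" request is itself part of the SLI definition, which is why it is passed in as a count here rather than derived inside the function.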

Defining an SLA, an error budget and SLOs is not just an engineering decision. It is a joint Product and Engineering decision that requires input from stakeholders across the organization, and it therefore remains a valuable collaboration exercise. It needs to be carefully thought through and established, and its output needs to be part of the team cadence, to continuously remind us to make our systems reliable. Priority needs to be given to that backlog of work, depending on what the error budgets and the SLOs convey.
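As a rough sketch of how an error budget can feed into prioritisation, the snippet below computes how much of the budget is left and turns that into a suggested focus. The function names, the 25% threshold and the sample numbers are hypothetical; the actual policy is exactly the kind of decision Product and Engineering would agree together.

```python
def error_budget_remaining(total: int, failed: int, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so no budget has been consumed
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def suggested_focus(remaining: float, threshold: float = 0.25) -> str:
    """Map remaining budget to a backlog focus (threshold is an assumption)."""
    return "reliability" if remaining <= threshold else "features"


# Illustrative numbers: 800 failures against a 99.9% SLO on 1M requests.
remaining = error_budget_remaining(total=1_000_000, failed=800, slo_target=0.999)
print(f"Error budget remaining: {remaining:.0%}")
print("Suggested focus:", suggested_focus(remaining))
```

Reviewing a number like this in the regular team cadence is one way to keep the reliability backlog visible alongside feature work.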

We must also understand that the more reliable a service is, the more expensive it is to operate. A good starting point is to establish the lowest level of measure that we can manage and that underpins the SLA. A very high availability target, if not thought through properly, can become a problem in itself, so the intent must be balanced and defined so that teams look at the right things without getting fatigued. SLAs, SLOs and SLIs are powerful tools, and whilst it can be tricky to get the balance right, the suggestion is to course correct constantly within a culture of continuous improvement, thus improving engineering effectiveness.
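The trade-off becomes clearer when an availability target is translated into the downtime it permits. The sketch below assumes a 30-day window and a few commonly quoted targets; both are illustrative choices, not recommendations.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed measurement window


def allowed_downtime_minutes(
    availability: float, period_minutes: int = MINUTES_PER_30_DAYS
) -> float:
    """Return the downtime (in minutes) a given availability target permits."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes


# Each extra "nine" cuts the permitted downtime by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows {minutes:.1f} minutes per 30 days")
```

A 99.9% target leaves roughly 43 minutes a month, while 99.99% leaves just over 4, which is why each step up tends to demand disproportionately more engineering and operational effort.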

Tools can only help up to a point; it is the cultural change that we need to drive: understand what matters, define it and then measure it.

Ashish Banerjee

Director of Engineering at OpenText

2 years ago

You need to leverage ITOM AIOps to bring peace of mind

More articles by Anjan Mazumdar

  • Inside and Beyond 15

    My Journey Tesco - Inside and Beyond 15 – A Reflection A hint of sun through the clouds, meant a possibility that the…

    23 comments
  • Kafka Summit 2023 London - My Takeaway and a Summary

    It was a huge moment attending the Kafka Summit 2023 in London, along with my Tesco colleagues. Met some great minds at…

    4 comments
  • Me at Tesco ...

    Hi, I’m Anjan, Head of Software Development at Tesco. I am a passionate technologist with experience in Retail…

    55 comments
