Observability — That Last 9

Akash Saxena

CTO Jiocinema | CTO Excellence Award 2024 | ex-CTO Hotstar[Asia|MENA|SEA] | ex-OpenTable

Published: October 1, 2022

TL;DR: A stitch in time saves nine. A discussion of the key building blocks of observability.

Mindful Observability

Pick your metaphor. Not knowing how fast you’re going, or where you’re heading, mostly ends in wreckage, at least in a technical setting. A skilled operator must know her machine, so that she can be mindful about the journey and extract the most out of the experience.

The problem at hand is visualising how our distributed system and its constituents are behaving. We want to know this because we are customer obsessed and want to ensure that our product always works for our customers, while being frugal for our business.

Uptime & bottomline, both matter.

Building Blocks

Behold, here lie, in full view, services. How does it all tie together, though?

Self Maintaining Topology

Point-in-time topologies and architecture block diagrams go only so far. Maintaining organisational knowledge bases is such a complex problem that it merits its own blog. A knowledge base (KB) article is like a car: it loses half its value the moment you drive it off the lot!

What we have are traffic logs, and traffic in all directions. There is gold here. Leveraging that input to construct a topology that is self-maintaining is the starting point. Being able to simply visualise how everything connects together is the foundational rock.
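
As a minimal sketch of the idea, assume each flow log record can be reduced to a source service, a destination service and a request count. The record shape and the service names below are hypothetical, not taken from any particular platform. Aggregating such records yields a directed graph that rebuilds itself every time fresh logs arrive, with no hand-maintained diagram involved.

```python
from collections import defaultdict

# Hypothetical flow-log records: (source service, destination service, request count).
FLOW_LOGS = [
    ("gateway", "playback", 120),
    ("gateway", "auth", 80),
    ("playback", "catalog", 60),
    ("playback", "auth", 30),
    ("gateway", "playback", 40),
]


def build_topology(flow_logs):
    """Aggregate flow records into a directed graph: source -> {destination: total requests}."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, count in flow_logs:
        graph[src][dst] += count
    # Convert the nested defaultdicts into plain dicts for easier inspection.
    return {src: dict(dsts) for src, dsts in graph.items()}


topology = build_topology(FLOW_LOGS)
for src, dsts in sorted(topology.items()):
    for dst, total in sorted(dsts.items()):
        print(f"{src} -> {dst}: {total} requests")
```

Re-running the aggregation over a sliding window of recent logs is what would keep such a map current. An edge that stops carrying traffic simply drops out of the next build.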

Flow Analytics

With the view in place, let’s turn our attention to what’s riding on those pipes. Siloed observability will tell you when your particular block is running into trouble (or not). You’re left to figure out the rest through what is often tribal knowledge.

Flow logs will ultimately help you build your “service compass”: north-south (N/S) and east-west (E/W) connectivity, which is super helpful for causality and also for dependency graphs when changes are made in a block. This can be gold when your team is growing, new, or just plain moving too fast to know all the moving parts.

Flow log modeling solves for discoverability and for smells, based on deviations observed from the norm. As a bonus, if we factor in seasonality and “curve-fit” appropriately, it can help immensely in discovering problems quickly, before they snowball!
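
To make the dependency-graph point concrete, here is a small sketch that answers "who is affected if this block changes?". It assumes the flow logs have already been reduced to caller-to-callee edges. The edge list and service names are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical caller -> callee edges derived from flow logs.
EDGES = [
    ("gateway", "playback"),
    ("gateway", "auth"),
    ("playback", "catalog"),
    ("playback", "auth"),
    ("search", "catalog"),
]


def impacted_by(edges, changed):
    """Return every service that directly or transitively calls `changed`."""
    callers = defaultdict(set)
    for src, dst in edges:
        callers[dst].add(src)

    seen = set()
    queue = deque([changed])
    # Breadth-first walk "upstream" along the reversed edges.
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


catalog_impact = impacted_by(EDGES, "catalog")
print(sorted(catalog_impact))
```

A new team member can ask this question of the graph instead of relying on tribal knowledge.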
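
One simple way to factor in seasonality, offered here as a sketch rather than a description of any product, is to keep a separate baseline per hour of day and flag values that stray too many standard deviations from it. The sample history, the hourly bucketing and the threshold of 3 are all assumptions chosen for illustration. Real "curve fitting" would likely also account for day of week, holidays and trend.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical history of (hour_of_day, requests_per_minute) samples.
HISTORY = [
    (3, 100), (3, 110), (3, 90), (3, 105),        # quiet early-morning traffic
    (20, 1000), (20, 1100), (20, 900), (20, 1050),  # evening peak
]


def hourly_baseline(history):
    """Return {hour: (mean, standard deviation)} for hours with at least two samples."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, threshold=3.0):
    """Flag a value that deviates from that hour's norm by more than `threshold` sigmas."""
    if hour not in baseline:
        return False  # no baseline yet, so nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


baseline = hourly_baseline(HISTORY)
print(is_anomalous(baseline, 20, 1000))  # evening peak at its usual level
print(is_anomalous(baseline, 3, 1000))   # the same volume in the small hours
```

The same number can be business as usual at one hour and a smell at another, which is why a single static threshold tends to either miss problems or cry wolf.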

Resiliency

Can the system survive fatalities in individual blocks of execution? This is akin to having the ability to isolate a fire in a self-contained block / concern, so as to prevent its spread and survive with degradation. If we view each block as expendable, then what does it take to “mock” that block?

Inevitably, after “k” failures it is no longer tenable to degrade, but the ability to degrade until then still improves your overall resiliency.

We can deem this a “panic response”: the block continues to act as if it’s there whereas, in reality, it’s non-functional. Having positive affirmation that the leaky chamber is shut is very valuable in an incident. The ability to spot smoke, replace behavior (panic), and know that things are OK is a factor in reducing your Mean Time To Recovery (MTTR), and an observability platform lets you do that solidly.
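
A panic response can be sketched as a thin wrapper around the call into a block. The `PanicGuard` class below is a hypothetical illustration, not an existing library. After a set number of consecutive failures it stops calling the block and serves a canned ("mocked") answer instead, and it exposes a `panicking` flag so that operators can confirm the chamber is actually shut.

```python
class PanicGuard:
    """Wrap a call to one block; after `max_failures` consecutive errors, serve a canned response."""

    def __init__(self, call, fallback, max_failures=3):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def panicking(self):
        """True once the block is being mocked rather than called."""
        return self.consecutive_failures >= self.max_failures

    def __call__(self, *args, **kwargs):
        if self.panicking:
            return self.fallback  # block is isolated; do not touch it
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            return self.fallback  # degrade instead of propagating the failure
        self.consecutive_failures = 0
        return result

    def reset(self):
        """Leave panic mode, e.g. once the block is confirmed healthy again."""
        self.consecutive_failures = 0


def broken_recommendations(user_id):
    # Stand-in for a failing downstream block (hypothetical).
    raise RuntimeError("recommendations service is down")


guard = PanicGuard(broken_recommendations, fallback=[], max_failures=2)
first = guard("user-1")
second = guard("user-1")
print(first, second, guard.panicking)
```

This sketch leaves panic mode only through an explicit `reset()`. A production version would more likely probe the block periodically and recover on its own, and would emit the `panicking` state as a metric.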

Seasonality & Operational Intelligence

It’s also valuable to have an observability platform that recognises seasonality, and that adapts and learns how your system behaves over time. Only then can we truly begin to detect “smoke”.

Customer traffic is driven by intent, time of day, mood and several other human factors that are sometimes impossible to predict. While we’ve seen the adverse effects of these in traffic “tsunamis”, the flip side is also true: traffic can be so low, relative to business-as-usual (BAU) levels, that errors often go unnoticed.

An acknowledgement of variability is also very valuable when we’re making projections about traffic and want to see how customers are using the system. We can then use this as a basis for what-if analysis, to ultimately prepare for the onslaught of real production-grade traffic.
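
One way errors slip through at low traffic is an alert keyed to an absolute error count that was tuned for peak hours. As a hedged sketch, alerting on the share of failed requests, with a minimum sample size, catches the quiet-hours case. The function name, thresholds and traffic figures here are illustrative assumptions.

```python
def error_alert(requests, errors, max_ratio=0.05, min_requests=20):
    """Alert on the share of failed requests rather than the raw error count."""
    if requests < min_requests:
        return False  # too few samples to judge a ratio
    return errors / requests > max_ratio


# Peak hour: many errors in absolute terms, but a tiny share of traffic.
peak = error_alert(requests=100_000, errors=200)
# Quiet hour: few errors in absolute terms, but most requests are failing.
quiet = error_alert(requests=50, errors=30)
print(peak, quiet)
```

The minimum sample size is a trade-off: set it too high and a genuinely broken quiet period stays invisible, too low and a single failed request can page someone.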

Second Order Benefits

Once the baseline system for observability is in place, you can also layer in spend tracking and optimisation, whether cloud or hybrid, given that you now have a system of record.

If the pipes and flows are instrumented, then how far behind can log analysis based threat modeling be? The same flow patterns can be analysed for optimal flow control through the services for hot spot identification and performance tuning.

While perhaps not strictly under the definition of what we think of as observability, these are key measures for any team to look at as well. Perhaps, higher in the observability Maslow hierarchy though!

The Last 9

Observability is a foundational building block and can unlock much goodness. However, it’s deviously complex to get right. The founders at Last9.io, aptly named, have been amazing co-build partners in trying to make inroads on what a solid observability platform should be, one that hits most, if not all, of the building blocks.

Dealing with any system sans observability is just not a mindful way to operate it, given the value that the system can unleash. Knowing how your machine works and being mindful of when it’s roaring or purring is so key!

“The test of the machine is the satisfaction it gives you. There isn’t any other test. If the machine produces tranquility it’s right. If it disturbs you it’s wrong until either the machine or your mind is changed.”

― Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values
