Are my systems "Observable"?

Kaushik Banerjee ( He/Him/His )

SVP | Autonomous & Accountable DevOps, APAC SRE Head for Trading Tech | Execution, Empathy & Unleashing Team's Potential | I help organizations reduce toil, MTTR & MTTD while improving resiliency & reliability

Published: December 12, 2021

People often ask: we have so much (too much?) alerting and monitoring, so are my systems what they call "observable"?

Maybe, maybe not. Let's explore.

Monitoring: The gathering of surface-level data points (in legacy systems, monitoring may mostly be event-based alerting). In very simple cases, some of these isolated data points or alerts can tell you the cause of a system failure, e.g. hardware offline or database crashed.
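To make the idea of isolated data points concrete, here is a minimal sketch of threshold-based monitoring. The metric names, hosts, and limits are all hypothetical and chosen only for illustration; real monitoring tools have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    host: str
    metric: str
    value: float


# Hypothetical static limits; a real setup would load these from configuration.
THRESHOLDS = {"cpu_percent": 90.0, "disk_used_percent": 85.0}


def check(samples):
    """Return one alert string per sample that exceeds its static limit."""
    alerts = []
    for s in samples:
        limit = THRESHOLDS.get(s.metric)
        if limit is not None and s.value > limit:
            alerts.append(f"{s.host}: {s.metric}={s.value} exceeds {limit}")
    return alerts


samples = [
    Sample("db-1", "cpu_percent", 97.0),
    Sample("web-1", "cpu_percent", 35.0),
    Sample("web-1", "disk_used_percent", 88.5),
]
alerts = check(samples)
for alert in alerts:
    print(alert)
```

Each alert here is correct on its own, but nothing in this check relates the busy database host to the full disk on the web host. That missing context is what the rest of this article is about.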

Visibility: Understanding the various components of your system in isolation: visibility of your servers, of your networks, of your market data, of your distributed devices.

Observability: The ability to understand the internal state of a system from its surface-level information, i.e. from the data it emits.

It puts the above two (monitoring and visibility) together and contextualizes them by adding more layers.

It is a holistic view of the entire system and/or ecosystem. It includes (but is not limited to) logs, traces (especially in distributed systems), metrics, and machine learning.

(Image credit: OpsRamp.)

Centralized monitoring of the right data points, across all your devices and environments, is the foundation of observability.

Juxtaposing algorithmic real-time log analysis with centralized monitoring, visibility of the entire ecosystem, and tracing of distributed systems will go a long way toward making our systems observable.
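As a rough sketch of what "juxtaposing" these signals can look like, the snippet below merges log lines, trace spans, and metrics into a single timeline for one request. The record layout, the `trace_id` field, the service names, and the one-second window are all assumptions made for this example, not the format of any particular tool.

```python
# Hypothetical records from three separate sources.
logs = [
    {"trace_id": "t1", "ts": 10.2, "source": "log",
     "detail": "order-service: timeout calling pricing"},
    {"trace_id": "t2", "ts": 11.0, "source": "log",
     "detail": "order-service: ok"},
]
spans = [
    {"trace_id": "t1", "ts": 10.0, "source": "trace",
     "detail": "pricing span took 5000 ms"},
]
metrics = [
    {"ts": 9.8, "source": "metric",
     "detail": "pricing-db connection pool at 100%"},
    {"ts": 30.0, "source": "metric",
     "detail": "web-1 cpu at 40%"},
]


def correlate(trace_id, logs, spans, metrics, window=1.0):
    """Build one time-ordered story for a request.

    Logs and spans are matched by trace id; metrics carry no trace id,
    so they are matched by falling within `window` seconds of the request.
    """
    traced = [e for e in logs + spans if e["trace_id"] == trace_id]
    if not traced:
        return []
    start = min(e["ts"] for e in traced) - window
    end = max(e["ts"] for e in traced) + window
    nearby = [m for m in metrics if start <= m["ts"] <= end]
    return sorted(traced + nearby, key=lambda e: e["ts"])


story = correlate("t1", logs, spans, metrics)
for event in story:
    print(event["ts"], event["source"], event["detail"])
```

Read in order, the three lines for request `t1` suggest a saturated database pool behind the slow span and the timeout, which none of the sources shows alone.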

Applying ML to these will provide actionable insights that allow DevOps/SRE/ITOps teams to increase the stability of the systems. With a virtuous cycle of the above, and improving SLIs, your SLOs and SLAs should be achievable.
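For readers who want to see how an SLI relates to an SLO, here is a minimal sketch using a request-based availability SLI and the usual error-budget arithmetic. The request counts and the 99.9% target are made-up numbers for illustration.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    if allowed_failure == 0:
        return 1.0 if actual_failure == 0 else 0.0
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: one million requests, 500 of them failed, SLO of 99.9%.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

With these numbers about half of the budget is spent. A team can use that kind of figure to decide whether to prioritize stability work over new changes.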

To visualize this, imagine single drawings on separate sheets of tracing paper. One has a sun, one has a palm tree, one has a lake, one has boats. Each by itself is a correct data point but doesn't tell you much. Lay them on top of each other and they form a story, a complete picture: a sunset on a lake.

To take another example:

Imagine your system were a person who has communication problems (say, speaks a different language or is mute) and hence is unable to tell you if anything is wrong with her/him.

We check her/his temperature. It's a bit high. Is something wrong?

(S)he indicates that her/his left arm feels a bit tingly sometimes (like your intermittent connection errors from various systems).

So we SUSPECT something might be wrong, but we don't know how wrong, or whether it even merits any action (and if so, what action?).

But if the person could talk and properly describe everything (s)he is feeling (i.e. we had proper observability), then (s)he could have told us that there was some numbness, a bit of dizziness, the left leg and arm not responding intermittently, and haziness of vision. That would have told us that there is a high chance the person has had a stroke, and we could take emergency measures accordingly.

So a perfectly observable system is one whose complete internal state can be understood just from the data (and patterns in that data) provided by that system.

In such a system, you can tell straight away (and maybe even see it coming from a few miles away) whether a slow response is due to calls going into loops, failed servers, memory exhaustion, or even network/switch-level issues.

The right monitoring is at its core, but observability is much more than that. Plain event- and threshold-based alerting is not the complete toolkit for complex systems.
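One way to see the limits of a fixed threshold is to compare it with a check that looks at the pattern in the data. The sketch below flags points that deviate sharply from a rolling baseline. The latency values, the window size, the z-score limit, and the static limit are all assumptions for illustration, and a rolling z-score is just one simple stand-in for the ML-based analysis mentioned earlier.

```python
from statistics import mean, stdev


def deviations(values, window=5, z_limit=3.0):
    """Return indexes whose value is far from the mean of the preceding window."""
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(values[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Hypothetical latency samples in milliseconds, with one unusual spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 45, 21, 20]
STATIC_LIMIT_MS = 100  # hypothetical fixed alert threshold

static_alerts = [i for i, v in enumerate(latency_ms) if v > STATIC_LIMIT_MS]
pattern_alerts = deviations(latency_ms)
print("static threshold alerts at:", static_alerts)
print("baseline deviation alerts at:", pattern_alerts)
```

The spike to 45 ms never crosses the fixed limit, so the threshold alert stays silent, while the baseline check flags it as out of character for this system.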
