
Monitoring, APM, OpenTelemetry, Observability - modern-day requisites for uninterrupted business operations

Amit Pal

Engineering Leader@Egnyte | ERN-stack Architect | Empowering Engineers | Sharing Insights Weekly (WebWiz Newsletter)

Published: July 15, 2024

A couple of months ago, I was interviewing a few candidates, and I was struck when some of them revealed a fundamental misconception: that APM, Observability, and Monitoring are simply synonyms. Later I realized that some of my DevOps engineer friends also find it hard to tell them apart. I do not consider myself an SME, but I gained a fundamental understanding of the subject during my ~3-year stint at AppDynamics in a leadership role. That motivated me to write my 35th article on this topic.

If you find it insightful and appreciate my writing, consider following me for updates on future content. I'm committed to sharing my knowledge and contributing to the coding community. Join me in spreading the word and helping others to learn.

Prologue

The success of any business depends on a robust system architecture. That architecture could be monolithic, distributed, integrated with one or more 3rd-party systems, or a combination of all of these. Be it Monitoring, Observability, or anything else, a surveillance system must be in place to keep the system sustainable. Somewhere on the internet, I came across an excellent explanation, in layman's terms, of how Observability differs from Monitoring. I cannot remember the reference, so I cannot quote it, but I can share the gist as I understood it.

Let us personify Monitoring as a factory technician who knows the repair techniques for the typical, repetitive faults in the machines. Such reactive repairs resolve the issue but do not prevent machine downtime. This person can only address the known unknowns.

Now consider Observability as a senior technician who keeps an experienced eye on the factory's central control panel to spot the early warning signs of errors in a machine, and then proactively addresses the issue to avert a possible breakdown.

I hope that illustrates the difference!

The DIKW model

Before we dive into the technical details, we must understand why a pile of data is purposeless if we fail to draw wisdom from it. No matter what or how many datasets you collect, root-cause problems can never be addressed if the data fails to deliver any insight. The DIKW model (Data-Information-Knowledge-Wisdom) explains the matter in detail. I would recommend reading through this article first. Also, I'm adding a pictorial guide to make it easier to grasp.

The Monitoring

A monitoring system fetches a set of predetermined metrics from every layer of the engineering stack. That dataset is displayed in a dashboard for scrutiny. However, it does not provide any collective insight or wisdom. It is basically the responsibility of the IT team to analyze the received dataset and take appropriate actions to keep the system functioning. For any complex cloud-native app, it becomes a challenging job for the IT team to monitor everything and make an optimal decision. AppDynamics, Prometheus and Grafana, Datadog, and Dynatrace are a few popular tools available in the market.
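To make the "predetermined metrics" idea concrete, here is a minimal sketch of threshold-based monitoring. The `Threshold` class, the metric names, and the limits are all hypothetical and chosen only for illustration; real monitoring tools have their own rule formats.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # name of a predetermined metric (hypothetical)
    limit: float  # static limit chosen in advance by the IT team


def check_thresholds(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per watched metric whose value exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeds limit {t.limit}")
    return alerts


# Only these two metrics are watched; anything else goes unnoticed.
THRESHOLDS = [Threshold("cpu_percent", 90.0), Threshold("disk_used_percent", 85.0)]

sample = {"cpu_percent": 97.5, "disk_used_percent": 40.0, "queue_depth": 12000.0}
alerts = check_thresholds(sample, THRESHOLDS)
print(alerts)  # ['cpu_percent=97.5 exceeds limit 90.0']
```

Note that the unusually large `queue_depth` raises no alert, because nobody defined a rule for it in advance. That is the "known unknowns" limitation in miniature.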

The APM

Application Performance Monitoring (APM) is another type of tool. It focuses solely on overall user experience and application performance. It fetches data such as average response time, throughput, network traffic, error rates, predefined business KPIs and SLOs (service level objectives), and many more. That dataset is then displayed in a dashboard so that teams can dig into the root cause behind any performance issue. A few APM tools are AppDynamics, New Relic, and Dynatrace.
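As a rough illustration of the numbers an APM dashboard shows, the sketch below derives average response time, throughput, and error rate from a handful of request records and compares the error rate with an SLO. The `Request` class, the sample data, and the 5% error budget are assumptions made up for this example, not taken from any specific tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute basic APM-style metrics for one observation window."""
    total = len(requests)
    if total == 0:
        return {"avg_response_ms": 0.0, "throughput_rps": 0.0, "error_rate": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)  # count server errors only
    return {
        "avg_response_ms": sum(r.duration_ms for r in requests) / total,
        "throughput_rps": total / window_seconds,
        "error_rate": errors / total,
    }


requests = [
    Request(120.0, 200),
    Request(80.0, 200),
    Request(400.0, 500),
    Request(200.0, 200),
]
summary = summarize(requests, window_seconds=2.0)

SLO_MAX_ERROR_RATE = 0.05  # hypothetical objective: at most 5% failed requests
slo_met = summary["error_rate"] <= SLO_MAX_ERROR_RATE

print(summary)  # {'avg_response_ms': 200.0, 'throughput_rps': 2.0, 'error_rate': 0.25}
print(slo_met)  # False
```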

The Observability

Observability can be considered a superset that gives 360-degree control over a system. We often call it full-stack Observability as well. Apart from supporting monitoring, it provides thorough insight into how the various segments are integrated, and it forecasts issues through AI-driven analysis. Observability leverages telemetry data to capture the current state of the engineering stack. It involves collecting traces, logs, events, and metrics across all applications within that stack. It also makes it possible to observe any transaction and its performance metrics in granular detail from start to end, including how each layer of the stack handled that transaction. In practice, it also supports reactive work: it helps developers with debugging, profiling, dependency analysis, and tracing an issue through the whole system. AppDynamics, SigNoz, Dynatrace, Datadog, and Splunk are a few leading tools in the Observability market.
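The idea of following one transaction from start to end can be sketched with a few span records. Everything here is hypothetical (the `Span` class, the service names, the timings), and the calculation assumes that the child calls of a span do not overlap in time.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: str | None  # None marks the entry point of the transaction
    service: str
    start_ms: int
    end_ms: int


def self_time_by_service(spans: list[Span]) -> dict[str, int]:
    """Time each service spent on its own work, excluding time spent in child calls."""
    child_time: dict[str, int] = defaultdict(int)
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] += s.end_ms - s.start_ms
    result: dict[str, int] = {}
    for s in spans:
        own = (s.end_ms - s.start_ms) - child_time[s.span_id]
        result[s.service] = result.get(s.service, 0) + own
    return result


# One transaction: gateway -> checkout -> (payments, then inventory)
spans = [
    Span("a", None, "gateway", 0, 250),
    Span("b", "a", "checkout", 10, 240),
    Span("c", "b", "payments", 20, 200),
    Span("d", "b", "inventory", 205, 230),
]

breakdown = self_time_by_service(spans)
slowest = max(breakdown, key=breakdown.get)

print(breakdown)  # {'gateway': 20, 'checkout': 25, 'payments': 180, 'inventory': 25}
print(slowest)    # payments
```

The end-to-end latency is 250 ms, and the breakdown shows that most of it was spent inside the payments service, which is the kind of answer a dashboard of isolated per-service metrics cannot give directly.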

The OpenTelemetry (OTEL)

OpenTelemetry is not a platform but rather an open-source Observability framework. The project collects telemetry data, including MELT (metrics, events, logs, and traces), and translates it into a language-agnostic format. Let me explain the rudimentary concepts of the MELT model below. Please refer to the architecture screenshots as well.


If you want to explore it in depth, read this GitHub page or the official documentation.

Metrics: A metric is either a single measurement or an aggregated set of measurements collected at regular intervals. It features a timestamp, a name, one or more numeric values, and a count of the events it represents. Examples include error rate, response time, and throughput.
Events: An event is a discrete action happening at a point in time. It is used to confirm that a distinct action occurred at a particular time, and it enables exhaustive analysis in real time.
Logs: Logs matter most when engineers are in deep debugging mode, trying to understand an issue and troubleshoot code. Logs provide high-fidelity data and detailed context around an event, so engineers can recreate what happened at every step. Sometimes, however, logs are unstructured, and they must first be structured with tooling before they can be analyzed coherently. Logs also often get confused with Events. Events sit at a higher level of abstraction than the detail provided by logs: logs record everything, whereas events are records of selected things only.
Traces: Traces are chains of events. Trace data is needed to determine the relationships between different entities. Traces are essential for highlighting inefficiencies, bottlenecks, and roadblocks in a service, as they show the end-to-end latency of individual calls in a distributed architecture.
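The four MELT definitions above can be sketched as plain data shapes. This is only an illustration under my own assumptions: the class names, field names, and sample values are hypothetical and do not come from the OpenTelemetry data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    """Aggregated measurements for one interval: name, timestamp, values, count."""
    name: str
    timestamp_ms: int
    total: float
    minimum: float
    maximum: float
    count: int

    @property
    def mean(self) -> float:
        return self.total / self.count


def aggregate(name: str, interval_start_ms: int, values: list[float]) -> Metric:
    """Collapse raw measurements from one interval into a single metric."""
    return Metric(name, interval_start_ms, sum(values), min(values), max(values), len(values))


@dataclass(frozen=True)
class Event:
    """A discrete, selected action that happened at a point in time."""
    name: str
    timestamp_ms: int
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class LogRecord:
    """A detailed line of context; logs record everything, not only selected actions."""
    timestamp_ms: int
    severity: str
    message: str


def trace_latency_ms(events: list[Event]) -> int:
    """Treat a trace as a chain of events and return its end-to-end latency."""
    ordered = sorted(events, key=lambda e: e.timestamp_ms)
    return ordered[-1].timestamp_ms - ordered[0].timestamp_ms


metric = aggregate("response_time_ms", 1_000, [120.0, 80.0, 400.0])
trace = [
    Event("order_received", 1_000, {"trace_id": "t1"}),
    Event("payment_charged", 1_180, {"trace_id": "t1"}),
    Event("order_confirmed", 1_250, {"trace_id": "t1"}),
]
logs = [
    LogRecord(1_005, "DEBUG", "cart loaded with 3 items"),
    LogRecord(1_175, "INFO", "payment gateway responded"),
]

print(metric.count, metric.mean)  # 3 200.0
print(trace_latency_ms(trace))    # 250
```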

Please explore this link for more details on the MELT model. I would also recommend reading about the fascinating history of OTEL in this article.

I reckon some of the primary advantages of using OpenTelemetry are:

Reduces the effort needed to generate and manage telemetry data, because it ships with libraries and agents that auto-instrument popular libraries and frameworks with minimal changes to your codebase
Supports multiple popular programming languages such as Java, JavaScript, C++, Python, and .NET
Provides the freedom to switch to new backend analysis tools by using the relevant exporters
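The exporter advantage is easiest to see in code. The sketch below is not the OpenTelemetry API; it is a standard-library illustration of the same pattern, with hypothetical names (`Exporter`, `InMemoryExporter`, `JsonLinesExporter`, `instrumented_checkout`), showing that the instrumented code stays untouched when the backend changes.

```python
from __future__ import annotations

import io
import json
from typing import Protocol, TextIO


class Exporter(Protocol):
    """Anything that can ship telemetry records to some backend."""

    def export(self, records: list[dict]) -> None: ...


class InMemoryExporter:
    """Keeps records in a list, for example for tests."""

    def __init__(self) -> None:
        self.received: list[dict] = []

    def export(self, records: list[dict]) -> None:
        self.received.extend(records)


class JsonLinesExporter:
    """Writes one JSON document per line to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self.stream = stream

    def export(self, records: list[dict]) -> None:
        for record in records:
            self.stream.write(json.dumps(record, sort_keys=True) + "\n")


def instrumented_checkout(exporter: Exporter) -> None:
    """Application code: it only knows the Exporter interface, not the backend."""
    exporter.export([{"name": "checkout", "duration_ms": 42}])


memory = InMemoryExporter()
buffer = io.StringIO()

instrumented_checkout(memory)                     # backend 1
instrumented_checkout(JsonLinesExporter(buffer))  # backend 2, same application code

print(memory.received)          # [{'name': 'checkout', 'duration_ms': 42}]
print(buffer.getvalue(), end="")  # {"duration_ms": 42, "name": "checkout"}
```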

Although it instruments and collects data, it lacks a visualization layer. Either the engineering team develops a custom layer, or another popular tool must be integrated to render the exported OTEL dataset.

AI-enabled Observability, race with Monitoring, and APM

Nowadays, applications are getting more complex, with many abstraction layers, and they are kept distributed to reduce tight coupling across the IT infrastructure. Add to that increasing customer demand for a smooth 24x7x365 experience, the need for quick updates via modern CI/CD pipelines, and the continued evolution of The Great Cloud Migration. The resulting volume of MELT data overwhelms IT professionals. That's where Observability and AI pitch in together.

By collecting and analyzing MELT data, Observability tools empower the DevOps team to at least monitor all of this data and regain insight into what is happening in their systems. Integrated AI adds predictability by forecasting issues from historical MELT data. This is something that traditional Monitoring tools fail to do. When the time comes to look beyond Monitoring and manage this morass of next-generation digital systems, AI uses its machine-learning advantage to make a difference.
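To show what "forecasting" adds over a static threshold, here is a deliberately simple sketch: it fits a straight line to recent disk-usage samples and estimates when the line will cross a limit. A least-squares line is my own stand-in and is far simpler than the machine learning inside commercial tools; the function names, sample data, and the 90% limit are hypothetical.

```python
from __future__ import annotations


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_limit(limit: float, hours: list[float], values: list[float]) -> float | None:
    """Estimate hours from the last sample until the trend crosses the limit.

    Returns None when the trend is flat or falling, so no crossing is predicted.
    """
    slope, intercept = fit_line(hours, values)
    if slope <= 0:
        return None
    crossing_hour = (limit - intercept) / slope
    return max(0.0, crossing_hour - hours[-1])


hours = [0.0, 1.0, 2.0, 3.0, 4.0]
disk_used_percent = [60.0, 62.0, 64.0, 66.0, 68.0]

remaining = hours_until_limit(90.0, hours, disk_used_percent)
print(remaining)  # 11.0
```

A static rule such as "alert above 90%" stays silent here because the latest value is only 68%, whereas the trend suggests the limit will be reached in roughly 11 hours if nothing changes.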

Observability is also leaving APM behind, as it allows teams to quickly find critical issues in their cloud-native, microservices-based apps. Modern microservice architectures increase velocity and scale, but they also bring painful complexity and unpredictability. Legacy APM tools struggle to debug such issues because they were built to examine uncomplicated monolithic applications in predictable environments.

As AI-enabled Observability offers a single source of truth, many organizations have started adopting it to ease their business operations.

Future of Observability and impact on business operations

There is no stopping now, as we have only just embarked on the observability journey. Based on what I have researched and learned from various articles, lectures, and conversations with SMEs at APPD, I can jot down a few hypotheses on next-gen Observability opportunities.

Despite all the tools available, troubleshooting is still incredibly hard in some scenarios, so increased adoption of AI technology is inevitable. Besides forecasting, automatic mitigation of some mechanical issues could be an imminent opportunity.
Deep integration with CI/CD is a potential use case for semantic code comprehension. Once it matures, it should make it possible to understand which piece of newly merged code caused a regression or performance degradation.
The next big challenge would be to unify the separate worlds of Observability and business analytics tools. After all, both are about slicing and visually dicing data to understand it as a whole. Technical problems may affect business metrics, which may eventually cause a butterfly effect on overall business operations. That's why end-to-end unification is the future goal.

An excellent AppD blog was published on the future of Observability. I would recommend skimming through that article.

Picking the right tool

With a plethora of tools available in the market, picking the right one is essential. However, that is a broader topic for my 36th article, if I write it in the future. AppDynamics is a forward-looking tool, widely used at the enterprise level, that is investing heavily in the open-source OTEL framework. If you are interested, feel free to explore this link.


