Monitoring, APM, OpenTelemetry, Observability - modern-day requisites for uninterrupted business operations

A couple of months ago, I was interviewing a few candidates, and I was struck when some of them revealed a fundamental misconception: that APM, Observability, and Monitoring are simply synonyms. Later I realized that some of my DevOps engineer friends also struggle to tell them apart. I do not consider myself an SME, but I gained a fundamental understanding of the subject during my ~3-year stint at AppDynamics in a leadership role. That motivated me to write my 35th article on this topic.

If you find it insightful and appreciate my writing, consider following me for updates on future content. I'm committed to sharing my knowledge and contributing to the coding community. Join me in spreading the word and helping others to learn.

Prologue

The success of any business depends on a robust system architecture. A system architecture can be monolithic, distributed, integrated with one or more third-party systems, or a combination of all of these. Be it Monitoring, Observability, or anything else, a surveillance system must be in place to keep it sustainable. Somewhere on the internet, I came across an excellent explanation, in layman's terms, of how Observability differs from Monitoring. As I cannot remember the reference, I cannot quote it, but I can state the gist I grasped.

Let us personify Monitoring as a factory technician who knows the repair techniques for typical, repetitive machine faults. Such reactive repairs resolve the issue but do not prevent machine downtime. This person can only address the known unknowns.

Now consider Observability as a senior technician who keeps an experienced eye on the factory's central control panel to spot early warning signs of errors in a machine, and then proactively addresses the issue to avert possible breakdowns.

I hope it illustrates a lot!

The DIKW model

Before we dive into the technical details, we must understand why a pile of data is purposeless if we fail to draw wisdom from it. No matter what or how many datasets you collect, root-cause problems can never be addressed if the data fails to deliver any insight. The DIKW model (Data-Information-Knowledge-Wisdom) illustrates this in detail. I would recommend reading through this article first. Also, I'm adding a pictorial guide to make it easier to grasp.

The Monitoring

A monitoring system fetches predetermined metrics/data from every engineering stack of the system. That dataset is displayed in a dashboard for scrutiny. However, it does not provide any collective insight or wisdom. It is the responsibility of the IT team to analyze the received dataset and take appropriate action to keep the system functioning. For a complex cloud-native app, it becomes challenging for the IT team to monitor everything and make optimal decisions. AppDynamics, Prometheus and Grafana, Datadog, and Dynatrace are a few popular tools available in the market.
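To make the idea of predetermined metrics concrete, here is a minimal Python sketch of threshold-based monitoring. The metric names, the limits, and the `check_thresholds` helper are assumptions chosen for illustration; they do not come from any of the tools named above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A fixed upper limit for one predetermined metric (hypothetical values)."""
    metric: str
    max_value: float


def check_thresholds(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message for every configured metric above its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics without a configured threshold are never looked at:
        # monitoring only answers questions that were decided in advance.
        if value is not None and value > t.max_value:
            alerts.append(f"{t.metric}={value} exceeds {t.max_value}")
    return alerts


# Hypothetical limits an IT team might configure up front.
THRESHOLDS = [Threshold("cpu_percent", 90.0), Threshold("disk_percent", 85.0)]

if __name__ == "__main__":
    sample = {"cpu_percent": 97.5, "disk_percent": 40.0, "queue_depth": 12000.0}
    for alert in check_thresholds(sample, THRESHOLDS):
        print(alert)
```

Note that the unusually high `queue_depth` raises no alert because nobody configured a limit for it. That is the "known unknowns" limitation from the technician analogy.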

The APM

Application Performance Monitoring (APM) is another type of tool. It focuses solely on overall user experience and application performance. It fetches data such as average response time, throughput, network traffic, error rates, predefined business KPIs and SLOs (service level objectives), and more. That dataset is then displayed in a dashboard to help dig out the root cause behind any performance issue. A few APM tools are AppDynamics, New Relic, and Dynatrace.
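As a rough illustration of the figures an APM dashboard shows, the Python sketch below derives average response time, throughput, and error rate from a handful of request records and compares them with an SLO. The `Request` record shape, the `summarize` and `meets_slo` helpers, and the SLO limits are all hypothetical; real APM agents collect this data automatically.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    ok: bool


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute average response time, throughput, and error rate for one window."""
    if not requests:
        return {"avg_response_ms": 0.0, "throughput_rps": 0.0, "error_rate": 0.0}
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "avg_response_ms": sum(r.duration_ms for r in requests) / total,
        "throughput_rps": total / window_seconds,
        "error_rate": errors / total,
    }


def meets_slo(summary: dict[str, float], max_avg_ms: float, max_error_rate: float) -> bool:
    """Check the summary against two assumed service level objectives."""
    return (summary["avg_response_ms"] <= max_avg_ms
            and summary["error_rate"] <= max_error_rate)


if __name__ == "__main__":
    window = [Request(100, True), Request(200, True), Request(300, False), Request(200, True)]
    summary = summarize(window, window_seconds=2.0)
    print(summary)
    print("SLO met:", meets_slo(summary, max_avg_ms=250.0, max_error_rate=0.01))
```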

The Observability

Observability can be considered a superset that gives 360-degree control over a system. We often call it full-stack Observability as well. Apart from supporting monitoring, it provides thorough insight into how the various segments are integrated and forecasts issues through methodical, AI-driven analysis. Observability leverages telemetry data to determine the current state of the engineering stack. It involves collecting traces, logs, events, and metrics across all applications within that stack. It also makes it possible to observe any transaction and its performance metrics in granular detail from start to end, including how each stack handled that transaction. It works reactively as well, helping developers with debugging, profiling, dependency analysis, and tracing an issue through the whole system. AppDynamics, SigNoz, Dynatrace, Datadog, and Splunk are a few leading tools in the Observability market.

The OpenTelemetry (OTEL)

OpenTelemetry is not a platform but an open-source Observability framework. It collects and translates telemetry data, including MELT (metrics, events, logs, and traces), into a language-agnostic format. Let me explain the rudimentary concepts of the MELT model below. Please refer to the architecture screenshots as well.

If you want to explore it in depth, read this GitHub page or the official documentation.

  • Metrics: A metric is either a single measurement or an aggregated set of measurements accumulated at regular intervals. It features a timestamp, a name, one or more numeric values, and a count of the events it represents. Examples include error rate, response time, and throughput.
  • Events: An event is a discrete action happening at a point in time. It is used to validate that a distinct action occurred at a particular time and enables exhaustive analysis in real time.
  • Logs: Logs matter most when engineers are in deep debugging mode, trying to understand an issue and troubleshoot code. Logs provide high-fidelity data and detailed context around an event, so engineers can recreate what happened at every interval. Sometimes logs are unstructured, in which case they must be structured with tooling before they can be analyzed coherently. Logs are often confused with events: events sit at a higher level of abstraction than logs. Logs record everything, whereas events record selected things only.
  • Traces: Traces are chains of events. Trace data is needed to determine the relationships between different entities. Traces are fundamental for highlighting inefficiencies, bottlenecks, and roadblocks in a service, as they show the end-to-end latency of individual calls in a distributed architecture.
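The four MELT types described above can be sketched as plain Python records. This is only an illustration of the concepts: the field names, the `Span` shape, and the sample trace are assumptions and do not follow the OpenTelemetry data model or any vendor schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str
    timestamp_ms: int
    value: float
    count: int  # number of events the value represents


@dataclass(frozen=True)
class Event:
    name: str  # a selected, discrete action, e.g. "order_placed"
    timestamp_ms: int


@dataclass(frozen=True)
class LogRecord:
    timestamp_ms: int
    level: str
    message: str  # detailed, possibly unstructured context


@dataclass(frozen=True)
class Span:
    """One timed step of a trace; `parent` links steps into a chain."""
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None


def end_to_end_latency_ms(spans: list[Span]) -> int:
    """Latency of the whole call: earliest start to latest end."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)


def slowest_child(spans: list[Span]) -> Span:
    """The longest non-root step; assumes the trace has at least one child span."""
    children = [s for s in spans if s.parent is not None]
    return max(children, key=lambda s: s.end_ms - s.start_ms)


# Hypothetical trace of one checkout call across three services.
TRACE = [
    Span("checkout", 0, 500),
    Span("auth", 10, 60, parent="checkout"),
    Span("payment", 70, 450, parent="checkout"),
    Span("email", 455, 495, parent="checkout"),
]

if __name__ == "__main__":
    print(end_to_end_latency_ms(TRACE), "ms end to end")
    print("bottleneck:", slowest_child(TRACE).name)
```

In this sample, the trace makes it visible that most of the latency sits in the `payment` step, which is the kind of bottleneck a single metric or log line would not reveal on its own.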

Please explore this link for more details on the MELT model. I would also recommend reading about the fascinating history of OTEL in this article.

Some of the primary advantages of using OpenTelemetry, as I see them, are:

  • Reduces the overhead of generating and managing telemetry data, as it ships with libraries and agents that auto-instrument popular libraries and frameworks with minimal changes to your codebase
  • Supports multiple popular programming languages such as Java, JavaScript, C++, Python, .NET, etc.
  • Provides the freedom to switch to new backend analysis tools by using relevant exporters
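The exporter idea from the last point can be sketched in a few lines of Python. This is a conceptual illustration only: `Exporter`, `StdoutExporter`, `ListExporter`, and `Instrumentation` are hypothetical names invented for this example and are not classes from the OpenTelemetry SDK.

```python
from __future__ import annotations

from typing import Protocol


class Exporter(Protocol):
    """Anything that can ship finished telemetry records to a backend."""
    def export(self, records: list[dict]) -> None: ...


class StdoutExporter:
    """Stand-in for one backend: prints each record."""
    def export(self, records: list[dict]) -> None:
        for record in records:
            print(record)


class ListExporter:
    """Stand-in for a second backend: keeps records in memory."""
    def __init__(self) -> None:
        self.received: list[dict] = []

    def export(self, records: list[dict]) -> None:
        self.received.extend(records)


class Instrumentation:
    """Application-side code; it knows only the Exporter interface."""
    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, duration_ms: int) -> None:
        self._exporter.export([{"name": name, "duration_ms": duration_ms}])


if __name__ == "__main__":
    # Switching the backend means changing only this one line.
    app = Instrumentation(StdoutExporter())
    app.record("GET /cart", 42)
```

Because the instrumented code depends only on the exporter interface, moving to a new analysis backend does not require touching the instrumentation itself.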

Although it instruments and exports data, it lacks a visualization layer. Either the engineering team must develop a custom layer, or another popular tool must be integrated to render the exported OTEL dataset.

AI-enabled Observability, race with Monitoring, and APM

Nowadays, applications are getting more complex, with many abstraction layers, and are kept distributed to reduce tight coupling across the IT infrastructure. Add to that increasing customer demand for a smooth 24x7x365 experience, the need for quick updates via modern CI/CD pipelines, and the continued evolution of The Great Cloud Migration. The resulting volume of MELT data overwhelms IT professionals. That's where Observability and AI pitch in together.

By collecting and analyzing MELT data, Observability tools empower the DevOps team to at least monitor all this data and regain insight into what is happening in their systems. Integrated AI adds predictability by forecasting issues based on historical MELT data. This is something traditional Monitoring tools fail to do. When the time comes to look beyond Monitoring and manage this morass of next-gen digital complexity, machine learning is what makes the difference.
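To give a feel for what forecasting from past data means, here is a deliberately simple Python sketch that fits a straight line to recent metric samples and estimates when the trend will hit a limit. It is a plain least-squares stand-in, not machine learning and not how any Observability vendor implements prediction; the function names and sample values are assumptions.

```python
from __future__ import annotations

from typing import Optional


def fit_line(values: list[float]) -> tuple[float, float]:
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intervals_until_breach(values: list[float], limit: float) -> Optional[float]:
    """Sampling intervals after the last sample until the trend reaches `limit`.

    Returns None when the fitted trend is flat or falling, and 0.0 when the
    trend line has already reached the limit.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - (len(values) - 1))


if __name__ == "__main__":
    disk_percent = [50.0, 55.0, 60.0, 65.0, 70.0]  # hypothetical hourly samples
    print(intervals_until_breach(disk_percent, limit=90.0))
```

A threshold-only monitor would stay silent on these samples because 70 is still below 90, whereas even this naive trend estimate warns several intervals before the limit is reached.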

Observability is also leaving APM behind, as it allows teams to quickly find critical issues in their cloud-native, microservices-based apps. Modern microservice architectures increase velocity and scale, but they also bring painful complexity and unpredictability. Legacy APM tools struggle to debug such issues because they were built to examine uncomplicated monolithic applications in predictable environments.

As AI-enabled Observability brings the ultimate source of truth, many organizations have started adopting it to ease their business operations.

Future of Observability and impact on business operations

There is no stopping now, as we have only just embarked on the observability journey. Based on what I have researched or learned from various articles, lectures, and conversations with SMEs at AppD, I can jot down a few hypotheses on next-gen Observability opportunities.

  • Despite all the tools available, troubleshooting is still incredibly hard in some scenarios, so increased adoption of AI technology is inevitable. Besides forecasting, automatic mitigation of some mechanical issues could be an imminent opportunity.
  • Deep integration with CI/CD is a potential use case for semantic code comprehension. Once it matures, it will make it possible to understand which piece of newly merged code caused a regression or performance degradation.
  • The next big challenge would be to unify the separate worlds of Observability and business analytics tools. After all, both are about slicing, dicing, and visualizing data to understand it as a whole. Technical problems can impact business metrics and eventually cause a butterfly effect on overall business operations. That's why end-to-end unification is the future goal.

An excellent AppD blog was published on the future of Observability. I would recommend skimming through that article.

Picking the right tool

Despite the plethora of tools available in the market, picking the right one is essential. However, that is a broader topic for my 36th article, if I write it in the future. AppDynamics is a forward-looking tool, widely used at the enterprise level, and is investing heavily in the open-source OTEL framework. If you are interested, feel free to explore this link.

