Exploring the Evolution of Observability: From 1.0 to 2.0 from an SRE Perspective
Marcel Koert
Freelance (DevOps, Cloud, Site Reliability, Platform) engineer, currently working for ING. Microsoft Azure Administrator Associate, certified on 31 July 2020.
In the realm of Site Reliability Engineering (SRE), one of the most critical aspects of ensuring that systems remain available, performant, and resilient is observability. The concept of observability is rooted in control theory, where it refers to the ability to understand the internal states of a system through its external outputs. Over time, the implementation of observability within software systems has evolved dramatically. This evolution is driven by the increasing complexity of systems, as they have shifted from monolithic architectures to distributed, cloud-native microservices.
Observability has advanced through distinct phases, which we can categorize as Observability 1.0 and Observability 2.0. While both approaches aim to provide insight into the behavior of complex systems, their methodologies, tools, and objectives differ. This essay will explore the core differences between these two phases of observability, explaining key terminology and presenting examples and case studies to illustrate how these shifts have impacted the work of SREs.
Understanding Observability
Before delving into the differences between Observability 1.0 and 2.0, it’s essential to understand what observability means in the context of software systems. Observability is the practice of collecting, analyzing, and correlating data from a system’s outputs to infer its internal state. In the SRE world, observability is critical for identifying issues, diagnosing problems, and ensuring that a system meets its Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
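To make the SLO connection concrete, a quick back-of-the-envelope error-budget calculation shows what a target like 99.9% availability actually permits. The target and window below are illustrative values, not figures from this article.

```python
# Illustrative error-budget arithmetic for an availability SLO.
# The 99.9% target and 30-day window are example values only.
SLO_TARGET = 0.999             # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window = 43,200 minutes

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget over 30 days: {error_budget_minutes:.1f} minutes of downtime")
# -> roughly 43.2 minutes before the objective (and possibly an SLA) is breached
```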
Traditional observability is built on three key types of telemetry data, often referred to as the three pillars of observability:
1. Metrics: Quantitative data that measures the performance of system components over time (e.g., CPU utilization, memory consumption, request latency).
2. Logs: Textual records of events that capture detailed information about specific operations or behaviors within the system (e.g., errors, transactions, requests).
3. Traces: Distributed traces capture the flow of a single transaction or request as it traverses multiple services in a microservices architecture, helping to identify bottlenecks or failures across services.
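As a neutral reference point for what each pillar looks like at the code level, the sketch below emits one metric, one log line, and one trace span using the OpenTelemetry Python API. OpenTelemetry is not discussed in this article, and the names and attributes are purely illustrative.

```python
# Minimal sketch: emitting one metric, one log line, and one trace span.
# Without an SDK/exporter configured, the OpenTelemetry calls are no-ops,
# so this only illustrates the shape of each signal type.
import logging
from opentelemetry import metrics, trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")   # logs (pillar 2)
tracer = trace.get_tracer("checkout")    # traces (pillar 3)
meter = metrics.get_meter("checkout")    # metrics (pillar 1)

request_counter = meter.create_counter(
    "checkout.requests", unit="1", description="Checkout requests handled"
)

def handle_checkout(cart_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)     # trace context for this request
        request_counter.add(1, {"status": "ok"})   # one time-series data point
        logger.info("checkout completed for cart %s", cart_id)  # one event record

handle_checkout("cart-42")
```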
While these three data types are the foundation of observability, the way they are collected, processed, and correlated has changed significantly over time.
Observability 1.0: The Early Phase
Observability 1.0 represents the early approach to monitoring and observing systems. It evolved during a time when systems were typically built as monolithic applications or relatively simple distributed systems. In this phase, the three pillars of observability—metrics, logs, and traces—were treated as separate, siloed components, with little or no integration between them. SREs and engineers often had to rely on manual efforts to correlate data across different tools.
Key Characteristics of Observability 1.0
1. Siloed Tools and Data Sources:
In Observability 1.0, metrics, logs, and traces were typically collected by different tools. For instance, an organization might use Prometheus for metrics, Elasticsearch for logs, and Jaeger for traces. These tools often lacked integration, meaning that engineers had to manually correlate data between these systems during incidents or outages. This approach slowed down incident resolution and made it difficult to gain a unified view of the system.
2. Threshold-Based Alerting:
The primary mode of alerting in Observability 1.0 was based on static thresholds. For example, an alert might be triggered if CPU usage exceeded 80% or if the response time for a specific API call went beyond 500 milliseconds. While effective for detecting known issues, threshold-based monitoring struggled with novel or emergent problems, particularly in dynamic environments like cloud-native systems (see the sketch after this list for what such a static check can look like in practice).
3. Reactive Troubleshooting:
Observability 1.0 was inherently reactive. Alerts were triggered only after something went wrong, forcing SREs to manually investigate logs, metrics, and traces to find the root cause. This process was often time-consuming, and diagnosing the issue required expert knowledge of each tool, as well as the ability to correlate data across them.
4. Limited Scalability:
Observability 1.0 tools were not designed to handle the vast amounts of telemetry data generated by modern distributed systems. As organizations adopted microservices, containerization, and cloud infrastructure, the volume of data produced by these systems skyrocketed. This resulted in performance bottlenecks and increased costs when using traditional observability tools.
5. Fragmented and Disconnected Views:
Since Observability 1.0 tools were often disconnected from one another, SREs lacked a holistic view of system health. They might be able to view metrics for a specific component or examine logs for an individual service, but piecing together the entire system’s behavior required significant manual effort.
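To make characteristic 2 concrete, here is a minimal sketch of a hand-rolled static-threshold check against Prometheus. The server address, PromQL query, and 80% threshold are assumptions for illustration, not a recommended configuration.

```python
# Minimal sketch of Observability 1.0-style static-threshold alerting.
# Assumes a Prometheus server at PROM_URL; query and threshold are illustrative.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
CPU_QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
CPU_THRESHOLD = 80.0  # static threshold: alert above 80% CPU

def check_cpu() -> None:
    resp = requests.get(PROM_URL, params={"query": CPU_QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return
    cpu_pct = float(result[0]["value"][1])  # value is [timestamp, "number"]
    if cpu_pct > CPU_THRESHOLD:
        # In a real 1.0 setup this would page via Alertmanager, email, etc.
        print(f"ALERT: CPU usage {cpu_pct:.1f}% exceeds {CPU_THRESHOLD}%")

if __name__ == "__main__":
    check_cpu()
```

Every check of this kind encodes one known failure mode; anything the team did not anticipate stays invisible until users complain.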
Case Study: Observability 1.0 in a Monolithic E-Commerce System
Consider an e-commerce platform built using a monolithic architecture. In this system, SREs monitor metrics such as request rates, response times, and error counts using Prometheus. When an issue arises, such as slow checkout times, an alert is triggered based on predefined thresholds for response times.
An SRE then examines logs in Elasticsearch to check for errors or exceptions that might explain the latency issue. If the logs don’t provide sufficient information, the SRE may turn to Jaeger or Zipkin to trace the flow of a customer’s checkout request through the system.
This manual process of switching between tools to investigate the issue is time-consuming, and there is no guarantee that the root cause will be immediately apparent. The siloed nature of the tools makes it difficult to correlate data, increasing the Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR).
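The friction described above is easier to see in code. The sketch below shows one slice of that manual workflow: given the time of a latency alert, separately querying Elasticsearch for error logs from the same service and window. The index name and field names are hypothetical, and nothing in this step links the logs back to the Prometheus alert or the Jaeger traces automatically.

```python
# Rough sketch of manual Observability 1.0 correlation: given an alert's time
# window, pull error logs from Elasticsearch for the same service and period.
# Index and field names ("logs-ecommerce", "service", "level") are hypothetical.
from datetime import datetime, timedelta, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local Elasticsearch

def errors_around_alert(alert_time: datetime, service: str, minutes: int = 10):
    window = {
        "gte": (alert_time - timedelta(minutes=minutes)).isoformat(),
        "lte": (alert_time + timedelta(minutes=minutes)).isoformat(),
    }
    resp = es.search(
        index="logs-ecommerce",
        query={
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": window}},
                ]
            }
        },
        size=50,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# The SRE still has to eyeball these logs against Prometheus graphs and
# Jaeger traces by hand; nothing links the three data sources automatically.
print(errors_around_alert(datetime.now(timezone.utc), "checkout"))
```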
Observability 2.0: A Modern Approach
As software systems have grown more complex, the limitations of Observability 1.0 have become increasingly apparent. In response, the observability landscape has evolved into what is now known as Observability 2.0. This new approach to observability is designed for the challenges posed by distributed, cloud-native architectures, where systems are composed of hundreds or even thousands of microservices. Observability 2.0 emphasizes integration, automation, and real-time insights, allowing SREs to be more proactive and efficient in maintaining system reliability.
Key Characteristics of Observability 2.0
1. Unified Telemetry Data Platform:
One of the defining characteristics of Observability 2.0 is the integration of metrics, logs, and traces into a single platform. Tools like Datadog, Honeycomb, and New Relic collect and store all telemetry data in one place, providing SREs with a unified view of system behavior. This eliminates the need for manual correlation and makes it easier to diagnose issues quickly.
2. Context-Enriched Insights:
Observability 2.0 tools don’t just collect raw telemetry data—they also enrich it with context. This might include information about recent deployments, configuration changes, or user behavior. For example, if an SRE is investigating a spike in error rates, they can quickly see that a new version of the checkout service was deployed shortly before the issue occurred. This contextual information helps SREs identify the root cause faster.
3. AI-Driven Anomaly Detection:
One of the most significant advancements in Observability 2.0 is the use of machine learning and artificial intelligence to detect anomalies and predict potential failures. Rather than relying solely on static thresholds, AI-driven observability tools can automatically identify patterns or deviations from normal behavior, even in highly dynamic environments. This allows SREs to detect issues before they escalate into major incidents (a toy illustration of the underlying idea appears after this list).
4. High-Resolution, Real-Time Telemetry:
Observability 2.0 emphasizes real-time monitoring with high granularity. Rather than relying on aggregated or sampled data, SREs can access detailed telemetry data for individual transactions. This high-resolution data allows for more precise troubleshooting and better understanding of system behavior at a granular level.
5. Proactive and Predictive Monitoring:
Observability 2.0 shifts from a reactive model to a proactive one. By leveraging machine learning and advanced analytics, these platforms can predict issues before they impact users. For example, a predictive alert might warn an SRE that disk space on a key service will run out in the next 24 hours, allowing them to address the issue before it causes downtime.
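Vendor implementations of anomaly detection are proprietary, but the underlying idea of flagging deviations from a learned baseline rather than a fixed threshold can be shown with a toy rolling z-score detector. This is a deliberate simplification, not how any particular product works.

```python
# Toy anomaly detector: flag latency samples that deviate strongly from the
# recent rolling baseline. Real Observability 2.0 platforms use far more
# sophisticated models; this only illustrates the concept.
from collections import deque
from statistics import mean, stdev

class RollingZScoreDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # recent latency samples (ms)
        self.threshold = threshold           # std-devs that count as anomalous

    def observe(self, latency_ms: float) -> bool:
        anomalous = False
        if len(self.samples) >= 10:          # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(latency_ms - mu) / sigma > self.threshold:
                anomalous = True
        self.samples.append(latency_ms)
        return anomalous

detector = RollingZScoreDetector()
stream = [50, 52, 49, 51, 48, 50, 53, 49, 47, 52, 51, 50, 400]  # sudden spike
for latency in stream:
    if detector.observe(latency):
        print(f"Anomaly: {latency} ms latency deviates from the recent baseline")
```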
Case Study: Observability 2.0 in a Microservices-Based Streaming Platform
Imagine a modern video streaming platform built using microservices and hosted on Kubernetes. This platform generates vast amounts of telemetry data, including metrics for each container, logs for each service, and traces for every request.
In an Observability 2.0 setup, a platform like Datadog or Honeycomb collects all of this data in real-time and provides SREs with a unified view of system health. If a customer experiences buffering while streaming a video, an SRE can instantly view traces, logs, and metrics for the affected request. They can see, for example, that a recent deployment of the video encoding service introduced a memory leak, causing latency to spike.
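One common mechanism behind that kind of unified view is propagating a shared trace ID into every log line, so the backend can pivot from a slow trace straight to its logs. The sketch below illustrates the idea with the OpenTelemetry Python API; the service name and log fields are illustrative, and exporter configuration is omitted.

```python
# Sketch of trace/log correlation: stamp each log record with the active
# trace ID so a backend can pivot from a slow trace to its log lines.
# Service and field names are illustrative; exporter setup is omitted.
import json
import logging
from opentelemetry import trace

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("video-encoder")
tracer = trace.get_tracer("video-encoder")

def log_with_trace(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "message": message,
        "trace_id": format(ctx.trace_id, "032x"),  # same ID the trace backend shows
        "span_id": format(ctx.span_id, "016x"),
    }))

with tracer.start_as_current_span("encode-segment"):
    log_with_trace("encoding started")
    log_with_trace("memory usage unusually high")  # the clue tied to this trace
```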
Additionally, AI-driven anomaly detection might identify a pattern of increasing latency in the video delivery service, even before customers start reporting issues. This proactive monitoring allows the SRE team to roll back the deployment and fix the issue before it impacts more users.
Streaming providers operating at the scale of Netflix apply similar predictive techniques to identify potential failures before they reach viewers.
Conclusion
The transition from Observability 1.0 to Observability 2.0 represents a major shift in how SREs monitor, diagnose, and maintain modern systems. While Observability 1.0 was sufficient for simpler, monolithic architectures, the demands of cloud-native, distributed systems necessitated a new approach—one that integrates data, automates analysis, and scales to meet the needs of dynamic, complex environments. Observability 2.0 empowers SREs with real-time, context-rich insights, enabling them to be proactive in identifying and resolving issues, ultimately improving system reliability and performance.