History of OpenTelemetry
Marcel Koert
Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT
OpenTelemetry (OTEL) is one of the most significant projects in modern observability, offering a set of APIs, libraries, agents, and instrumentation that provide a common foundation for collecting and analyzing telemetry data (metrics, logs, and traces). This data helps developers and operators understand how their software is behaving in real-world environments, often complex distributed systems like microservices architectures. To fully appreciate OTEL, it's essential to dive into its history, starting from its roots in earlier telemetry projects, its key milestones, and its impact on the cloud-native ecosystem.
There is also a video version of this post: https://youtu.be/5vUgFSIoerk
Early Days of Observability: Logs and Metrics
Before we had sophisticated tools like OpenTelemetry, developers and operators used basic techniques to monitor their systems. Two key tools were logs and metrics:
1. Logs: A traditional approach to debugging and monitoring. Developers inserted print or logging statements in code to capture events at various levels (e.g., error, warning, info). While useful, logs often became hard to manage and were fragmented across systems in distributed architectures.
2. Metrics: Metrics were numerical data points captured at regular intervals. For example, CPU usage, memory consumption, or request latency are typical metrics. Tools like Prometheus, Graphite, and StatsD allowed teams to gather metrics and visualize them over time. These tools gave high-level insights into the health of a system.
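To make the two techniques concrete, here is a minimal sketch using only the Python standard library. The service name, the `handle_request` function, and the in-memory metric store are all hypothetical; a real setup would ship these numbers to a tool such as Prometheus or StatsD rather than keep them in a list.

```python
import logging
import time
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

# Hypothetical in-memory metric store (a stand-in for a metrics client).
request_count = defaultdict(int)  # counter, keyed by HTTP status
latencies_ms = []                 # one latency sample per request


def handle_request(path):
    """Handle a fake request, emitting log lines and updating metrics."""
    start = time.perf_counter()
    log.info("handling %s", path)          # log: a discrete event
    if path == "/broken":
        log.error("no handler for %s", path)
        status = 500
    else:
        status = 200
    request_count[status] += 1             # metric: a number aggregated over time
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return status


handle_request("/cart")
handle_request("/broken")
```

The log lines say what happened to one request, while the counters only say how often something happened. Neither says which other services that request touched.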
However, as systems grew in complexity, especially with the rise of cloud computing and microservices, logs and metrics alone were insufficient. A more detailed and holistic way to observe systems became necessary.
The Rise of Tracing: A New Approach to Observability
As microservices became the standard for designing scalable applications, the need to understand distributed systems became paramount. Traditional tools failed to provide a cohesive view of what was happening across multiple services. Tracing emerged as a solution to this problem.
Distributed Tracing provided insight into the lifecycle of requests as they traversed different services. By generating unique trace IDs for each request, developers could track how requests moved through different services, understand performance bottlenecks, and identify where errors occurred.
Google Dapper (2010): One of the seminal works in this space was Google’s Dapper, which pioneered distributed tracing. Dapper was initially designed for Google’s complex, large-scale distributed systems. It introduced concepts like trace IDs, spans (the units of work within a trace), and the idea of tracing individual requests as they traveled across multiple services. While Dapper was Google-internal, its concepts laid the groundwork for open-source tracing systems.
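The trace and span concepts can be sketched in a few lines of Python. This is an illustration of the idea only, not Dapper's actual data model: the `Span` fields, the `start_span` and `finish` helpers, and the ID lengths are assumptions chosen for the example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    trace_id: str              # shared by every span of the same request
    span_id: str               # unique to this unit of work
    parent_id: Optional[str]   # None for the root span
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None


recorded: List[Span] = []


def start_span(name, parent=None):
    """Start a span, reusing the parent's trace ID when there is one."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    span = Span(trace_id, secrets.token_hex(8), parent_id, name)
    recorded.append(span)
    return span


def finish(span):
    span.end = time.time()


# One request passing through two (pretend) services.
root = start_span("GET /checkout")
child = start_span("charge-card", parent=root)
finish(child)
finish(root)
```

Because both spans carry the same trace ID and the child points at its parent, a backend can reassemble them into a tree and show where the request spent its time.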
OpenTracing (2016): Inspired by Dapper, the OpenTracing project emerged to create a standard API for distributed tracing. OpenTracing provided a set of APIs that developers could use to instrument their applications for tracing in a vendor-neutral way. By using OpenTracing, organizations could switch tracing backends (e.g., Jaeger, Zipkin) without changing the instrumentation in their applications.
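The vendor-neutral idea is easiest to see as an interface with interchangeable implementations. The sketch below is not the OpenTracing API; `Tracer`, `BackendA`, `BackendB`, and `place_order` are hypothetical names used only to show the pattern.

```python
from typing import List, Protocol, Tuple


class Tracer(Protocol):
    """The only thing application code depends on (hypothetical interface)."""
    def record(self, operation: str, duration_ms: float) -> None: ...


class BackendA:
    """Stand-in for one tracing backend."""
    def __init__(self) -> None:
        self.sent: List[Tuple[str, str, float]] = []

    def record(self, operation: str, duration_ms: float) -> None:
        self.sent.append(("A", operation, duration_ms))


class BackendB:
    """Stand-in for a different tracing backend."""
    def __init__(self) -> None:
        self.sent: List[Tuple[str, str, float]] = []

    def record(self, operation: str, duration_ms: float) -> None:
        self.sent.append(("B", operation, duration_ms))


def place_order(tracer: Tracer) -> str:
    """Instrumented application code: it never names a concrete backend."""
    tracer.record("place_order", 12.5)
    return "ok"


# Switching backends changes the wiring, not the instrumented function.
a, b = BackendA(), BackendB()
place_order(a)
place_order(b)
```

The instrumented function stays untouched when the backend changes, which is the property that made a shared API attractive.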
While OpenTracing was a significant step forward, it only focused on traces. As observability needs grew, it became clear that traces alone weren’t sufficient. Observability is made up of three pillars: metrics, logs, and traces. A holistic solution was needed to integrate all three.
Metrics Standardization: OpenCensus
Around the same time as OpenTracing, another Google-driven project called OpenCensus was gaining traction. OpenCensus started as an internal project at Google but was later open-sourced to the broader community. While OpenTracing focused on tracing, OpenCensus aimed to provide a more comprehensive solution by supporting both metrics and tracing in a unified API.
OpenCensus provided libraries for instrumenting applications with both tracing and metrics, offering exporters to backends like Prometheus, Zipkin, and others. It aimed to be a complete observability solution, but its scope was still limited: it covered metrics and traces, not logs.
The Inception of OpenTelemetry (2019)
By 2018-2019, it became clear to the industry that two projects (OpenTracing and OpenCensus) were serving similar purposes but with slightly different approaches and goals. While OpenTracing offered a standard for tracing, OpenCensus had a broader scope, aiming to unify both tracing and metrics. Many companies and contributors in the open-source community felt that having two competing standards caused fragmentation.
To address this, the maintainers of both projects decided to merge their efforts, leading to the creation of OpenTelemetry (OTEL). The merging of OpenTracing and OpenCensus into OpenTelemetry was officially announced in May 2019. The key goals of OTEL were:
1. Unified Observability: OpenTelemetry would provide a single set of APIs and libraries for collecting metrics, traces, and (eventually) logs. This would eliminate the need for separate instrumentation for each observability signal.
2. Vendor Neutrality: Similar to OpenTracing, OTEL would be vendor-neutral, meaning that developers could use the same instrumentation regardless of their choice of backend (e.g., Prometheus, Jaeger, or commercial platforms like Datadog or New Relic).
3. Ease of Use: OpenTelemetry aimed to make instrumentation simpler for developers. By providing auto-instrumentation, comprehensive libraries, and SDKs in multiple languages, OTEL would minimize the friction in adopting observability.
4. Wide Ecosystem Support: From the beginning, OpenTelemetry received support from a wide range of industry players, including major cloud providers (Google, Microsoft, Amazon), observability vendors (Datadog, Dynatrace, Splunk), and open-source projects (Prometheus, Jaeger).
The Evolution of OpenTelemetry
The development of OpenTelemetry has gone through several key phases since its inception:
Phase 1: Building the Core (2019-2020)
In its early stages, OpenTelemetry focused on developing the core APIs and SDKs necessary for supporting both traces and metrics. This phase was primarily focused on laying the groundwork for OTEL’s architecture, which included:
- APIs: These provided a standard interface that developers could use to instrument their applications.
- SDKs: These implemented the API and handled the heavy lifting of managing telemetry data and exporting it to backends.
- Auto-Instrumentation: For many common libraries and frameworks, OpenTelemetry offered out-of-the-box instrumentation that automatically collected traces and metrics without requiring developers to manually add instrumentation code.
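The split between an API and an SDK can be illustrated with a small sketch. It is a simplified model of the pattern, not the OpenTelemetry API itself: `NoOpTracer`, `RecordingTracer`, `get_tracer`, and `set_tracer` are hypothetical names. The assumption shown is that instrumented code calls the API only, and does nothing useful until an SDK-style implementation is installed.

```python
from contextlib import contextmanager


class NoOpTracer:
    """API-only default: instrumentation is safe to call but records nothing."""
    @contextmanager
    def span(self, name):
        yield


class RecordingTracer:
    """SDK-style implementation: hands finished span names to an exporter."""
    def __init__(self, exporter):
        self._exporter = exporter  # any callable that accepts a span name

    @contextmanager
    def span(self, name):
        try:
            yield
        finally:
            self._exporter(name)


_tracer = NoOpTracer()


def get_tracer():
    return _tracer


def set_tracer(tracer):
    """Called once by the application to plug in an implementation."""
    global _tracer
    _tracer = tracer


def lookup_user(user_id):
    """Library code: depends on the API only, never on a concrete tracer."""
    with get_tracer().span("lookup_user"):
        return {"id": user_id}


exported = []
lookup_user(1)                                 # no SDK installed: nothing recorded
set_tracer(RecordingTracer(exported.append))   # application opts in
lookup_user(2)                                 # now one span name is exported
```

A design like this lets library authors add instrumentation without forcing a telemetry pipeline on every user of the library.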
By early 2021, the OpenTelemetry project had released v1.0 of its tracing specification, marking a major milestone. This meant that the tracing APIs were stable, and the community could confidently adopt OpenTelemetry for tracing in production.
Phase 2: Expanding to Metrics and Logs (2021-2022)
While OpenTelemetry’s tracing capabilities were the first to reach stability, metrics were the next focus. Metrics were more challenging because of the wide variety of existing standards and backends, such as Prometheus. The community wanted to ensure that OpenTelemetry could seamlessly integrate with existing metrics systems while offering new capabilities.
By 2022, OpenTelemetry released a stable metrics specification and APIs, marking another major milestone. This allowed developers to instrument their applications for both traces and metrics using a unified API.
Logs, the third pillar of observability, were also a focus during this period. OpenTelemetry's approach to logs involved providing a bridge between structured logs and traces. While logging support lagged behind metrics and traces, the project aimed to create a seamless experience where logs, metrics, and traces could be correlated for deeper insights.
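One way to picture log and trace correlation is to stamp every log line with the ID of the trace that is active when the line is written. The sketch below uses only the Python standard library and is not OpenTelemetry's logging bridge; the `current_trace_id` variable, the `TraceIdFilter` class, and the `handle_payment` function are hypothetical.

```python
import contextvars
import io
import logging
import secrets

# Holds the trace ID of the request currently being handled ("-" if none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record that passes through."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # captured in memory so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payments")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_payment(amount):
    """Pretend request handler: sets a trace ID, logs, then clears it."""
    trace_id = secrets.token_hex(16)
    token = current_trace_id.set(trace_id)
    try:
        logger.info("charging %s", amount)
    finally:
        current_trace_id.reset(token)
    return trace_id


tid = handle_payment(42)
```

With the trace ID in each line, a log search for that ID returns exactly the messages belonging to one request, which can then be read next to its trace.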
Phase 3: Maturing and Widespread Adoption (2022-Present)
As OpenTelemetry matured, it gained widespread adoption across the industry. Major cloud providers like Google Cloud, AWS, and Microsoft Azure integrated OpenTelemetry support into their observability offerings. Observability vendors like Datadog, Splunk, and New Relic embraced OTEL as a core part of their data collection strategies.
Some key developments during this period included:
1. Native Support in Cloud Providers: OpenTelemetry became a first-class citizen in major cloud platforms. AWS integrated OpenTelemetry into its monitoring solutions, and Google Cloud’s Cloud Trace and Cloud Monitoring services provided native OTEL support.
2. Widespread Use in CNCF Ecosystem: OpenTelemetry became one of the most active projects in the Cloud Native Computing Foundation (CNCF), which accepted it as an incubating project in August 2021. Its integration with Kubernetes and Prometheus made it a standard choice for observing cloud-native applications.
3. Logs as a Native Citizen: In recent times, more attention has been given to logs within OTEL. While logs were slower to reach maturity compared to traces and metrics, significant progress was made toward providing structured logging capabilities that could integrate with existing log management systems (e.g., ELK Stack, Fluentd).
4. Standardization and Governance: The OpenTelemetry community continued to focus on evolving the specification and ensuring that new features aligned with the needs of developers and operators. The governance model of OTEL, which involved collaboration between many companies and individuals, ensured that it remained vendor-neutral and community-driven.
OpenTelemetry's Architecture and Key Components
To understand the modern impact of OTEL, it's important to highlight its architecture and key components:
1. API and SDKs: OpenTelemetry offers language-specific APIs and SDKs (e.g., for Java, Python, Go, and C++) that developers can use to instrument their applications for traces, metrics, and logs.
2. Auto-Instrumentation: OpenTelemetry provides libraries that automatically instrument applications by detecting common libraries and frameworks (e.g., HTTP clients and servers, database drivers) and capturing telemetry data without requiring manual code changes.
3. Collectors: The OpenTelemetry Collector is a key component that allows for receiving, processing, and exporting telemetry data. The Collector can be run as an agent on a host or as a centralized service in a cluster.
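The receive, process, export flow of the Collector can be modeled as a small pipeline. This is a conceptual sketch in plain Python, not the Collector's real configuration or code: the `Pipeline` class, the two processors, the record fields, and the host name "node-1" are all made up for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Processor = Callable[[List[Record]], List[Record]]
Exporter = Callable[[List[Record]], None]


def drop_debug(batch: List[Record]) -> List[Record]:
    """Processor: filter out noisy records before they leave the host."""
    return [r for r in batch if r.get("level") != "debug"]


def add_host(batch: List[Record]) -> List[Record]:
    """Processor: enrich each record with a (hypothetical) host name."""
    return [{**r, "host": "node-1"} for r in batch]


class Pipeline:
    """Receives a batch, runs processors in order, fans out to exporters."""
    def __init__(self, processors: List[Processor], exporters: List[Exporter]):
        self.processors = processors
        self.exporters = exporters

    def receive(self, batch: List[Record]) -> None:
        for process in self.processors:
            batch = process(batch)
        for export in self.exporters:
            export(batch)


# Two in-memory lists stand in for two different backends.
backend_a: List[Record] = []
backend_b: List[Record] = []

pipeline = Pipeline([drop_debug, add_host], [backend_a.extend, backend_b.extend])
pipeline.receive([
    {"name": "login", "level": "info"},
    {"name": "tick", "level": "debug"},
])
```

Keeping filtering, enrichment, and fan-out in one place means applications only need to know how to reach the Collector, not every backend behind it.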