Migrating to OpenTelemetry
Marcel Koert
Freelance (DevOps, Cloud, Site Reliability, Platform) engineer, currently working for ING. Microsoft Azure Administrator Associate, certified 31 July 2020.
This can also be found as a video on YouTube: https://youtu.be/Gs9FXEUEMZM
Migrating to OpenTelemetry (OTel) from a traditional pull-based monitoring system can be a transformative process for organizations aiming to adopt more modern, scalable, and unified observability solutions. In this guide, we’ll discuss the strategies, steps, and challenges involved in migrating from a pull-based system, which typically relies on tools like Prometheus, Nagios, or Zabbix, to OpenTelemetry, a primarily push-based framework that unifies the collection of traces, metrics, and logs.
This comprehensive discussion will cover why you might want to migrate, how OpenTelemetry differs from pull-based systems, and the key considerations in ensuring a successful migration. We will also walk through the technical steps required to transition from your existing monitoring and observability system to OpenTelemetry.
1. Understanding the Difference: Pull-Based vs. Push-Based Systems
Before diving into the migration process, it’s important to understand the fundamental differences between pull-based and push-based observability models. This context will help you plan and adjust your current monitoring workflows as you transition to OpenTelemetry.
* Pull-Based Systems
In pull-based systems, a central monitoring service periodically pulls metrics from various services or endpoints. These systems often rely on scraping HTTP endpoints to retrieve metrics at regular intervals. Some key characteristics include:
- Centralized polling: A central monitoring server (like Prometheus) pulls data from various endpoints exposed by applications or infrastructure.
- Intermittent collection: Metrics are collected at fixed intervals (e.g., every 15 seconds, every minute), meaning visibility can be limited by the polling frequency.
- Focus on metrics: Pull-based systems are predominantly metrics-centric, collecting data like CPU usage, memory consumption, and request latencies.
* Push-Based Systems (OpenTelemetry)
In push-based systems, services push telemetry data to a centralized collector or backend. OpenTelemetry is built around this model and supports pushing metrics, traces, and logs. Some of its characteristics include:
- Service-driven telemetry: Instead of relying on a central server, services send telemetry data to an OpenTelemetry Collector.
- Near real-time telemetry: Data is sent shortly after it's generated (SDKs typically batch and export on a short interval), so visibility no longer depends on when a central server next polls the service.
- Unified observability: OpenTelemetry collects not only metrics but also traces and logs, offering a more complete view of the system.
Migrating from a pull-based to a push-based system changes the way telemetry is collected, processed, and managed. You will move from scraping metrics to having your services push telemetry data to a centralized system.
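To make the difference concrete, here is a minimal simulation of both models. It is only a sketch: the class names, the metric name, and the integer timestamps are assumptions made for illustration, and nothing here uses Prometheus or OpenTelemetry APIs.

```python
from dataclasses import dataclass, field


@dataclass
class PulledService:
    """Pull model: the service only exposes its current value."""
    requests_total: int = 0

    def handle_request(self) -> None:
        self.requests_total += 1

    def scrape(self) -> int:
        # Stand-in for an HTTP metrics endpoint read by a central scraper.
        return self.requests_total


@dataclass
class Collector:
    """Hypothetical in-memory stand-in for a telemetry collector."""
    received: list = field(default_factory=list)

    def ingest(self, name: str, value: int, timestamp: int) -> None:
        self.received.append((timestamp, name, value))


@dataclass
class PushingService:
    """Push model: the service sends each measurement itself."""
    collector: Collector
    requests_total: int = 0

    def handle_request(self, timestamp: int) -> None:
        self.requests_total += 1
        self.collector.ingest("requests_total", self.requests_total, timestamp)


def simulate(request_times, scrape_times):
    """Replay the same traffic against both models and return what each observed."""
    pulled = PulledService()
    collector = Collector()
    pushing = PushingService(collector)
    scraped = []
    for t in range(max(request_times + scrape_times) + 1):
        if t in request_times:
            pulled.handle_request()
            pushing.handle_request(t)
        if t in scrape_times:
            scraped.append((t, pulled.scrape()))
    return scraped, collector.received


scraped, pushed = simulate(request_times=[1, 2, 7], scrape_times=[0, 5, 10])
print("pull samples:", scraped)
print("push samples:", [(t, v) for t, _, v in pushed])
```

The pull side only knows the counter's value at each scrape, while the push side has a data point at the moment each request happened.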
2. Why Migrate to OpenTelemetry?
Migrating to OpenTelemetry offers several advantages over traditional pull-based systems. Understanding these benefits will help you justify the effort and investment required for migration.
* Unified Observability Across Signals
While pull-based systems focus mainly on metrics, OpenTelemetry provides a unified framework for traces, metrics, and logs. This unified approach allows for better correlation between different telemetry types, enabling more powerful root cause analysis and troubleshooting.
For example, with OpenTelemetry, you can link traces (representing a request’s journey across services) with relevant metrics (such as CPU usage or response times) and logs (which provide detailed event information), all within the same observability ecosystem.
* Real-Time Monitoring
OpenTelemetry offers real-time telemetry, as data is pushed immediately after it is generated. In contrast, pull-based systems collect data only during scheduled scrapes, which can lead to delayed visibility and potential gaps during incidents. With OpenTelemetry, you can detect and respond to issues in real time, minimizing downtime and improving overall system reliability.
* Scalability and Flexibility
As systems grow more distributed and complex, traditional pull-based monitoring systems may struggle with scaling due to the increasing number of endpoints to scrape. OpenTelemetry’s architecture is more scalable for modern cloud-native applications, where each service independently sends its telemetry data. This is especially important in environments with microservices, serverless architectures, or edge computing.
OpenTelemetry also offers the flexibility to work with multiple backends. You can collect data in one format and send it to various analytics tools or observability platforms like Prometheus, Grafana, Datadog, or even custom-built systems, making it adaptable to different monitoring needs.
3. Planning the Migration: Key Considerations
A successful migration requires careful planning to avoid disruptions in your monitoring capabilities. The key considerations below will help guide your planning process.
1. Inventory of Monitored Services
The first step is to take an inventory of all the services currently being monitored by your pull-based system. For each service, identify the following:
- Telemetry being collected: Are you collecting metrics, logs, traces, or all three? If you’re currently only monitoring metrics, consider how OpenTelemetry will enhance your observability by adding tracing and logging capabilities.
- Endpoints being scraped: Identify the endpoints that your pull-based system scrapes for metrics, as these will need to be replaced by OpenTelemetry instrumentation.
- Critical services: Prioritize services that are most critical to your business. You might want to migrate these services first to ensure their telemetry is robust and real-time.
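One way to keep such an inventory is as structured data that you can sort and query. The sketch below is an assumption about how that could look; the service names and endpoints are hypothetical, and the fields simply mirror the three points above.

```python
from dataclasses import dataclass

ALL_SIGNALS = frozenset({"metrics", "logs", "traces"})


@dataclass(frozen=True)
class MonitoredService:
    name: str
    signals: frozenset        # telemetry collected today, a subset of ALL_SIGNALS
    scrape_endpoints: tuple   # endpoints the pull-based system scrapes today
    critical: bool


def migration_order(services):
    """Critical services first, then alphabetical for a stable plan."""
    return sorted(services, key=lambda s: (not s.critical, s.name))


def missing_signals(service):
    """Signals the service would gain once instrumented with OpenTelemetry."""
    return sorted(ALL_SIGNALS - service.signals)


# Hypothetical example entries.
inventory = [
    MonitoredService("reporting", frozenset({"metrics"}),
                     ("http://reporting:9100/metrics",), False),
    MonitoredService("checkout", frozenset({"metrics", "logs"}),
                     ("http://checkout:8080/metrics",), True),
]

for svc in migration_order(inventory):
    print(svc.name, "gains:", missing_signals(svc))
```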
2. Instrumentation
The biggest technical change in moving to OpenTelemetry will be instrumenting your code to emit telemetry data. In a pull-based system, your services likely expose HTTP endpoints that provide metrics in a specific format (like Prometheus). In OpenTelemetry, you’ll need to update your services to:
- Use OpenTelemetry SDKs to emit traces and metrics.
- Ensure context propagation across distributed systems, so traces can link together transactions that span multiple services.
For existing applications, auto-instrumentation might be available for some languages and frameworks (such as Java, Python, Node.js, and .NET). However, for custom applications, manual instrumentation may be required.
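Context propagation is the least familiar part for teams coming from metrics-only monitoring, so a small sketch may help. OpenTelemetry SDKs propagate context using the W3C Trace Context `traceparent` header by default, and their propagators handle this for you. The code below only illustrates the header format using the standard library; the helper names (`inject`, `extract`, and so on) are chosen for this example and are not the SDK's API.

```python
import re
import secrets

# Version 00 of the W3C traceparent header: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read (trace_id, parent_span_id, sampled) from incoming headers, or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid according to the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing_headers = {}
inject(outgoing_headers, trace_id, span_a)

# Service B continues the same trace with a new span whose parent is A's span.
b_trace_id, b_parent_id, b_sampled = extract(outgoing_headers)
span_b = new_span_id()
print(b_trace_id == trace_id, b_parent_id == span_a, b_sampled)
```

Because both services share the same trace ID and service B records A's span as its parent, a tracing backend can stitch the two spans into one request path.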
3. Collector and Exporters
The OpenTelemetry Collector will be a central component in your migration. The Collector receives telemetry data from your instrumented services and processes it before sending it to the appropriate backend systems. You'll need to configure the Collector with:
- Receivers: These accept telemetry data in different formats from your services.
- Processors: These handle transformations, filtering, and batching of data.
- Exporters: These send the telemetry data to your chosen observability platform(s).
If you're currently using Prometheus, for instance, you can configure the Collector to export metrics in a Prometheus-compatible format, allowing you to continue using existing monitoring tools during and after the migration.
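The receiver, processor, and exporter roles can be pictured as a small data pipeline. The following is a conceptual model only, not how the Collector is implemented: the function names, the `debug_` naming convention, and the `env` attribute are all assumptions made for this sketch.

```python
def drop_debug_metrics(data: list) -> list:
    """Processor (filtering): drop metrics whose name starts with 'debug_'."""
    return [m for m in data if not m["name"].startswith("debug_")]


def add_environment(data: list) -> list:
    """Processor (transformation): attach an attribute to every metric."""
    return [{**m, "env": "staging"} for m in data]


def run_pipeline(received, processors, exporters, batch_size=2):
    """Apply each processor in order, then hand batches to every exporter."""
    data = list(received)
    for process in processors:
        data = process(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        for export in exporters:
            export(batch)


# Two hypothetical backends, modelled as lists that collect exported batches.
backend_a, backend_b = [], []

# What a receiver might have accepted from instrumented services.
incoming = [
    {"name": "http_requests_total", "value": 3},
    {"name": "debug_gc_pauses", "value": 9},
    {"name": "http_errors_total", "value": 1},
]

run_pipeline(
    incoming,
    processors=[drop_debug_metrics, add_environment],
    exporters=[backend_a.append, backend_b.append],
)
print(backend_a)
```

The same processed batch reaches both backends, which is what lets you keep an existing tool running while you introduce a new one.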
4. Transition Plan
You should plan for a phased migration that minimizes disruptions to your monitoring setup. Here's a basic phased approach:
- Phase 1: Dual Operation: During this phase, you can run both the pull-based system and OpenTelemetry in parallel. This will allow you to compare the telemetry data from both systems and ensure that the new instrumentation is working correctly.
- Phase 2: Migrate Critical Services: Begin migrating critical services that are key to your business operations. This will help identify any issues early in the process without impacting your entire infrastructure.
- Phase 3: Full Transition: Once you’re confident in the OpenTelemetry setup, complete the migration for all services, eventually retiring the pull-based system.
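During the dual-operation phase, the comparison between the two systems can be partly automated. The sketch below assumes you can obtain a snapshot of metric names and values from each system as a plain dictionary; how you fetch those snapshots, the metric names, and the 5% tolerance are all assumptions to adapt to your setup.

```python
import math


def compare_snapshots(legacy: dict, otel: dict, rel_tol: float = 0.05) -> dict:
    """Report metrics that are missing, new, or differ by more than rel_tol."""
    report = {"missing_in_otel": [], "only_in_otel": [], "mismatched": []}
    for name, legacy_value in legacy.items():
        if name not in otel:
            report["missing_in_otel"].append(name)
        elif not math.isclose(legacy_value, otel[name], rel_tol=rel_tol):
            report["mismatched"].append((name, legacy_value, otel[name]))
    report["only_in_otel"] = sorted(set(otel) - set(legacy))
    return report


# Hypothetical snapshots taken at roughly the same moment.
legacy_snapshot = {"http_requests_total": 1000, "http_errors_total": 12, "queue_depth": 40}
otel_snapshot = {"http_requests_total": 1010, "http_errors_total": 20, "db_latency_ms": 35}

report = compare_snapshots(legacy_snapshot, otel_snapshot)
print(report)
```

Small differences are expected because the two systems sample at different moments, which is why the check uses a relative tolerance rather than strict equality.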
4. Technical Steps for Migrating to OpenTelemetry
Once the planning is in place, the actual migration process can begin. Below is a detailed step-by-step guide to transitioning your monitoring system from pull-based to OpenTelemetry.
Step 1: Set Up OpenTelemetry Collector
The first technical step is to deploy the OpenTelemetry Collector in your environment. The Collector can be deployed as a standalone process, as a sidecar, or even as a gateway. Your deployment model will depend on your architecture, but for most distributed systems, deploying it as a gateway or as part of a centralized observability pipeline is common.
1. Install the Collector: Follow the OpenTelemetry Collector installation instructions for your platform.
2. Configure Receivers: Configure the Collector to receive telemetry data from your services. For example, if you’re migrating from Prometheus, you can set up a Prometheus receiver to scrape Prometheus-compatible endpoints while instrumenting services with OpenTelemetry.
3. Set Up Exporters: Configure exporters to send telemetry data to your chosen observability backend (such as Jaeger for tracing, Prometheus for metrics, or Loki for logs).
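As an illustration of the receiver and exporter setup described above, the sketch below builds a Collector configuration as a Python dictionary and checks that each pipeline only references components that are defined. The top-level keys (`receivers`, `processors`, `exporters`, `service.pipelines`) follow the Collector's configuration layout; the job name, target, and ports are hypothetical, and whether the `prometheus` receiver and exporter are available depends on the Collector distribution you run.

```python
import json

collector_config = {
    "receivers": {
        # Accept OTLP pushed by services that are already instrumented.
        "otlp": {"protocols": {"grpc": {}, "http": {}}},
        # Keep scraping services that still expose a Prometheus endpoint.
        "prometheus": {
            "config": {
                "scrape_configs": [
                    {
                        "job_name": "legacy-services",  # hypothetical job name
                        "scrape_interval": "15s",
                        "static_configs": [{"targets": ["checkout:8080"]}],
                    }
                ]
            }
        },
    },
    "processors": {"batch": {}},
    "exporters": {
        # Expose metrics in Prometheus format so existing tools keep working.
        "prometheus": {"endpoint": "0.0.0.0:8889"},
    },
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp", "prometheus"],
                "processors": ["batch"],
                "exporters": ["prometheus"],
            }
        }
    },
}


def undefined_components(config: dict) -> list:
    """List pipeline entries that refer to a component missing from the config."""
    problems = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    problems.append(
                        f"{pipeline_name}: {kind[:-1]} '{component}' is not defined"
                    )
    return problems


print(undefined_components(collector_config))
print(json.dumps(collector_config, indent=2))
```

The Collector is normally configured with a YAML file. The dictionary above maps directly onto that structure, but validate the result against the documentation for your Collector version before relying on it.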
Step 2: Instrument Services for Metrics
Once the Collector is set up, the next step is to instrument your services to push metrics directly to the Collector. If you are using Prometheus as your pull-based system, replace the Prometheus client instrumentation and scrape endpoint in each service with the OpenTelemetry SDK for your programming language.
1. Auto-Instrumentation: In many cases, you can use auto-instrumentation libraries for popular frameworks and languages like Java, Python, or Go to automatically start collecting metrics with minimal code changes.
2. Manual Instrumentation: For custom or less common frameworks, you will need to add manual instrumentation using the OpenTelemetry SDK. You will define the metrics you want to collect (e.g., request latencies, error counts) and send them to the OpenTelemetry Collector.
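The shape of manual instrumentation is roughly: define instruments, record measurements where the work happens, and export the aggregated results. The sketch below shows that flow with hypothetical stand-in classes (`SketchCounter`, `SketchHistogram`) rather than the OpenTelemetry SDK. The method names `add` and `record` mirror the ones the OpenTelemetry metrics API uses for counters and histograms, but the metric names, bucket bounds, and export function are assumptions.

```python
import bisect
from collections import defaultdict


def _key(attributes):
    """Turn an attribute dict into a hashable, order-independent key."""
    return tuple(sorted((attributes or {}).items()))


class SketchCounter:
    """Hypothetical monotonic counter, aggregated per attribute set."""

    def __init__(self, name):
        self.name = name
        self._totals = defaultdict(int)

    def add(self, amount, attributes=None):
        if amount < 0:
            raise ValueError("counters only go up")
        self._totals[_key(attributes)] += amount

    def collect(self):
        return [{"name": self.name, "attributes": dict(k), "value": v}
                for k, v in self._totals.items()]


class SketchHistogram:
    """Hypothetical histogram with explicit, upper-inclusive bucket bounds."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        self._buckets = defaultdict(lambda: [0] * (len(self.bounds) + 1))

    def record(self, value, attributes=None):
        index = bisect.bisect_left(self.bounds, value)
        self._buckets[_key(attributes)][index] += 1

    def collect(self):
        return [{"name": self.name, "attributes": dict(k),
                 "bounds": self.bounds, "counts": counts}
                for k, counts in self._buckets.items()]


def export(instruments, send):
    """Gather all data points and hand them to `send` (e.g. a Collector client)."""
    payload = [point for inst in instruments for point in inst.collect()]
    send(payload)
    return payload


requests = SketchCounter("http.server.requests")
latency = SketchHistogram("http.server.duration_ms", bounds=[50, 200, 1000])

# Record measurements where requests are handled.
for status, duration_ms in [(200, 12), (200, 180), (500, 950), (200, 50)]:
    requests.add(1, {"status": status})
    latency.record(duration_ms)

sent = []  # stand-in for the network call to the Collector
export([requests, latency], sent.append)
print(sent[0])
```

With the real SDK, the export step is handled by a metric reader and exporter that you configure once at startup, so application code only deals with the instruments.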
Step 3: Instrument Services for Tracing
Tracing is a powerful capability provided by OpenTelemetry and a major upgrade from traditional metrics-only systems. Traces allow you to follow a request as it flows through your system, which is especially useful in microservices architectures.
1. Auto-Instrumentation for Tracing: Similar to metrics, many libraries offer auto-instrumentation for tracing. This can automatically trace incoming HTTP requests, outgoing API calls, database queries, and more.