The Impact of Events on Observability in Booking.com

Booking.com runs various distributed services across cloud and on-premises environments, each playing a different role. For instance, “Service X” might handle order processing, while “Service Y” manages inventory. All these services require monitoring to ensure they are performing well and are always available.

They rely on three signals to monitor these systems: metrics, logs, and traces. To meet most of their observability needs, they use an in-house system called Booking.com Events. It generates traces, logs, and many metrics, and handles tens of millions of Events every second.

What is an Event?

An Event is a set of key-value pairs that stores detailed information about a specific task or action. Events are at the core of Booking.com's observability system.

For example, an Event might represent an HTTP request and could include details like any errors or warnings that occurred during the request, how long it took to process, the number of database queries made, and their latency. It can also include information on A/B tests or other application-specific data.

An Event might look something like this:

{
  "availability_zone": "new_york",
  "created_epoch": "1724567890.1234",
  "service_name": "service_X",
  "git_commit_sha": "abc123xyz",
  …
}

At first glance, Events may seem similar to structured logs, but there are key differences. Logs tend to focus on individual error or status messages, possibly with some extra context. In contrast, Events gather data over time, pulling in information from various parts of the task they’re tracking.
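To make the difference from a log line concrete, here is a minimal Python sketch of an Event that is filled in gradually while a request runs. The Event class, its method names, and fields such as db_queries and http_status are hypothetical illustrations, not the actual Booking.com Events library.

```python
import json
import time


class Event:
    """Hypothetical accumulator: one Event per task, filled in as the task runs."""

    def __init__(self, service_name, action):
        self.fields = {
            "service_name": service_name,
            "action": action,
            "created_epoch": time.time(),
            "db_queries": 0,
            "db_latency_ms": 0.0,
            "warnings": [],
            "errors": [],
        }

    def add(self, key, value):
        # Attach any application-specific key-value pair.
        self.fields[key] = value

    def record_db_query(self, latency_ms):
        # Aggregate database activity instead of logging each query separately.
        self.fields["db_queries"] += 1
        self.fields["db_latency_ms"] += latency_ms

    def warn(self, message):
        self.fields["warnings"].append(message)

    def to_json(self):
        # Serialize once, when the task is finished.
        return json.dumps(self.fields)


def handle_request():
    """Simulated request: different parts of the code contribute to one Event."""
    event = Event("service_X", "search")
    event.record_db_query(3.5)
    event.record_db_query(1.5)
    event.warn("cache miss")
    event.add("http_status", 200)
    return event
```

A structured logger would typically emit one line per query or warning. Here a single record is emitted at the end and carries all of that context together.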


The Role of Events

Booking.com uses Events because they provide a complete picture of what’s happening with a task, whether it’s an HTTP request, a scheduled job, or something else. Events capture everything from user inputs to performance details and the environment where the task is running. This data is then used to derive the traditional monitoring signals: metrics, logs, and traces. It also enables them to run analytics on the Event data.

Events help answer complex questions that involve multiple systems. For example, if there’s an issue during the flight booking process, Events can help determine whether it affects only certain users, whether bots are causing the problem, or whether it’s related to an ongoing experiment on the platform.
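As an illustration of this kind of question, the sketch below groups Events by one field and computes a failure rate per group. The field names (is_bot, experiment, errors) and the sample data are assumptions made up for the example. The source does not describe the real schema or analytics tooling.

```python
from collections import defaultdict


def failure_rate_by(events, dimension):
    """Return {value of `dimension`: share of Events that carry errors}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        key = event.get(dimension, "unknown")
        totals[key] += 1
        if event.get("errors"):
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


# Hypothetical Events from a booking flow.
sample_events = [
    {"is_bot": False, "experiment": "new_checkout", "errors": ["timeout"]},
    {"is_bot": False, "experiment": "new_checkout", "errors": ["timeout"]},
    {"is_bot": False, "experiment": "control", "errors": []},
    {"is_bot": True, "experiment": "control", "errors": []},
]

by_experiment = failure_rate_by(sample_events, "experiment")
by_bot = failure_rate_by(sample_events, "is_bot")
```

In this made-up data, every failure falls in the new_checkout group and none come from bots, which would point at the experiment rather than at bot traffic.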

Because Events contain detailed information from many parts of the system, they give a view of data that spans different software components at Booking.com.


How Does Booking.com Use Events for Observability?

In Kubernetes, applications create Events using the “Events library.” These Events are then sent to the Event-proxy daemon running on the host machine. The Event-proxy performs three key tasks:

  1. Adds Metadata: It enriches the Event with additional details, like the physical host where it was received.
  2. Routes Events: It sends Events to specific Kafka topics based on custom rules. For example, Events from the order service go to the order-related Kafka topic.
  3. Splits Messages: It breaks down a single Kafka message into smaller ones to make them easier to process.
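The three tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the routing table, topic names, the received_on_host field, and the max_events limit are hypothetical, a message is modelled as a plain list of Events, and the step that actually produces to Kafka is left out.

```python
import socket

# Hypothetical routing rules: service name -> Kafka topic.
ROUTES = {"order_service": "events.orders"}
DEFAULT_TOPIC = "events.default"


def enrich(event, hostname=None):
    """Task 1: add metadata, here the host that received the Event."""
    enriched = dict(event)  # copy so the caller's Event is not mutated
    enriched["received_on_host"] = hostname or socket.gethostname()
    return enriched


def route(event):
    """Task 2: pick a topic based on custom rules."""
    return ROUTES.get(event.get("service_name"), DEFAULT_TOPIC)


def split(events, max_events):
    """Task 3: break one large message into smaller ones."""
    return [events[i:i + max_events] for i in range(0, len(events), max_events)]


def process(message, hostname, max_events=2):
    """Return (topic, events) pairs that a producer could send to Kafka."""
    by_topic = {}
    for event in message:
        enriched = enrich(event, hostname)
        by_topic.setdefault(route(enriched), []).append(enriched)
    outgoing = []
    for topic, events in by_topic.items():
        for chunk in split(events, max_events):
            outgoing.append((topic, chunk))
    return outgoing
```

Calling process with a message of three order_service Events and one Event from another service yields two messages for events.orders and one for events.default.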

The process is similar for bare-metal servers. However, cloud-native platforms, like serverless environments (e.g., AWS Lambda), use different tools like OpenTelemetry and CloudWatch instead of the Events system.

Once the Event-proxy sends Events to Kafka clusters, several consumers start processing them for different purposes. Here are three important ones:

  1. Distributed Tracing Consumer: This handles tracing for distributed systems and sends span data to Honeycomb.
  2. APM Generator: It creates various application performance monitoring (APM) metrics and stores them in Graphite. For example, it tracks the number of actions for each app or failure rates.
  3. Failed Event Processor: This focuses on Events that contain errors or warnings, writing them to Elasticsearch.
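Two of these consumers can be pictured as simple functions over a batch of Events. The sketch below is a rough illustration only: the metric names and the errors and warnings fields are assumptions, and the steps that read from Kafka and write to Graphite or Elasticsearch are omitted.

```python
from collections import Counter


def apm_metrics(events):
    """Count actions and failures per service, as an APM generator might."""
    counts = Counter()
    for event in events:
        service = event.get("service_name", "unknown")
        counts[f"{service}.actions"] += 1
        if event.get("errors"):
            counts[f"{service}.failures"] += 1
    return counts


def failed_events(events):
    """Keep only Events that carry at least one error or warning."""
    return [e for e in events if e.get("errors") or e.get("warnings")]


# Hypothetical batch consumed from a Kafka topic.
batch = [
    {"service_name": "service_X", "errors": [], "warnings": []},
    {"service_name": "service_X", "errors": ["db timeout"], "warnings": []},
    {"service_name": "service_Y", "errors": [], "warnings": ["slow query"]},
]

metrics = apm_metrics(batch)
to_index = failed_events(batch)
```

A failure rate for a service then follows by dividing its failures counter by its actions counter.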

