The Impact of Events on Observability at Booking.com
Booking.com runs various distributed services across cloud and on-premises environments, each playing a different role. For instance, “Service X” might handle order processing, while “Service Y” manages inventory. All these services require monitoring to ensure they are performing well and are always available.
Booking.com relies on three kinds of signals to monitor these systems: metrics, logs, and traces. To meet most of its observability needs, the company uses an in-house system called Booking.com Events. This system generates traces, logs, and many metrics, and it handles tens of millions of events every second.
What is an Event?
An Event is a set of key-value pairs that stores detailed information about a specific task or action. Events are at the core of Booking.com's observability system.
For example, an Event might represent an HTTP request and could include details like any errors or warnings that occurred during the request, how long it took to process, the number of database queries made, and their latency. It can also include information on A/B tests or other application-specific data.
An Event might look something like this:
{
    "availability_zone": "new_york",
    "created_epoch": "1724567890.1234",
    "service_name": "service_X",
    "git_commit_sha": "abc123xyz",
    …
}
At first glance, Events may seem similar to structured logs, but there are key differences. Logs tend to focus on individual error or status messages, possibly with some extra context. In contrast, Events gather data over time, pulling in information from various parts of the task they’re tracking.
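The difference from a log line can be made concrete with a small sketch. The Booking.com Events library is internal and its API is not described here, so the `Event` class, its method names, and field names such as `db_query_count` below are hypothetical. The sketch only illustrates the idea of one record that accumulates data while a task runs and is emitted once at the end.

```python
import json
import time


class Event:
    """Hypothetical Event: one flat key-value record built up during a task."""

    def __init__(self, service_name: str) -> None:
        self._started = time.monotonic()
        self.fields = {
            "service_name": service_name,
            "created_epoch": time.time(),
            "warnings": [],
            "db_query_count": 0,
            "db_latency_ms": 0.0,
        }

    def add(self, key: str, value) -> None:
        # Attach any application-specific detail (status code, A/B test, ...).
        self.fields[key] = value

    def record_db_query(self, latency_ms: float) -> None:
        # Accumulate database activity instead of logging each query.
        self.fields["db_query_count"] += 1
        self.fields["db_latency_ms"] += latency_ms

    def warn(self, message: str) -> None:
        self.fields["warnings"].append(message)

    def finish(self) -> str:
        # Emit a single record once the task is complete.
        self.fields["duration_ms"] = (time.monotonic() - self._started) * 1000.0
        return json.dumps(self.fields)


def handle_request() -> str:
    """Simulated HTTP request that enriches one Event as it proceeds."""
    event = Event("service_X")
    event.add("http_status", 200)
    event.record_db_query(3.2)
    event.record_db_query(1.8)
    event.warn("cache miss")
    return event.finish()


if __name__ == "__main__":
    print(handle_request())
```

A structured logger would typically write one line per query or warning. Here everything about the request ends up in a single record, which is what makes later aggregation straightforward.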
The Role of Events
Booking.com uses Events because they provide a complete picture of what’s happening with a task, whether it’s an HTTP request, a scheduled job, or something else. Events capture everything from user inputs to performance details and the environment where the task is running. This data is then used to produce the traditional monitoring signals: metrics, logs, and traces. It also lets Booking.com run analytics directly on the Event data.
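As a rough illustration of how metrics can be derived from Events, the sketch below aggregates a batch of Event records into per-service request counts, error counts, and mean duration. It is not Booking.com's actual pipeline. The field names `http_status` and `duration_ms`, and the rule that a status of 500 or above counts as an error, are assumptions made for the example.

```python
from collections import defaultdict


def derive_metrics(events):
    """Aggregate Events into per-service request, error, and latency metrics."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "duration_ms": 0.0})
    for event in events:
        bucket = totals[event["service_name"]]
        bucket["requests"] += 1
        # Assumption: an HTTP status of 500 or above marks a failed request.
        if event.get("http_status", 200) >= 500:
            bucket["errors"] += 1
        bucket["duration_ms"] += event.get("duration_ms", 0.0)
    return {
        service: {
            "requests": b["requests"],
            "errors": b["errors"],
            "mean_duration_ms": b["duration_ms"] / b["requests"],
        }
        for service, b in totals.items()
    }


SAMPLE_EVENTS = [
    {"service_name": "service_X", "http_status": 200, "duration_ms": 10.0},
    {"service_name": "service_X", "http_status": 503, "duration_ms": 30.0},
    {"service_name": "service_Y", "http_status": 200, "duration_ms": 5.0},
]

if __name__ == "__main__":
    for service, metrics in derive_metrics(SAMPLE_EVENTS).items():
        print(service, metrics)
```

Because each Event already carries the duration and outcome of its task, the metric is just a fold over the records. No separate instrumentation call is needed at the point where the work happens.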
Events help answer complex questions that involve multiple systems. For example, if there’s an issue during the flight booking process, Events can help engineers figure out whether it affects only certain users, whether bots are causing the problem, or whether it’s related to an ongoing experiment on the platform.
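Questions like these amount to slicing the same Events along different dimensions. The sketch below computes an error rate per value of any chosen field. The fields `is_bot`, `experiment`, and `error` are hypothetical names chosen for illustration, and the sample data is made up.

```python
from collections import Counter


def error_rate_by(events, key):
    """Return the fraction of failed Events for each value of `key`."""
    totals = Counter()
    errors = Counter()
    for event in events:
        value = event.get(key, "unknown")
        totals[value] += 1
        if event.get("error"):
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Made-up sample: failures cluster in one experiment, not among bots.
BOOKING_EVENTS = [
    {"is_bot": False, "experiment": "new_checkout", "error": True},
    {"is_bot": False, "experiment": "new_checkout", "error": True},
    {"is_bot": False, "experiment": "control", "error": False},
    {"is_bot": True, "experiment": "control", "error": False},
]

if __name__ == "__main__":
    print(error_rate_by(BOOKING_EVENTS, "experiment"))
    print(error_rate_by(BOOKING_EVENTS, "is_bot"))
```

In this sample, grouping by `experiment` shows every failure falling in one variant, while grouping by `is_bot` shows bot traffic is unaffected. The same records answer both questions because each Event carries all of its context.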
Because Events contain detailed information from many parts of the system, they make it possible to see data that spans different software components at Booking.com.
How Does Booking.com Use Events for Observability?
In Kubernetes, applications create Events using the “Events library.” These Events are then sent to the Event-proxy daemon running on the host machine. The Event-proxy performs three key tasks:
The process is similar for bare-metal servers. However, cloud-native platforms, like serverless environments (e.g., AWS Lambda), use different tools like OpenTelemetry and CloudWatch instead of the Events system.
Once the Event-proxy sends Events to Kafka clusters, several consumers start processing them for different purposes. Here are three important ones: