Real Time Infra for Petabytes of Data a Day at Uber

Uber has paved the way in showing how to both:

  • build infrastructure to support massive amounts of data.
  • leverage that data in diverse, often conflicting, use cases.

Each day, they process:

  1. TRILLIONS of messages,
  2. worth PETABYTES of data.

These figures are from October 2020.

As a ballpark example - with 3 trillion messages worth 3 petabytes a day, that works out to roughly 34.7 million messages/s and 34.7 GB/s.
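A quick back-of-the-envelope check of those figures in Python:

```python
# Sanity-check the ballpark numbers above.
messages_per_day = 3e12          # 3 trillion messages
bytes_per_day = 3e15             # 3 petabytes
seconds_per_day = 24 * 60 * 60   # 86,400 seconds

print(messages_per_day / seconds_per_day / 1e6)  # ~34.7 million messages/s
print(bytes_per_day / seconds_per_day / 1e9)     # ~34.7 GB/s
```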

That is absurdly large scale.

But why do they do that? Why go to such lengths to transfer & process so much data?

Well. The first question we should ask is - what is real-time data infrastructure good for?

It’s good for serving use cases that require a lot of data to be answered in a timely manner.

How else would you be able to do things like:

  • match a rider and a driver?
  • optimize the route in real time?
  • provide a real-time adjusted ETA on when the driver will arrive & when you'll get there?
  • know what's happening in your global business?

These are simple, first-order questions.

But once you provide engineers with access to real-time data, a lot of ideas start to flourish.

Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.

So how do you scale and model your infrastructure?

According to the use cases!

A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale, in either cost or maintainability.

Thankfully, today’s technologies allow us to get the best of both worlds - shared infrastructure that still meets each use case’s requirements.

And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too.

So how did Uber do it?

They stacked a bunch of technologies on top of one another, in a logical way.


Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.


Here are 4 vastly different example use cases from Uber, and how they solved each:


1. UberEats Restaurant Manager

If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!

The number of customers by hour, the money made, trends, customer reviews, etc. - so that you can tell what is happening.

This use case requires:

  • fresh data.
  • low latency - the data should load fast; each website page has a few of these dashboards.
  • strict accuracy on some metrics - the dashboards that show you financial data should not be wrong!

The good thing?

The types of queries are fixed!

You know precisely what the web app is going to request.


How did they solve this?

They use Apache Pinot with pre-aggregated indices over the large volume of raw records - this greatly reduces the latency of pulling the data.
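To make that concrete, here is a minimal sketch of what the query side could look like with the `pinotdb` Python client - the broker address, table, and column names are made-up examples, not Uber's actual schema:

```python
from pinotdb import connect

# Connect to a (hypothetical) Pinot broker.
conn = connect(host="pinot-broker.internal", port=8099,
               path="/query/sql", scheme="http")
cursor = conn.cursor()

# The dashboard's queries are fixed, and the table is pre-aggregated by
# (restaurant, hour) - so each query scans few rows and returns quickly.
cursor.execute("""
    SELECT hourBucket,
           SUM(orderCount)   AS orders,
           SUM(grossRevenue) AS revenue
    FROM restaurantMetrics
    WHERE restaurantId = 42
      AND hourBucket >= 1690000000
    GROUP BY hourBucket
    ORDER BY hourBucket
""")

for row in cursor:
    print(row)
```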

A fun story: their CEO once saw inaccurate financial data in a related internal dashboard and opened a Jira ticket to the platform team saying “this number is wrong”.

That is the only Jira their CEO ever opened!

[Image: AI Travis Kalanick is not impressed]



2. DeepETA

[Image: A Mexican meme ad for Volkswagen]

It's 2023. Did you think we wouldn't mention AI?

This time - it’s machine learning.

The way Uber gives you the ETA is two-fold:

  • a routing engine creates the route and estimates the time based on the distance & real-time traffic data.
  • a machine-learning model is used to predict the difference between the routing engine’s estimated time and the real-world observed outcomes of past trips (see the sketch below).
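In other words, the served ETA is the routing engine's physical estimate plus a learned correction. A minimal sketch of that idea - the model interface and feature names are illustrative, not Uber's:

```python
def estimate_eta(trip_features, residual_model) -> float:
    """Two-stage ETA: routing engine estimate + ML-predicted residual."""
    # Stage 1: the routing engine's estimate, from distance & live traffic.
    routing_eta = trip_features["routing_engine_eta_seconds"]

    # Stage 2: the model predicts how far off the routing engine tends to
    # be for trips like this one, learned from past observed outcomes.
    predicted_residual = residual_model.predict([trip_features["vector"]])[0]

    return routing_eta + predicted_residual
```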

This ML model is like a pet - it constantly needs to be fed and fine-tuned in order to not only stay accurate but also improve (reduce its error rate).

A real-time prediction monitoring pipeline performs an ongoing audit (live measurement) of the model’s accuracy.

It does this by joining the ML predictions with the observed outcomes from the data pipeline.

The focus of this use case is one thing - scalability.

Uber has thousands of ML models, each with hundreds of features, that collectively serve several hundreds of thousands of time series, each with millions of data points, computed per second.

Uber shared that this is the highest queries-per-second post-processing model in their stack.

There is an absurd volume and cardinality of data that goes into it.

How did they solve this?

Flink.

A very large streaming job aggregates metrics & detects prediction abnormality.

They also leverage pre-aggregated Pinot tables here for lower latency.
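Here is a minimal PyFlink sketch of the join at the heart of such a pipeline - not Uber's actual job; the two streaming tables (assumed to be Kafka-backed and registered elsewhere) and their columns are hypothetical:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assume two streaming tables were registered elsewhere (e.g. over Kafka):
#   predictions(trip_id, model_id, predicted_eta, event_time)
#   outcomes(trip_id, actual_eta, event_time)
# An interval join pairs each prediction with the outcome later observed
# for the same trip, and emits the per-model absolute error.
t_env.execute_sql("""
    CREATE TEMPORARY VIEW prediction_error AS
    SELECT p.model_id,
           ABS(p.predicted_eta - o.actual_eta) AS abs_error,
           o.event_time
    FROM predictions AS p
    JOIN outcomes AS o
      ON p.trip_id = o.trip_id
     AND o.event_time BETWEEN p.event_time
                          AND p.event_time + INTERVAL '2' HOUR
""")

# Downstream, this view would be aggregated per model and time window,
# and checked against thresholds to flag prediction abnormalities.
```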


3. Surge Pricing

Uber has the notion of surge pricing - a key feature that raises or lowers prices according to the real-time supply & demand of riders & drivers.

It’s used to balance supply & demand through incentives (drivers earn more) and disincentives (riders pay more) so that the service remains reliable.

[Image: A real-life example of the surge pricing multiplier in action]

To know when to raise prices, the server needs to:

  • have real-time data (data freshness) - you can’t make current pricing decisions on information that’s an hour old.
  • meet a strict low-latency SLA - you can’t take 10 seconds to tell the user what the price will be.
  • be highly available - if you can’t compute a price, your product doesn’t work!

But it can afford to be eventually consistent.


How did they solve this?


Kafka → a machine-learning algo in Flink → a KV store for quick lookups
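The read path is the latency-critical part: Flink continuously writes a fresh multiplier per geographic cell, and the pricing server does a single key lookup. A minimal sketch, using Redis as a stand-in for the (unspecified) KV store - the key scheme and hostname are assumptions:

```python
import redis  # a stand-in: the article doesn't say which KV store Uber uses

kv = redis.Redis(host="surge-kv.internal", port=6379)

def surge_multiplier(geo_cell: str) -> float:
    """Low-latency read path: a single key lookup per pricing request."""
    value = kv.get(f"surge:{geo_cell}")
    # Eventual consistency is acceptable here: if no multiplier has been
    # computed for this cell yet (or it expired), fall back to no surge.
    return float(value.decode()) if value is not None else 1.0
```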


4. Internal Dashboard - Ad-Hoc Exploration

Last but not least - when there is a customer ticket, then two, then 10 more - how do you figure out what’s happening and whether the incidents are related?

Uber provides internal dashboards that support ad-hoc queries for ops personnel to understand what’s happening in the system in real-time.

The big requirement here is great flexibility - you should support ANSI SQL features like joins, group-bys, subqueries, etc.

You can afford to skimp on other aspects:

  • meh latency - no need to be super fast.
  • low scale - it’s human-driven, so the QPS is in the single digits.
  • resource usage - given the low scale, you’re typically OK with the system not being very optimized in resource usage.


How did they solve this?


Presto.

It supports joining data from completely different sources via standard SQL.
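As an illustration, here is a hedged sketch using the `prestodb` Python client - the coordinator address, catalogs, and table names are invented for the example:

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",
    port=8080,
    user="ops",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()

# One ANSI SQL query joining tables from two different sources - fully
# qualified names (catalog.schema.table) pick the backing system.
cursor.execute("""
    SELECT t.city_id,
           COUNT(*)          AS open_tickets,
           AVG(m.error_rate) AS avg_error_rate
    FROM hive.support.tickets AS t
    JOIN pinot.metrics.realtime_health AS m
      ON t.city_id = m.city_id
    WHERE t.status = 'open'
    GROUP BY t.city_id
    ORDER BY open_tickets DESC
""")

print(cursor.fetchall())
```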



These were four example categories of use cases in Uber.

Each has different requirements - some strict, some not.

As this goes to show, the requirements of any modern large business are vast and complex - and so is the underlying data infrastructure.

Good architecture is modeled in such a way that it can continue to scale and answer the demands of diverse use cases that grow as fast as the business does.

Interested in more about how Uber built their real time infra?

I learned most of this information from the following interview with Chinmay Soman and Gwen (Chen) Shapira. Check it out here:

