Real Time Infra for Petabytes of Data a Day at Uber
Uber has paved the way in showing how to both:
- build the infrastructure to support real-time data
- leverage that data
Each day, they process trillions of messages and petabytes of data.
These figures are from October 2020.
As a ballpark example: 3 trillion messages worth 3 petabytes a day works out to roughly 34.7 million messages/s and 34.7 GB/s.
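That back-of-the-envelope arithmetic can be checked in a few lines (a sketch, taking a petabyte as 10^15 bytes):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

messages_per_day = 3e12  # 3 trillion messages
bytes_per_day = 3e15     # 3 petabytes, taken as 10^15 bytes

msgs_per_sec = messages_per_day / SECONDS_PER_DAY
gb_per_sec = bytes_per_day / SECONDS_PER_DAY / 1e9

print(f"{msgs_per_sec / 1e6:.1f} million messages/s")  # ~34.7 million messages/s
print(f"{gb_per_sec:.1f} GB/s")                        # ~34.7 GB/s
```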
That is absurdly large scale.
But why do they do that? Why go to the lengths of transferring & processing so much data?
Well, the first question we should ask is: what is real-time data infrastructure?
It’s good for serving use cases that require large amounts of data to be answered in a timely manner.
How else would you be able to answer simple, first-order questions about what’s happening in your business right now?
But once you provide engineers with access to real-time data, a lot of ideas start to flourish.
Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.
So how do you scale and model your infrastructure?
According to the use cases!
A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale, in cost or in maintainability.
Thankfully, technologies today allow us to get the best of both worlds.
And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too.
So how did Uber do it?
They stacked a bunch of technologies on top of one another, in a logical way.
Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.
Here are four vastly different examples from Uber, and how they solved each:
1. UberEats Restaurant Manager
If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!
The number of customers by hour, the money made, trends, customer reviews, etc. - so that you can tell what is happening.
This use case requires fresh data served with low query latency.
The good thing?
The types of queries are fixed!
You know precisely what the web app is going to request.
How did they solve this?
They use Pinot with pre-aggregated indices over the large volume of raw records - this greatly reduces the latency of pulling the data.
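The idea behind pre-aggregation can be sketched in plain Python (a toy rollup with made-up restaurant IDs and fields - Pinot does this at ingestion time via its indices, not like this):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw order events: (restaurant_id, timestamp, order amount)
raw_orders = [
    ("r1", datetime(2020, 10, 1, 12, 5), 18.50),
    ("r1", datetime(2020, 10, 1, 12, 40), 9.99),
    ("r1", datetime(2020, 10, 1, 13, 10), 25.00),
]

# Pre-aggregate into one row per (restaurant, hour) instead of one row
# per order - the fixed dashboard queries then scan the small rollup,
# not the huge raw table, which is what keeps latency low.
rollup = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
for restaurant, ts, amount in raw_orders:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    rollup[(restaurant, hour)]["orders"] += 1
    rollup[(restaurant, hour)]["revenue"] += amount

for (restaurant, hour), agg in sorted(rollup.items()):
    print(restaurant, hour, agg["orders"], round(agg["revenue"], 2))
```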
A fun story: their CEO once saw inaccurate financial data in a related internal dashboard and opened a Jira ticket to the platform team saying “this number is wrong”.
That is the only Jira their CEO ever opened!
2. DeepETA
It's 2023. Did you think we wouldn't mention AI?
This time - it’s machine learning.
The way Uber gives you the ETA is two-fold: a routing engine first produces a base estimate, and the DeepETA machine learning model then predicts the residual - the correction between that base estimate and reality.
This ML model is like a pet - it constantly needs to be fed and fine-tuned in order to not only stay accurate but also improve (reduce its error rate).
A real-time prediction monitoring pipeline performs an ongoing audit (live measurement) of the model’s accuracy.
It does this by joining the ML predictions with the observed outcomes from the data pipeline.
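A toy version of that join might look like the following (hypothetical trip IDs and field meanings - the real pipeline runs this as a continuous streaming join, not a batch over dicts):

```python
# Join ETA predictions with observed trip outcomes, keyed by trip_id,
# and compute the absolute error per trip plus a running accuracy metric.
predictions = {"t1": 300.0, "t2": 600.0}  # trip_id -> predicted ETA (seconds)
outcomes    = {"t1": 320.0, "t2": 540.0}  # trip_id -> actual duration (seconds)

# Inner join on trip_id: only trips that have both a prediction
# and an observed outcome contribute to the audit.
errors = {
    trip: abs(outcomes[trip] - predictions[trip])
    for trip in predictions.keys() & outcomes.keys()
}
mae = sum(errors.values()) / len(errors)  # mean absolute error

print(errors)  # {'t1': 20.0, 't2': 60.0}
print(mae)     # 40.0
```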
The focus of this use case is one thing - scalability.
Uber has thousands of ML models, each with hundreds of features, that collectively serve several hundreds of thousands of time series, each with millions of data points, computed per second.
Uber shared that this is the highest queries-per-second post-processing model in their stack.
There is an absurd amount of volume and cardinality of data that goes into it.
How did they solve this?
Flink.
A very large streaming job aggregates the metrics & detects prediction abnormalities.
They also leverage pre-aggregated Pinot tables here for faster latency.
3. Surge Pricing
Uber has the notion of surge pricing - a key feature that raises or lowers prices according to the real-time supply & demand of riders & drivers.
It’s used to balance demand through incentives for drivers (more money to earn) and disincentives for riders (rides cost more) so that the service remains reliable.
To know when to raise prices, the server needs an up-to-date view of rider demand and driver supply in each area.
But it can afford to be eventually consistent.
How did they solve this?
Kafka → a machine learning algorithm in Flink → a key-value store for quick lookups.
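As a toy illustration of the last two stages - a made-up surge formula and an in-memory dict standing in for the key-value store (Uber's actual pricing model is far more sophisticated):

```python
# Hypothetical surge formula: demand/supply ratio per area, clamped to a range.
def surge_multiplier(demand: int, supply: int,
                     lo: float = 1.0, hi: float = 3.0) -> float:
    if supply == 0:
        return hi
    return max(lo, min(hi, demand / supply))

# "Flink stage": compute a multiplier per area from the latest counts...
latest_counts = {"downtown": (90, 50), "airport": (500, 50), "suburbs": (40, 50)}
kv_store = {}  # stand-in for the real key-value store
for area, (demand, supply) in latest_counts.items():
    kv_store[area] = surge_multiplier(demand, supply)

# ...so the pricing path only does a cheap, eventually consistent lookup.
print(kv_store["downtown"])  # 1.8
print(kv_store["airport"])   # 3.0 (clamped at the cap)
print(kv_store["suburbs"])   # 1.0 (enough drivers, no surge)
```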
4. Internal Dashboard - Ad-Hoc Exploration
Last but not least - when there is a customer ticket, then two, then ten more - how do you figure out what’s happening and whether it’s all related?
Uber provides internal dashboards that support ad-hoc queries for ops personnel to understand what’s happening in the system in real-time.
The big requirement here is great flexibility - you need to support full ANSI SQL: joins, group-bys, subqueries, etc.
You can afford to skimp on other aspects, such as query latency and data freshness.
How did they solve this?
Presto.
It supports joining data from completely different sources via standard SQL.
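For flavor, here is the kind of federated ANSI SQL such an ad-hoc query might look like, held as a Python string (the catalog, schema, and column names are all hypothetical, not Uber's actual schemas):

```python
# Presto-style federated query: joining a table in one catalog (e.g. Hive)
# with a table in another (e.g. Pinot) via catalog-qualified names.
ADHOC_QUERY = """
SELECT t.city_id,
       COUNT(*)         AS tickets,
       AVG(p.eta_error) AS avg_eta_error
FROM hive.support.tickets AS t
JOIN pinot.metrics.eta_errors AS p
  ON t.trip_id = p.trip_id
WHERE t.created_at > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY t.city_id
ORDER BY tickets DESC
"""
print(ADHOC_QUERY)
```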
These were four example categories of use cases in Uber.
Each has different requirements - some strict, some not.
As it goes to show, the requirements of any modern large business are vast and complex - and so is the underlying data infrastructure.
Good architecture is modeled in such a way that it can continue to scale and answer the demands of diverse use cases that grow as fast as the business does.
Interested in more about how Uber built their real-time infra?
I learned most of this information from the following interview with Chinmay Soman and Gwen (Chen) Shapira. Check it out here: