Real Time Infra for Petabytes of Data a Day at Uber

Uber has paved the way in showing how to both:

  • build infrastructure to support massive amounts of data.
  • leverage that data in diverse, often conflicting, use cases.

Each day, they process:

  1. TRILLIONS of messages,
  2. worth PETABYTES of data.

These figures are from October 2020.

As a ballpark example - with 3 trillion messages worth 3 petabytes a day, that works out to roughly 34.7 million messages/s and 34.7 GB/s.
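A quick back-of-the-envelope check of those figures in Python:

```python
# Sanity-check the ballpark numbers above.
messages_per_day = 3e12          # 3 trillion messages
bytes_per_day = 3e15             # 3 petabytes
seconds_per_day = 24 * 60 * 60   # 86,400 seconds

print(messages_per_day / seconds_per_day / 1e6)  # ~34.7 million messages/s
print(bytes_per_day / seconds_per_day / 1e9)     # ~34.7 GB/s
```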

That is absurdly large scale.

But why do they do that? Why go to such lengths to transfer & process so much data?

Well. The first question we should ask is - what is real-time data infrastructure good for?

It’s good for serving use cases that require a lot of data to be answered in a timely manner.

How else would you be able to do things like:

  • match a rider and a driver?
  • optimize the route in real time?
  • provide a real-time adjusted ETA on when the driver will arrive & when you'll get there?
  • know what's happening in your global business?

These are simple, first-order questions.

But once you provide engineers with access to real-time data, a lot of ideas start to flourish.

Soon, you find yourself swamped with use cases that product managers want to try out, and later productionize.

So how do you scale and model your infrastructure?

According to the use cases!

A naive approach would build a new data pipeline each time a new use case pops up - but that doesn’t scale, in either cost or maintainability.

Thankfully, today’s technologies allow us to get the best of both worlds - shared infrastructure that still meets each use case’s requirements.

And since companies like Uber are typically ahead of the curve (in the amount of data that they collect and the way they use it) – chances are you will hit this soon enough in your company too.

So how did Uber do it?

They stacked a bunch of technologies on top of one another, in a logical way.


Interestingly, they shared that the stack grew chronologically in the same manner too – i.e. they started with the basic storage layer, then streaming, then compute, etc.


Here are 4 vastly different example use cases from Uber, and how they solved each:


1. UberEats Restaurant Manager

If you own a restaurant that sells via UberEats, you’d like to be able to see how well you’re doing!

The number of customers by hour, the money made, trends, customer reviews, etc. - so that you can tell what is happening.

This use case requires:

  • fresh data.
  • low latency - the data should load fast; each website page has a few of these dashboards.
  • strict accuracy on some metrics - the dashboards that show you financial data should not be wrong!

The good thing?

The types of queries are fixed!

You know precisely what the web app is going to request.


How did they solve this?

They use Apache Pinot with pre-aggregated indices over the large volume of raw records - this greatly reduces the latency of pulling the data.
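To make that concrete, here is a minimal sketch of what the query side could look like with the `pinotdb` Python client - the broker address, table, and column names are made-up examples, not Uber's actual schema:

```python
from pinotdb import connect

# Connect to a (hypothetical) Pinot broker.
conn = connect(host="pinot-broker.internal", port=8099,
               path="/query/sql", scheme="http")
cursor = conn.cursor()

# The dashboard's queries are fixed, and the table is pre-aggregated by
# (restaurant, hour) - so each query scans few rows and returns quickly.
cursor.execute("""
    SELECT hourBucket,
           SUM(orderCount)   AS orders,
           SUM(grossRevenue) AS revenue
    FROM restaurantMetrics
    WHERE restaurantId = 42
      AND hourBucket >= 1690000000
    GROUP BY hourBucket
    ORDER BY hourBucket
""")

for row in cursor:
    print(row)
```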

A fun story: their CEO once saw inaccurate financial data in a related internal dashboard and opened a Jira ticket to the platform team saying “this number is wrong”.

That is the only Jira their CEO ever opened!

[Image: AI Travis Kalanick is not impressed]



2. DeepETA

[Image: A Mexican meme ad for Volkswagen]

It's 2023. Did you think we wouldn't mention AI?

This time - it’s machine learning.

The way Uber gives you the ETA is two-fold:

  • a routing engine creates the route and estimates the time based on the distance & real-time traffic data.
  • a machine-learning model is used to predict the difference between the routing engine’s estimated time and the real-world observed outcomes of past trips (see the sketch below).
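In other words, the served ETA is the routing engine's physical estimate plus a learned correction. A minimal sketch of that idea - the model interface and feature names are illustrative, not Uber's:

```python
def estimate_eta(trip_features, residual_model) -> float:
    """Two-stage ETA: routing engine estimate + ML-predicted residual."""
    # Stage 1: the routing engine's estimate, from distance & live traffic.
    routing_eta = trip_features["routing_engine_eta_seconds"]

    # Stage 2: the model predicts how far off the routing engine tends to
    # be for trips like this one, learned from past observed outcomes.
    predicted_residual = residual_model.predict([trip_features["vector"]])[0]

    return routing_eta + predicted_residual
```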

This ML model is like a pet - it constantly needs to be fed and fine-tuned in order to not only stay accurate but also improve (reduce its error rate).

A real-time prediction monitoring pipeline performs an ongoing audit (live measurement) of the model’s accuracy.

It does this by joining the ML predictions with the observed outcomes from the data pipeline.

The focus of this use case is one thing - scalability.

Uber has thousands of ML models, each with hundreds of features, that collectively serve several hundreds of thousands of time series, each with millions of data points, computed per second.

Uber shared that this is the highest queries-per-second post-processing model in their stack.

There is an absurd volume and cardinality of data that goes into it.

How did they solve this?

Flink.

A very large streaming job aggregates metrics & detects prediction abnormality.

They also leverage pre-aggregated Pinot tables here for lower latency.
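Here is a minimal PyFlink sketch of the join at the heart of such a pipeline - not Uber's actual job; the two streaming tables (assumed to be Kafka-backed and registered elsewhere) and their columns are hypothetical:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Assume two streaming tables were registered elsewhere (e.g. over Kafka):
#   predictions(trip_id, model_id, predicted_eta, event_time)
#   outcomes(trip_id, actual_eta, event_time)
# An interval join pairs each prediction with the outcome later observed
# for the same trip, and emits the per-model absolute error.
t_env.execute_sql("""
    CREATE TEMPORARY VIEW prediction_error AS
    SELECT p.model_id,
           ABS(p.predicted_eta - o.actual_eta) AS abs_error,
           o.event_time
    FROM predictions AS p
    JOIN outcomes AS o
      ON p.trip_id = o.trip_id
     AND o.event_time BETWEEN p.event_time
                          AND p.event_time + INTERVAL '2' HOUR
""")

# Downstream, this view would be aggregated per model and time window,
# and checked against thresholds to flag prediction abnormalities.
```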


3. Surge Pricing

Uber has the notion of surge pricing - a key feature that raises or lowers prices according to the real-time supply & demand of riders & drivers.

It’s used to balance supply & demand through incentives (drivers earn more) and disincentives (riders pay more) so that the service remains reliable.

[Image: A real-life example of the surge pricing multiplier in action]

To know when to raise prices, the server needs to:

  • have real-time data (data freshness) - you can’t make current pricing decisions on information that’s an hour old.
  • meet a strict low-latency SLA - you can’t take 10 seconds to tell the user what the price will be.
  • be highly available - if you can’t compute a price, your product doesn’t work!

But it can afford to be eventually consistent.


How did they solve this?


Kafka → a machine-learning algo in Flink → a KV store for quick lookups
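The read path is the latency-critical part: Flink continuously writes a fresh multiplier per geographic cell, and the pricing server does a single key lookup. A minimal sketch, using Redis as a stand-in for the (unspecified) KV store - the key scheme and hostname are assumptions:

```python
import redis  # a stand-in: the article doesn't say which KV store Uber uses

kv = redis.Redis(host="surge-kv.internal", port=6379)

def surge_multiplier(geo_cell: str) -> float:
    """Low-latency read path: a single key lookup per pricing request."""
    value = kv.get(f"surge:{geo_cell}")
    # Eventual consistency is acceptable here: if no multiplier has been
    # computed for this cell yet (or it expired), fall back to no surge.
    return float(value.decode()) if value is not None else 1.0
```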


4. Internal Dashboard - Ad-Hoc Exploration

Last but not least - when there is a customer ticket, then two, then 10 more - how do you figure out what’s happening and whether the incidents are related?

Uber provides internal dashboards that support ad-hoc queries for ops personnel to understand what’s happening in the system in real-time.

The big requirement here is great flexibility - you should support ANSI SQL features like joins, group-bys, subqueries, etc.

You can afford to skimp on other aspects:

  • meh latency - no need to be super fast.
  • low scale - it’s human-driven, so the QPS is in the single digits.
  • resource usage - given the low scale, you’re typically OK with the system not being very optimized in resource usage.


How did they solve this?


Presto.

It supports joining data from completely different sources via standard SQL.
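As an illustration, here is a hedged sketch using the `prestodb` Python client - the coordinator address, catalogs, and table names are invented for the example:

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",
    port=8080,
    user="ops",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()

# One ANSI SQL query joining tables from two different sources - fully
# qualified names (catalog.schema.table) pick the backing system.
cursor.execute("""
    SELECT t.city_id,
           COUNT(*)          AS open_tickets,
           AVG(m.error_rate) AS avg_error_rate
    FROM hive.support.tickets AS t
    JOIN pinot.metrics.realtime_health AS m
      ON t.city_id = m.city_id
    WHERE t.status = 'open'
    GROUP BY t.city_id
    ORDER BY open_tickets DESC
""")

print(cursor.fetchall())
```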



These were four example categories of use cases in Uber.

Each has different requirements - some strict, some not.

As this goes to show, the requirements of any modern large business are vast and complex - and so is the underlying data infrastructure.

Good architecture is modeled in such a way that it can continue to scale and answer the demands of diverse use cases that grow as fast as the business does.

Interested in more about how Uber built their real time infra?

I learned most of this information from the following interview with Chinmay Soman and Gwen (Chen) Shapira. Check it out here:

