
Design of Real Time Analytics for 69x Price Hike in Uber !

Nikhil Srivastava

Senior Software Engineer at Confluent

Published: September 18, 2022


We'll explore the engineering design philosophy behind features like:

The 69x surge pricing when booking an Uber during rainfall.

YouTube / LinkedIn content analytics, i.e. user-facing analytics.

Learning outcomes from this article:

What / Why / How of Real Time Analytics Systems?

Concept of Data warehouses vs Real Time OLAP Data Stores

Apache Pinot vs Druid vs Snowflake, BigQuery

Utilizing GPU for Optimizing Query Processing Latencies

What / Why / How of Real Time Analytics Systems?

Businesses like Uber, LinkedIn and Dream11 need real-time, low-latency insights to run their businesses.

Technically speaking, they can't get away with simple batch processing systems.

What the hell is a batch processing system?

When you process data from a datastore at a regular cadence (say every hour, day, week or month), you're essentially processing the DATA in BATCHES.

Technologies like Hadoop pair HDFS storage with MapReduce to parallelise your queries on these data sets.

A disadvantage of such a system:

You have to wait for all your data to land in storage, and ONLY then can you QUERY it, at the boundary of your hour/day/week/month.
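To make the difference concrete, here is a minimal sketch in Python. The event shape, the `batch_counts` function and the `StreamingCounter` class are illustrative assumptions, not part of any real system. The batch version can only answer once a whole window has landed, while the streaming version updates its answer on every event.

```python
from collections import Counter, defaultdict

# Hypothetical ride-request events: (minute_of_day, area).
events = [(1, "airport"), (2, "downtown"), (61, "airport"), (62, "airport")]


def batch_counts(events, window_minutes=60):
    """Batch style: group events into fixed windows and count per window.

    The counts for a window are only meaningful once every event of that
    window has been stored, i.e. at the window boundary.
    """
    windows = defaultdict(Counter)
    for minute, area in events:
        windows[minute // window_minutes][area] += 1
    return {window: dict(counts) for window, counts in windows.items()}


class StreamingCounter:
    """Streaming style: keep a running count that is updated per event."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, area):
        # The insight is available immediately after each event arrives.
        self.counts[area] += 1
        return self.counts[area]


hourly = batch_counts(events)

stream = StreamingCounter()
running = [stream.on_event(area) for _, area in events]
```

In this sketch `hourly` is keyed by window number, whereas `running` holds the up-to-date count for the affected area after each individual event.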

These days, many businesses need to give their users low-latency insights into their analytics in real time. Hence they need to process CONTINUOUS streams of data.

Example: LinkedIn / YouTube showing the analytics for your content in real time, such as insights into the users who interacted with your content: their profession, age, location etc.

Existing HDFS-like systems don't provide low-latency querying as a feature and hence are not suitable for this use case.

Concept of Data warehouses vs Real Time OLAP Data Stores

You might wonder: why shouldn't we use something like Snowflake, Google's BigQuery or some other data warehouse product?

The answer is: these systems work well, really well, when you have ONLY internal users running analytical queries on your data set.

Analytical Query Example: "How MANY people liked my LinkedIn post"

Transactional Query Example: "Get me the LIST of all people who liked my LinkedIn post"

In an analytical query, you naturally want the storage model to be column-oriented on disk rather than row-based, because the values of a column sit next to each other on disk and an aggregate can read just the columns it needs.
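The two query examples and the two storage layouts can be sketched in a few lines of Python. The table, its column names and the helper functions are hypothetical, and plain in-memory lists stand in for on-disk layout.

```python
# Hypothetical "post_likes" table in row-oriented form: one record per like.
rows = [
    {"post_id": 1, "user": "asha", "profession": "engineer"},
    {"post_id": 1, "user": "ravi", "profession": "designer"},
    {"post_id": 2, "user": "meera", "profession": "engineer"},
]

# The same table in column-oriented form: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def count_likes_row_store(rows, post_id):
    """Analytical query on a row store: every full record is touched."""
    return sum(1 for row in rows if row["post_id"] == post_id)


def count_likes_column_store(columns, post_id):
    """Analytical query on a column store: only the post_id column is read."""
    return columns["post_id"].count(post_id)


def list_likers_row_store(rows, post_id):
    """Transactional-style query: return the matching records' users."""
    return [row["user"] for row in rows if row["post_id"] == post_id]


likes_via_rows = count_likes_row_store(rows, 1)
likes_via_columns = count_likes_column_store(columns, 1)
likers = list_likers_row_store(rows, 1)
```

Both counting functions return the same number; the point of the sketch is that the columnar version never looks at `user` or `profession` to answer "how many".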

However, Snowflake and Google's BigQuery, which do give you columnar storage, are still not the best here !!

Why?

Because these systems work well for INTERNAL ANALYTICS use cases.

Internal vs External Analytics Workloads

Internal:

The maximum number of concurrent analytical queries in an internal use case is bounded by the limited number of employees in your company who run them.

So you don't need to serve disproportionately high QPS on your datastore.

External:

However, when you expose an analytical workload as a feature to external users, like the LinkedIn or YouTube analytics feature, any of your 800+ million users can potentially issue analytical queries at the same time.

So you need to design a system that works really well at such a scale of QPS !!

Enter Apache Pinot !

How does Apache Pinot deliver low-latency (millisecond-scale) results for SQL-like queries?

Any analytical SQL query can be dissected into 3 sub-sections:

Filtering (WHERE clauses)
Scanning the DB (SELECT)
Aggregating (GROUP BYs)

Let's avoid JOINs for now ;)
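The three stages can be sketched as a tiny pipeline in Python. The `trips` table and the `run_query` function are made up for illustration; a real engine does each stage over indexed, columnar segments rather than a list of dicts.

```python
from collections import defaultdict

# Hypothetical trips table.
trips = [
    {"city": "Bengaluru", "weather": "rain", "fare": 300},
    {"city": "Bengaluru", "weather": "clear", "fare": 120},
    {"city": "Mumbai", "weather": "rain", "fare": 250},
    {"city": "Bengaluru", "weather": "rain", "fare": 500},
]


def run_query(rows):
    """SELECT city, SUM(fare) FROM trips WHERE weather = 'rain' GROUP BY city"""
    # 1. Filtering (WHERE): keep only the rows that match the predicate.
    matching = (row for row in rows if row["weather"] == "rain")

    # 2. Scanning (SELECT): read just the columns the query asks for.
    projected = ((row["city"], row["fare"]) for row in matching)

    # 3. Aggregating (GROUP BY): fold the projected values per group key.
    totals = defaultdict(int)
    for city, fare in projected:
        totals[city] += fare
    return dict(totals)


rain_fares = run_query(trips)
```

Each stage is a candidate for its own optimisation: an index can shrink the filtering work, columnar storage the scanning work, and pre-aggregation the aggregating work.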

To optimise these parts of the query, Apache Pinot builds innovative indexing strategies.

These indexing strategies optimise the data access patterns while reading data from disk.

They range from the familiar general-purpose ones to more specialised ones, such as:

inverted index
sorted index
range index
StarTree index
JSON index

The query processing layer is then able to generate a per-segment query plan that leverages such indexes for segment-level processing, which in turn accelerates the overall query performance !!
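As a rough intuition for two of the general-purpose indexes, here is a minimal Python sketch. It is a toy model under stated assumptions, not Pinot's implementation: `build_inverted_index` and `sorted_range` are hypothetical helpers, and document ids are simply list positions.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical column of values; the list position acts as the document id.
cities = ["Bengaluru", "Mumbai", "Bengaluru", "Delhi", "Mumbai"]


def build_inverted_index(values):
    """Inverted index: map each distinct value to the doc ids containing it."""
    index = defaultdict(list)
    for doc_id, value in enumerate(values):
        index[value].append(doc_id)
    return dict(index)


def sorted_range(sorted_values, low, high):
    """Sorted index: binary-search the [start, end) doc id range
    whose values satisfy low <= value <= high."""
    return bisect_left(sorted_values, low), bisect_right(sorted_values, high)


inverted = build_inverted_index(cities)
# WHERE city = 'Mumbai' becomes a dictionary lookup instead of a full scan.
mumbai_docs = inverted["Mumbai"]

# Hypothetical column that is physically sorted by fare.
fares_sorted = [100, 120, 250, 300, 500]
# WHERE fare BETWEEN 120 AND 300 becomes two binary searches.
start, end = sorted_range(fares_sorted, 120, 300)
```

In both cases the filter is answered without touching rows that cannot match, which is where the latency saving comes from.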

[Diagram: the hierarchical arrangement of a StarTree index in Apache Pinot]

The diagram above shows how a StarTree index hierarchically maintains pre-computed aggregates for different column combinations to serve ultra-low-latency queries.
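The pre-computation idea can be sketched in Python as follows. This is a deliberate simplification: it materialises every combination in a flat dictionary, whereas a real StarTree index organises them as a tree and bounds how much it pre-computes. The records, the `STAR` marker and both helper functions are assumptions made for illustration.

```python
from itertools import product

STAR = "*"  # stands for "any value" of a dimension

# Hypothetical content-analytics records.
records = [
    {"country": "IN", "browser": "chrome", "impressions": 10},
    {"country": "IN", "browser": "safari", "impressions": 5},
    {"country": "US", "browser": "chrome", "impressions": 7},
]


def build_preaggregates(records, dimensions, metric):
    """Pre-compute SUM(metric) for every combination of dimension values,
    where each dimension is either a concrete value or rolled up to STAR."""
    table = {}
    for record in records:
        choices = [(record[dim], STAR) for dim in dimensions]
        for key in product(*choices):
            table[key] = table.get(key, 0) + record[metric]
    return table


def lookup(table, dimensions, **filters):
    """Answer SUM(metric) WHERE <filters> with a single dictionary lookup."""
    key = tuple(filters.get(dim, STAR) for dim in dimensions)
    return table.get(key, 0)


dims = ["country", "browser"]
table = build_preaggregates(records, dims, "impressions")

india_total = lookup(table, dims, country="IN")
chrome_total = lookup(table, dims, browser="chrome")
grand_total = lookup(table, dims)
```

The trade-off visible even in this toy version is space for time: the table grows with the number of dimension combinations, but each supported query is a constant-time lookup instead of a scan.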

Why is this not possible in Data Warehouses?

Data warehouses are designed for ad-hoc queries, long or short running. No fixed set of indexes could cover all the permutations of queries that may arise.

Whereas in a user-facing analytics application there is a limited number of query patterns, and hence indexing works like a charm!

Drawbacks of Apache Pinot

Systems like Snowflake let you separate compute and storage. This means you can keep your data in the warehouse and pay for compute ONLY when you USE the compute engine; otherwise you just pay for storage !!

Whereas in Apache Pinot, you can't afford to bring up compute only when a query is executed, since ANY external user of your platform can request query results at any moment. So compute and storage need to co-exist at all times !! Hence more paisa pulled from your pocket.

As a result you pay for BOTH compute and storage, regardless of whether you're using the query capability or not.

Since this technology relies on indexes optimised for known query patterns, it's not a good fit for ad-hoc queries with bizarre JOINs etc.

Apache Druid vs Apache Pinot

Druid is an excellent real-time analytics tool, but it is not optimised for external-facing analytics.

As a result, it has a limited set of indexes (mainly the inverted index), which doesn't help with the ultra-low-latency queries that Pinot helps you with!

Utilizing GPU for Optimizing Query Processing Latencies

To render realistic images at a high frame rate (especially for graphs in LinkedIn analytics), GPUs process a massive number of geometries and pixels in parallel at high speed.

While the clock-rate increase for processing units has plateaued over the past few years, the number of transistors on a chip has kept increasing, per Moore's law.

As a result, GPU computation speeds, measured in GFLOP/s (billions of floating-point operations per second), are rapidly increasing.

Hence, Uber experimented and built AresDB: Uber's GPU-Powered Open Source, Real-time Analytics Engine.

Read more about it in the link below:

Conclusion:

Thank you very much for making it this far.

Hope you learnt something new about Real Time Analytics Engineering Philosophy.

Please comment if you liked my article :) :) :)

Godspeed !!

Nithin Venkatesh

Builder

2 years ago

Nice article Nikhil, checkout how Rockset provides the best real-time analytics experience at cloud scale through our blog series at https://rockset.com/blog/

Digvijaysinh Jadeja

Senior Software Engineer - Ads Team | Distributed Systems | Data Platform

2 years ago

I have an article now to send to the Dev team when they say “Oh why don’t we run our application queries directly on snowflake”.

Tanmay Bhatnagar

Senior Data Scientist at Warner Bros Discovery | Ex-McKinsey

2 years ago

God Tier newsletter sir !

Nikhil Srivastava

Senior Software Engineer at Confluent

2 years ago

Let's connect over a 1x1 here regarding SWE interviews, HLD, LLD: https://topmate.io/nikhil_srivastava

If you love my style of writing, then subscribe to my newsletter here !! https://www.dhirubhai.net/newsletters/lld-concurrency-hld-6961694694581964800/
