Design of Real Time Analytics for 69x Price Hike in Uber !

We'll explore the engineering design philosophy behind features like:

  • The 69x surge pricing while booking an Uber during rainfall.

  • The YouTube / LinkedIn content analytics feature. Basically, user-facing analytics.

Learning Outcomes From this article:

  • What / Why / How of Real Time Analytics Systems
  • Concept of Data Warehouses vs Real Time OLAP Data Stores
  • Apache Pinot vs Druid vs Snowflake / BigQuery
  • Utilizing GPUs for Optimizing Query Processing Latencies

What / Why / How of Real Time Analytics Systems?

Businesses like Uber, LinkedIn, and Dream11 need real-time, low-latency insights to run their businesses.

Technically speaking, they can't get away with simple batch processing systems.

What the hell is a batch processing system?

When you process your data from a datastore at a regular cadence of, say, an hour / day / week / month, then you're essentially processing the DATA in BATCHES.

Technologies like Hadoop (HDFS + MapReduce) provide excellent capabilities to parallelise your queries on these data sets.

A disadvantage of such a system:

You have to wait for all your data to land in storage, and ONLY then do you QUERY the data, at the boundary of your hour/day/week/month.
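
To make that concrete, here's a minimal Python sketch of such a batch job; the file paths and schema are hypothetical, and the point is simply that fresh numbers only appear at the day boundary.

    # A batch job over hourly files: it can only run AFTER the whole day's
    # data has landed, so the freshest insight is always up to a day old.
    import csv
    from collections import Counter
    from glob import glob

    def daily_ride_counts(day: str) -> Counter:
        """Aggregate ride counts per city from all hourly files of one day."""
        counts = Counter()
        for path in glob(f"events/{day}/*.csv"):    # e.g. events/2023-01-15/hour=07.csv
            with open(path, newline="") as f:
                for row in csv.DictReader(f):       # assumed columns: city, rides
                    counts[row["city"]] += int(row["rides"])
        return counts

    if __name__ == "__main__":
        print(daily_ride_counts("2023-01-15"))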

These days, many businesses need to provide their users with low-latency insights into their analytics in real time. Hence they need to process CONTINUOUS streams of data.

Example: LinkedIn / YouTube showcasing the analytics for your content in real time, such as insights into the users who interacted with your content: their profession, age, location etc.

Existing HDFS-like systems don't provide low-latency querying as a feature and hence are not suitable for this use case.
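
For contrast, here's a rough Python sketch of the streaming approach, assuming a hypothetical Kafka topic and event format and the kafka-python client; the counters update the moment an event arrives instead of at an hour/day boundary.

    import json
    from collections import Counter
    from kafka import KafkaConsumer   # pip install kafka-python

    consumer = KafkaConsumer(
        "post-interactions",                     # illustrative topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    views_by_profession = Counter()
    for message in consumer:                     # runs continuously, event by event
        event = message.value                    # e.g. {"post_id": 42, "profession": "SWE"}
        views_by_profession[event["profession"]] += 1
        # The insight is updated as soon as the event arrives -- no waiting
        # for a batch boundary, which is what "real time" buys you over batch.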

Concept of Data Warehouses vs Real Time OLAP Data Stores

You might wonder: why shouldn't we use something like Snowflake or Google's BigQuery or some other data warehouse product?

The answer is: these systems work well, really well, when you have ONLY internal users running analytical queries on your data set.

Analytical query example: "How MANY people liked my LinkedIn post?"
Transactional query example: "Get me a LIST of all people who liked my LinkedIn post."
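
Sketched as SQL over a made-up table post_reactions(post_id, member_id, member_name, reacted_at), the difference looks like this:

    # The analytical query aggregates a single column; the transactional-style
    # query returns whole rows to the caller.
    ANALYTICAL_QUERY = """
        SELECT COUNT(*)                              -- one aggregate value
        FROM post_reactions
        WHERE post_id = 42
    """

    TRANSACTIONAL_QUERY = """
        SELECT member_id, member_name, reacted_at    -- full rows back to the caller
        FROM post_reactions
        WHERE post_id = 42
    """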

For an analytical query, you naturally want the storage model to be column-oriented on disk, as opposed to row-based storage, because of the locality of columnar values on disk.
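
Here's a toy Python illustration of that locality argument; it only models the idea, not how a real column store lays out bytes.

    # Row-oriented: every row carries every column, so summing likes still drags
    # whole records through memory. Column-oriented: the "likes" values sit in
    # one contiguous array, so the scan touches only the bytes it needs.
    rows = [
        {"post_id": 1, "author": "nikhil", "likes": 10, "text": "..."},
        {"post_id": 2, "author": "nikhil", "likes": 25, "text": "..."},
    ]

    columns = {
        "post_id": [1, 2],
        "author":  ["nikhil", "nikhil"],
        "likes":   [10, 25],     # contiguous values -> great cache locality
        "text":    ["...", "..."],
    }

    total_likes_row_store    = sum(r["likes"] for r in rows)   # reads every field of every row
    total_likes_column_store = sum(columns["likes"])           # reads only the likes column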

However, Snowflake and Google's BigQuery, which do give you columnar storage, are still not the best fit here!

Why?

Because these systems are built for INTERNAL ANALYTICS use cases.

Internal vs External Analytics Workloads

Internal:

The max number of concurrent requests for analytical workloads in an internal use case is bounded by the number of queries run by a limited set of employees in your company.

So you don't need to serve disproportionately high QPS on your datastore.

External:

However, when you expose an analytical workload as a feature to external users, like the LinkedIn or YouTube analytics feature, you potentially have 800+ million users issuing concurrent analytical queries.

So you need to design a system that works really well for such a scale of QPS !!
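
A back-of-envelope comparison makes the gap obvious; the numbers below are purely illustrative assumptions, not LinkedIn's real traffic.

    internal_analysts  = 200                       # employees running dashboards
    internal_qps       = internal_analysts / 60    # ~3 QPS if each fires one query per minute

    external_users     = 800_000_000               # members who *could* open the analytics tab
    daily_active_share = 0.01                      # assume 1% check their analytics on a given day
    queries_per_visit  = 5
    external_qps       = external_users * daily_active_share * queries_per_visit / 86_400

    print(f"internal ~ {internal_qps:.0f} QPS, external ~ {external_qps:,.0f} QPS")
    # internal ~ 3 QPS, external ~ 463 QPS on average -- and real peaks are far spikier.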

Enter Apache Pinot!

How does Apache Pinot deliver low-latency (millisecond-level) results for SQL-like queries?

Any analytical SQL query can be dissected into 3 sub-sections:

  • Filtering (WHERE clauses)
  • Scanning the DB (SELECT)
  • Aggregating (GROUP BYs)

Let's avoid JOINs for now ;)
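
Here's an illustrative query (table and column names are made up), annotated with those three sub-sections:

    QUERY = """
        SELECT   viewer_profession,        -- 2) scanning: which columns we read
                 COUNT(*) AS views
        FROM     post_views
        WHERE    post_id = 42              -- 1) filtering: which rows qualify
          AND    viewed_at > 1700000000
        GROUP BY viewer_profession         -- 3) aggregating: roll the qualifying rows up
    """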

For optimising these parts of the query, Apache Pinot builds innovative indexing strategies.

These indexing strategies optimise the data access patterns while reading data from disks.

They range from familiar, general-purpose ones to Pinot-specific ones:

  • inverted index
  • sorted index
  • range index
  • JSON index
  • StarTree index

The query processing layer is then able to generate a per-segment query plan to leverage such indexes for segment-level processing, which in turn accelerates the overall query performance!
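
As a mental model (a pure-Python sketch, not Pinot's actual implementation), an inverted index lets the filtering step jump straight to matching rows instead of scanning the whole segment:

    from collections import defaultdict

    rows = [
        {"post_id": 42, "profession": "SWE"},
        {"post_id": 7,  "profession": "PM"},
        {"post_id": 42, "profession": "PM"},
    ]

    # Build once: (column, value) -> set of row ids containing that value.
    inverted = defaultdict(set)
    for row_id, row in enumerate(rows):
        inverted[("post_id", row["post_id"])].add(row_id)

    # Query time: "WHERE post_id = 42" is a direct lookup, not a full scan.
    matching_row_ids = inverted[("post_id", 42)]                 # {0, 2}
    result = [rows[i]["profession"] for i in matching_row_ids]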

Diagram showing the hierarchical arrangement of a StarTree index in Apache Pinot

The above describes how the StarTree index hierarchically maintains pre-computed aggregates for different column combinations to serve ultra-low-latency queries.
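
Here's a rough Python model of that intuition: pre-compute aggregates for combinations of dimension values so a query becomes a lookup. (The real StarTree index is built per segment with configurable split dimensions and thresholds; this only shows the idea.)

    from collections import defaultdict
    from itertools import product

    events = [
        {"country": "IN", "profession": "SWE", "views": 3},
        {"country": "IN", "profession": "PM",  "views": 1},
        {"country": "US", "profession": "SWE", "views": 2},
    ]

    STAR = "*"   # the "any value" node, like the star node in a StarTree
    precomputed = defaultdict(int)
    for e in events:
        for country, profession in product([e["country"], STAR], [e["profession"], STAR]):
            precomputed[(country, profession)] += e["views"]

    # Query time: "views for country = IN across ALL professions" is a dict lookup,
    # not a scan followed by a group-by.
    print(precomputed[("IN", STAR)])   # -> 4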

Why is this not possible in Data Warehouses?

Data warehouses are designed for you to run ad-hoc long- or short-running queries. Hence, building so many indexes would not cover all permutations of use cases that may arise.

Whereas in a user-facing analytics application there is a limited number of query patterns, and hence indexing works like a charm!

Drawbacks of Apache Pinot

Systems like Snowflake give you the ability to separate compute and storage. This means you can keep your data in the warehouse and pay for compute ONLY when you USE the compute engine; otherwise you just pay for storage!

Whereas in Apache Pinot, you can't afford to bring up compute only when a query is executed, since ANY external user of your platform could request query results at any moment. So compute and storage need to co-exist at all times! Hence more paisa pulled from your pocket.

As a result, you pay for BOTH compute and storage, regardless of whether you're using the query capability or not.

Also, since this technology relies on building so many optimised indexes, it's not useful for ad-hoc queries with bizarre JOINs etc.

Apache Druid vs Apache Pinot

Druid is an excellent real-time analytics tool, but it's not optimised for external-facing analytics.

As a result, it has a limited set of indexes (mainly the inverted index), which doesn't give you the ultra-low-latency queries that Pinot does!

Utilizing GPUs for Optimizing Query Processing Latencies

To render realistic views of images at a high frame rate (especially charts like the ones in LinkedIn analytics), GPUs process a massive number of geometries and pixels in parallel at high speed.

While the clock-rate increase for processing units has plateaued over the past few years, the number of transistors on a chip has kept increasing per Moore's law.

As a result, GPU computation speeds, measured in GFLOP/s (billions of floating-point operations per second), are rapidly increasing.

Hence, Uber experimented and built AresDB: Uber's GPU-powered, open-source, real-time analytics engine.
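
Here's a minimal sketch of the GPU-parallel idea (not AresDB's actual code), assuming CuPy and a CUDA-capable GPU are available: the same filter-plus-sum runs across millions of values in parallel on the device.

    import cupy as cp   # pip install cupy; requires a CUDA-capable GPU

    # Toy fare data: a few million values pushed to GPU memory.
    fares      = cp.asarray([120.0, 850.0, 95.0, 430.0] * 1_000_000)
    timestamps = cp.arange(fares.size)

    mask          = timestamps > fares.size // 2     # "WHERE" as a vectorized predicate
    surge_revenue = float(cp.sum(fares[mask]))       # "SUM(...)" as a parallel reduction
    print(surge_revenue)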

Read more about it on the Uber Engineering blog.

Conclusion:

Thank you very much for making it this far.

Hope you learnt something new about real-time analytics engineering philosophy.

Please comment if you liked my article :) :) :)

Godspeed!!
