Design of Real Time Analytics for 69x Price Hike in Uber !
?? We'll explore the engineering design philosophy behind features like:
?? The 69x surge-pricing while booking Uber during rainfall.
[?]?Learning Outcomes From this article:
?? What / Why / How of Real Time Analytics Systems?
?? Concept of Data warehouses vs Real Time OLAP Data Stores
?? Apache Pinot vs Druid vs Snowflake, BigQuery
?? Utilizing GPU for Optimizing Query Processing Latencies
[?] What / Why / How of Real Time Analytics Systems?
?? Businesses like Uber , LinkedIn , Dream11 - need real time low latency insights to run their businesses.
?? Technically speaking, they can't get away with simple batch processing systems
What the hell is a batch processing system?
?? When you process your data from a datastore at a regular cadence of say, an hour / day / week / month etc. then you're essentially processing the DATA in BATCHES.
?? Technologies like HDFS provide excellent Map Reduce capabilities to parallelise your queries on these data sets.
?? A disadvantage of such system:
You have to wait for all your data to land in storage. And ONLY then you QUERY the data at the boundary of your hour/day/week/month etc.
?? These days, multiple businesses need capabilities to provide their users, low latency insights into their internal analytics in real time. Hence they need to process CONTINUOUS streams of data.
?? Example: LinkedIn / YouTube showcasing the analytics for your content in real time. Like insights into the users who interacted with your content. Their profession, age, location etc.
Existing HDFS alike Systems don't provide low latency querying as a feature and hence are not suitable for this use case.
[?] Concept of Data warehouses vs Real Time OLAP Data Stores
You might wonder, why shouldn't we use something like a Snowflake or 谷歌 's BigQuery or some Data warehouse product?
?? The answer is - these systems work well, really well when you have ONLY internal users running analytical queries on your data set.
Analytical Query Example: "How MANY people liked my Linkedin post"
Transactional Query Example: "Get me LIST of all people who liked my Linkedin post"
?? In an analytical query, you naturally need the storage model to be columnar oriented on the disk as opposed to row based storage. Because of locality of columnar values on disk.
?? However, Snowflake and Google's BigQuery - which provide you columnar storages are still not the best here !!
Why
?? Because, these systems work well for INTERNAL ANALYTICS use cases.
[?] Internal vs External Analytics Workloads
?? Internal:
The max number of concurrent requests for analytical workloads in an internal use case is bounded by the number of queries run by limited employees in your company.
So you don't need to serve disproportionately high QPS on your datastore.
?? External:
However, when you need to expose an analytical workload as a feature to external users, like LinkedIn or YouTube analytics feature - then you can potentially have 800+ Millions of concurrent analytical queries.
So you need to design a system that works really well for such a scale of QPS !!
?? Enter Apache Pinot !
[?] How Apache Pinot Optimizes delivers low latency sub-millisecond results for SQL like queries?
?? Any analytical SQL query can be dissected into 3 sub-sections:
Let's avoid JOINs for now ;)
?? For optimising these parts of the query, Apache Pinot builds innovative indexing strategies.
These indexing strategies optimise the data access patterns while reading data from disks.
They range from the familiar general-purpose ones such as:
?? The query processing layer is then able to generate a?per-segment query plan to leverage such indexes for segment-level processing, which in turn accelerates the overall query performance !!
Diagram showing the hierarchical arragement of a StarTree index in Apache Pinot
The above is a description of how, StarTree Index hierarchically maintains indexes for different pre-computed column combinations to serve ultra low latency queries.
[?] Why not possible in Data Warehouses?
?? Data warehouses are designed for you to run ad-hoc long / short running queries. And hence, building so many indexes would not cover all permutations of use cases that may arise.
?? Whereas, in a user facing analytics application, there would be a limited number of use cases and hence, indexing works like a charm!
[?] Drawbacks of Apache Pinot
?? Systems like Snowflake provide you the ability to separate compute and storage. This means, that you can keep your data in the warehouse and pay for compute, ONLY when you USE the compute engine otherwise just pay for storage !!
?? Whereas in Apache Pinot, you can't afford to bring up compute only when the query is executed, since you can have ANY potential external user of your platform request the results of the queries. So compute and storage needs to co-exist at all times !! Hence more paisa pulled from your pocket.
?? As a result you pay for compute and storage BOTH, regardless if you're using the query capability or not.
?? Since, this technology builds so many optimised indexes, hence it's not useful for ad-hoc queries with bizarre JOINs etc.
[?] Apache Druid vs Apache Pinot
?? Druid is an excellent real time analytics tool but not optimised for external facing analytics.
?? As a result, has limited number of indexes (inverted index), which doesn't help with ultra low latency queries that Pinot helps you with!
[?] Utilizing GPU for Optimizing Query Processing Latencies
?? To render realistic views of images at a high frame rate, (specially for graphs in Linkedin analytics) GPUs process a massive amount of geometries and pixels in parallel at high speed.
?? While the clock-rate increase for processing units has plateaued over the past few years, the number of transistors on a chip has only increased per?Moore’s law.
?? As a result, GPU computation speeds, measured in Gigaflops per second (GFLOP/s), are rapidly increasing.
?? Hence, Uber experimented and built AresDB: Uber’s GPU-Powered Open Source, Real-time Analytics Engine
?? Read more about it in below link:
[?]?Conclusion:
?? Thank-you very much for making it thus far.
?? Hope you learnt something new about Real Time Analytics Engineering Philosophy.
???Please comment if you liked my article :) :) :)
Builder
2 年Nice article Nikhil, checkout how Rockset provides the best real-time analytics experience at cloud scale through our blog series at https://rockset.com/blog/
Senior Software Engineer - Ads Team | Distributed Systems | Data Platform
2 年I have an article now to send to the Dev team when they say “Oh why don’t we run our application queries directly on snowflake”.
Senior Data Scientist at Warner Bros Discovery | Ex-McKinsey
2 年God Tier newsletter sir !
Senior Software Engineer at Confluent
2 年? Let's connect over a 1x1 here regarding SWE interviews, HLD, LLD :?https://topmate.io/nikhil_srivastava ? If you love my style of writing, then subscribe to my newsletter here !! https://www.dhirubhai.net/newsletters/lld-concurrency-hld-6961694694581964800/