Efficiently Managing Ride and Late Arriving Tips Data with Incremental ETL using Apache Hudi: Step by Step Guide

Authors:

Soumil Nitin Shah

I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Lead Data Engineer, where I spend most of my time developing ingestion frameworks and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time data via Kinesis and Firehose into Elasticsearch.


Divyansh Patel

I'm a highly skilled and motivated professional with a master’s degree in computer science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enable me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.


Project Overview

The project involves building a data lakehouse architecture using Apache Hudi that holds two tables: rides and tips. The rides table contains information about rides taken on the Uber app, while the tips table holds data on tips given by customers, which are treated as late-arriving updates. The main objective of the project is to demonstrate the power of Apache Hudi and incremental ETL: using checkpointing, we pull only the changes and use them to update the fact table. The fact table holds business values such as driver earnings on a month-over-month or year-over-year basis, as well as the tips drivers are receiving, and it can be joined with the date dimension to analyze trends in fares and tips over a month or a year.


Introduction

In today's world, data is the new oil, and managing it efficiently is essential for any organization's growth. Uber, the world's largest ride-hailing platform, generates vast amounts of data every day, and managing it can be challenging. However, with the help of Lake House Architecture and Incremental ETL using Apache Hudi, Uber has been able to manage their data efficiently and provide near-real-time insights into their business operations.

Advantages of Incremental ETL

Traditional ETL processes involve extracting data from various sources, transforming it into a desired format, and loading it into a destination system, such as a data warehouse. However, this process can be time-consuming, resource-intensive, and can lead to data inconsistencies. Incremental ETL, also known as delta ETL, is a more efficient approach that extracts and transforms only the new or changed data since the last ETL run. This approach has several advantages, including:

  1. Reduced processing time: Since incremental ETL processes only the new or changed data, it significantly reduces the processing time compared to traditional ETL processes.
  2. Near-real-time data processing: Incremental ETL enables near-real-time data processing, allowing organizations to access the latest data and make informed decisions quickly.
  3. Lower resource consumption: Since incremental ETL processes only the new or changed data, it requires fewer resources and reduces the cost of processing data.

Uber's Case Study on Incremental ETL

Uber has a massive amount of data generated from its ride-hailing platform, which requires a robust data management system. To manage this data, Uber adopted a Lake House Architecture and used incremental ETL to process it in near real time. They used Apache Hudi to create an efficient data pipeline that processes only the new data generated since the last run.

For example, Uber used incremental ETL to process their Rides and Tips data. The Rides data records every ride taken on Uber's platform, including information about the driver, rider, route, fare, and other relevant data. The Tips data records every tip paid by riders to drivers. Uber used Hudi to ingest these data sources into their data lake and created a pipeline to process only the new data generated since the last run. This approach helped Uber to reduce the time required to process data and provided near-real-time insights into their business operations.


Video Guide

https://youtu.be/vimesCse7zE


Step by Step Guide

Step 1: Define Imports


Step 2: Define Spark Session

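The import and session-setup screenshots did not survive, so here is a minimal sketch of a Hudi-enabled Spark session covering Steps 1 and 2. The Hudi bundle version, app name, and paths are assumptions, not the author's exact setup; only the config keys are Hudi's documented Spark settings. The pyspark import is deferred into the function so the configuration dict stands on its own.

```python
# Spark configs commonly required for Apache Hudi; the bundle
# coordinates below are an assumed (Spark 3.3 / Hudi 0.13.0) pairing.
HUDI_BUNDLE = "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0"

SPARK_CONFS = {
    "spark.jars.packages": HUDI_BUNDLE,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}

def create_spark_session(app_name="hudi-incremental-etl"):
    # Imported lazily so the config above can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONFS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```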

Step 3: Data generator

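The data-generator screenshot is missing; the sketch below shows one way to fabricate ride and tip records with the standard library. All field names and value ranges here are illustrative assumptions, not the author's exact generator:

```python
import random
import uuid
from datetime import datetime, timedelta

def generate_ride():
    """Fabricate one ride record (field names and ranges are assumptions)."""
    ride_date = datetime(2023, 4, 1) + timedelta(days=random.randint(0, 30))
    return {
        "ride_id": str(uuid.uuid4()),
        "driver_id": f"driver_{random.randint(1, 50)}",
        "rider_id": f"rider_{random.randint(1, 500)}",
        "fare": round(random.uniform(5.0, 80.0), 2),
        "ride_date": ride_date.strftime("%Y-%m-%d"),
        "ts": datetime.utcnow().isoformat(),  # precombine field for Hudi
    }

def generate_tip(ride_id):
    """Fabricate a tip for an existing ride; tips may arrive much later."""
    return {
        "ride_id": ride_id,
        "tip_amount": round(random.uniform(0.0, 15.0), 2),
        "ts": datetime.utcnow().isoformat(),
    }
```

A Spark DataFrame can then be built from these records, e.g. `spark.createDataFrame([generate_ride() for _ in range(100)])`.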

Step 4: Define Method to Perform Upsert into Hudi tables

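The upsert-method screenshot is missing. Below is a hedged sketch of such a method: the option names are Hudi's documented write configs, while the helper names and the `base_path/table_name` layout are assumptions for this example.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field=""):
    """Build standard Hudi write options for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }

def upsert_hudi_table(df, table_name, record_key, precombine_field, base_path):
    """Upsert a Spark DataFrame into the Hudi table at base_path/table_name."""
    (
        df.write.format("hudi")
        .options(**hudi_upsert_options(table_name, record_key, precombine_field))
        .mode("append")  # mode "append" + operation "upsert" performs the upsert
        .save(f"{base_path}/{table_name}")
    )
```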

Step 5: Sample preview of datasets

Sample preview for rides dataset


Sample Preview for Tips Dataset


Sample Preview for driver dimension


Preview for Date Dimension


Creating Hudi tables for tips, rides, and the driver and date dimensions

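Since the table-creation screenshot is missing, here is a hedged sketch of the idea: each table gets its own record key and precombine field (the choices below are assumptions based on the schemas previewed above), and each DataFrame is written once with upsert semantics.

```python
# Record key and precombine field per table (assumed from the previews).
TABLE_KEYS = {
    "rides":      ("ride_id", "ts"),
    "tips":       ("ride_id", "ts"),
    "driver_dim": ("driver_id", "ts"),
    "date_dim":   ("date_key", "ts"),
}

def write_options(table):
    """Hudi upsert options for one of the four tables above."""
    record_key, precombine = TABLE_KEYS[table]
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": "upsert",
    }

# With a live SparkSession (sketch; `frames` maps table name -> DataFrame,
# and the base path is hypothetical):
# for table, df in frames.items():
#     df.write.format("hudi").options(**write_options(table)) \
#       .mode("append").save(f"file:///tmp/hudi/{table}")
```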

We need a dashboard that shows driver earnings including tips, which can arrive as late updates



We can use incremental ETL to pull changes from the rides and tips tables and continuously update the fact table. Because tips can arrive later as late-arriving updates, we left join rides with tips and default the tip value to 0 where no tip is present yet. When a tip later arrives for an existing ride, we simply update the tip in the fact table.

If we have both a ride and its tip, we update both the tip and the base fare in the fact table and calculate the total earnings. If we only have the ride but no tip, we default the tip value to 0, which can be updated later.


Python Utility Helper class for Incremental Data Extraction

https://github.com/soumilshah1995/An-easy-to-use-Python-utility-class-for-accessing-incremental-data-from-Hudi-Data-Lakes


This is roughly 300 lines of Python code that helps pull data incrementally with checkpointing.

Let's create an instance of the helper utility class


Performing an incremental pull by calling the read() method

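The utility's exact API lives in the linked repository; since the screenshots are missing, here is an independent minimal sketch of the same idea: persist the last consumed Hudi commit time in a checkpoint file and issue an incremental query that starts after it. The class name, file layout, and method names are assumptions; only the `hoodie.datasource.*` keys are Hudi's documented incremental-query options.

```python
import json
import os

class HudiIncrementalReader:
    """Minimal incremental puller: remembers the last consumed Hudi commit
    time and builds read options that fetch only newer commits."""

    def __init__(self, table_path, checkpoint_file):
        self.table_path = table_path
        self.checkpoint_file = checkpoint_file

    def _last_checkpoint(self):
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)["last_commit"]
        return "000"  # Hudi convention: start from the first commit

    def read_options(self):
        return {
            "hoodie.datasource.query.type": "incremental",
            "hoodie.datasource.read.begin.instanttime": self._last_checkpoint(),
        }

    def save_checkpoint(self, commit_time):
        with open(self.checkpoint_file, "w") as f:
            json.dump({"last_commit": commit_time}, f)

    def read(self, spark):
        """Incremental pull; after reading, the caller should persist the
        newest _hoodie_commit_time from the result via save_checkpoint()."""
        return (
            spark.read.format("hudi")
            .options(**self.read_options())
            .load(self.table_path)
        )
```

Saving `max(_hoodie_commit_time)` after each pull is what lets the next read() resume from the checkpoint instead of scanning the full table.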

Let's create a temp view in Spark


Let's prepare the Spark SQL query that returns a Spark DataFrame we can insert into the fact table


This code joins the incremental rides data with the incremental tips data; if no tip is present for a ride, it defaults the tip value to 0, since the tip may arrive later as a late-arriving update, at which point we update the fact table.

The above code is written in Spark SQL, and it creates a DataFrame called "earning_fact_df" by selecting several columns from the "rides" table and joining it with the "tips" table on the "ride_id" column. The query uses the "LEFT JOIN" clause to include all the rows from the "rides" table, even if there are no matching rows in the "tips" table.

The query also uses the "COALESCE" function to replace any NULL values in the "tip_amount" column with 0. It then calculates the "total_amount" column by adding the "fare" and "tip_amount" columns.

Finally, the query joins the resulting DataFrame with the "date_dim" table on the "date_key" column to get the "date" column using the "to_date" function. The resulting DataFrame has columns for driver_id, ride_id, fare, tip_amount, total_amount, and earning_date_key.
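The query screenshot is missing, so here is a reconstruction of the SQL from the description above. Column names not stated in the text (`ride_date` on rides, `date` on date_dim) are assumptions:

```python
# Reconstructed from the description: LEFT JOIN tips, COALESCE tip to 0,
# total = fare + tip, then join date_dim via to_date to get earning_date_key.
EARNING_FACT_SQL = """
SELECT
    r.driver_id,
    r.ride_id,
    r.fare,
    COALESCE(t.tip_amount, 0) AS tip_amount,
    r.fare + COALESCE(t.tip_amount, 0) AS total_amount,
    d.date_key AS earning_date_key
FROM rides r
LEFT JOIN tips t
    ON r.ride_id = t.ride_id
JOIN date_dim d
    ON to_date(r.ride_date) = d.date
"""
# With a live SparkSession and the temp views registered:
# earning_fact_df = spark.sql(EARNING_FACT_SQL)
```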


Upserting the fact data into the Hudi fact table


Say you get a late-arriving tip. This is very common: people often add a tip later, even the next day.

Late Arriving Tip


Let's see how we can handle this in an incremental fashion.

First, let's upsert this late-arriving tip into the tips Hudi table

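As a hedged sketch (the record values and path are hypothetical), the late tip is just another upsert into the tips table; Hudi matches it to the existing row by record key:

```python
# A late-arriving tip for a ride that already exists in the rides table.
late_tip = {
    "ride_id": "a-ride-id-that-already-exists",  # hypothetical key
    "tip_amount": 5.0,
    "ts": "2023-04-12T09:30:00",
}

tips_upsert_options = {
    "hoodie.table.name": "tips",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# With a live SparkSession (path is hypothetical):
# spark.createDataFrame([late_tip]).write.format("hudi") \
#     .options(**tips_upsert_options).mode("append").save("file:///tmp/hudi/tips")
```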

Calling Incremental puller


As you can see, we have resumed from the checkpoint, so we fetch only the newly added tip instead of doing a full table scan


Perfect

Creating a snapshot so I can perform an update on the fact table


Creating the Spark SQL to update the tip on the fact table

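The screenshot with this query is missing; below is a sketch consistent with the surrounding steps. The temp-view names `fact_snapshot` (the snapshot created above) and `tips_increment` (the incremental tips pull) are assumptions:

```python
# Join the fact snapshot with the incremental tips, recompute the totals,
# and upsert the result; Hudi replaces the old row by record key (ride_id).
UPDATE_FACT_SQL = """
SELECT
    f.driver_id,
    f.ride_id,
    f.fare,
    t.tip_amount,
    f.fare + t.tip_amount AS total_amount,
    f.earning_date_key
FROM fact_snapshot f
JOIN tips_increment t
    ON f.ride_id = t.ride_id
"""
# With a live SparkSession and both temp views registered:
# updated_fact_df = spark.sql(UPDATE_FACT_SQL)
```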

Now upserting the new update into the Hudi fact table




The complete notebook can be found at

https://soumilshah1995.blogspot.com/2023/04/efficiently-managing-ride-and-late.html


Final Result



The fact table captures the earnings of drivers based on their rides and tips. By joining the fact table with the "drivers" table, the business can identify the highest-earning drivers, their average rating, and the total number of rides. This information can help in identifying top-performing drivers, incentivizing them, and improving the overall driver experience.

The fact table also captures the earnings based on the date of the ride. By joining the fact table with the "date_dim" table, the business can identify trends in earnings based on the year, month, quarter, and weekday. This information can help in identifying the most profitable time periods for the business and optimizing the supply of drivers during those periods.

Moreover, the fact table captures the fare, tip, and total amounts for each ride. By analyzing these values, the business can identify trends in customer behavior, preferences, and satisfaction. This information can help in improving the quality of service and customer experience, and ultimately, increase the revenue for the business.

In summary, the Star Schema with the fact table "driver_earnings" and related dimension tables can provide valuable insights for the business to optimize driver supply, improve driver experience, identify profitable time periods, and improve the overall customer experience.

Special Thanks

We would like to express our sincere gratitude to Uber for their insightful article on Uber's Lakehouse Architecture, which we found to be informative and engaging. The article sheds light on the innovative approach Uber has taken to handle their vast amounts of data, providing valuable insights into the world of big data and the challenges that organizations face in managing it effectively.

We greatly appreciate Uber's willingness to share their knowledge and experience in this field, which will undoubtedly benefit professionals and organizations working in data management and analytics. We commend Uber for their dedication to staying at the forefront of technology and innovation, and for their commitment to providing high-quality services to their customers.

Once again, we would like to extend our special thanks to Uber for publishing such a valuable resource, and we look forward to learning more from their future contributions in the field.

Link: https://www.uber.com/blog/ubers-lakehouse-architecture/
