Efficiently Managing Ride and Late-Arriving Tips Data with Incremental ETL Using Apache Hudi: A Step-by-Step Guide
Author:
Soumil Nitin Shah
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Lead Data Engineer, where I spend most of my time developing ingestion frameworks and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time streams via Kinesis and Firehose into Elasticsearch.
Divyansh Patel
I'm a highly skilled and motivated professional with a master’s degree in computer science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enable me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.
Project Overview
The project involves building a data lakehouse architecture using Apache Hudi that holds two tables: rides and tips. The rides table contains information about rides taken on the Uber app, while the tips table holds data on tips given by customers, which are treated as late-arriving updates. The main objective of the project is to demonstrate the power of Apache Hudi and incremental ETL by using checkpointing and pulling changes to update the fact table. The fact table holds business values such as driver earnings on a month-over-month or year-over-year basis, as well as the amount of tips drivers are receiving. It can also be joined with the date dimension to analyze trends in fares and tips over a month or year.
Introduction
In today's world, data is the new oil, and managing it efficiently is essential for any organization's growth. Uber, the world's largest ride-hailing platform, generates vast amounts of data every day, and managing it can be challenging. However, with the help of Lake House Architecture and Incremental ETL using Apache Hudi, Uber has been able to manage their data efficiently and provide near-real-time insights into their business operations.
Advantages of Incremental ETL
Traditional ETL processes involve extracting data from various sources, transforming it into a desired format, and loading it into a destination system, such as a data warehouse. However, this process can be time-consuming and resource-intensive, and it can lead to data inconsistencies. Incremental ETL, also known as delta ETL, is a more efficient approach that extracts and transforms only the new or changed data since the last ETL run. This approach has several advantages, including:

- Reduced processing time and compute cost, since only the delta is processed
- Fresher data in the destination, enabling near-real-time analytics
- Less load on source systems, since repeated full table scans are avoided
- Easier recovery, since checkpoints record exactly where the last run stopped
Uber's Case Study on Incremental ETL:
Uber has a massive amount of data generated from its ride-hailing platform, which requires a robust data management system. To manage their data, Uber adopted a Lake House Architecture and used incremental ETL to process their data in near-real-time. They used Apache Hudi to create an efficient data pipeline that processes only the new data generated since the last run.
For example, Uber used incremental ETL to process their Rides and Tips data. The Rides data records every ride taken on Uber's platform, including information about the driver, rider, route, fare, and other relevant data. The Tips data records every tip paid by riders to drivers. Uber used Hudi to ingest these data sources into their data lake and created a pipeline to process only the new data generated since the last run. This approach helped Uber to reduce the time required to process data and provided near-real-time insights into their business operations.
Video Guide
https://youtu.be/vimesCse7zE
Step-by-Step Guide
Step 1: Define Imports
Step 2: Define Spark Session
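The original post showed these steps as notebook screenshots. A minimal sketch of Steps 1 and 2 might look like the following (assuming PySpark; the Hudi bundle coordinates are an assumption and must match your Spark and Hudi versions):

```python
# Hudi-specific Spark settings. The bundle version below is an
# assumption -- adjust it to your Spark/Hudi installation.
HUDI_SPARK_CONFIG = {
    "spark.jars.packages": "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}


def create_spark_session(app_name="hudi-incremental-etl"):
    """Build a SparkSession with the Hudi settings above applied."""
    # Imported inside the function so this sketch loads even where
    # Spark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in HUDI_SPARK_CONFIG.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The Kryo serializer and the Hudi SQL extension are the settings the Hudi quickstart recommends for Spark writes.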
Step 3: Data generator
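Step 3's data generator can be sketched in plain Python. The schema below is hypothetical but uses the fields this guide relies on later (ride_id, driver_id, fare, and a ts precombine field):

```python
import random
import uuid
from datetime import datetime, timedelta


def generate_rides(n, seed=42):
    """Generate n synthetic ride records (hypothetical schema)."""
    rng = random.Random(seed)
    base = datetime(2023, 4, 1)
    rides = []
    for _ in range(n):
        ride_ts = base + timedelta(minutes=rng.randint(0, 60 * 24 * 30))
        rides.append({
            "ride_id": str(uuid.uuid4()),
            "driver_id": f"driver-{rng.randint(1, 50)}",
            "rider_id": f"rider-{rng.randint(1, 500)}",
            "fare": round(rng.uniform(5.0, 80.0), 2),
            "ride_date": ride_ts.strftime("%Y-%m-%d"),
            "ts": ride_ts.isoformat(),
        })
    return rides


def generate_tips(rides, tip_probability=0.4, seed=7):
    """Generate tips for a subset of rides. Many rides get no tip,
    mirroring the late-arriving-update scenario in this guide."""
    rng = random.Random(seed)
    tips = []
    for ride in rides:
        if rng.random() < tip_probability:
            tips.append({
                "tip_id": str(uuid.uuid4()),
                "ride_id": ride["ride_id"],
                "tip_amount": round(rng.uniform(1.0, 15.0), 2),
                "ts": ride["ts"],
            })
    return tips
```

In the notebook these records would be turned into Spark DataFrames before being written to Hudi.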
Step 4: Define Method to Perform Upsert into Hudi tables
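Step 4's upsert method might look like this. The hoodie.* keys are standard Hudi write options; the function name and signature are illustrative, not the notebook's exact code:

```python
def hudi_options(table_name, record_key, precombine_field, partition_field=""):
    """Build the standard write options Hudi expects for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def upsert_hudi_table(df, table_name, record_key, precombine_field, base_path):
    """Upsert a Spark DataFrame into a Hudi table under base_path."""
    options = hudi_options(table_name, record_key, precombine_field)
    (df.write.format("hudi")
       .options(**options)
       .mode("append")  # mode "append" + operation "upsert" = Hudi upsert
       .save(f"{base_path}/{table_name}"))
```

Records sharing a record key are deduplicated by the precombine field (here a timestamp), which is exactly what lets a late-arriving tip overwrite the earlier row.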
Step 5: Sample preview of the dataset
Sample preview of the rides dataset
Sample preview of the tips dataset
Sample preview of the driver dimension
Sample preview of the date dimension
Creating Hudi tables for tips, rides, and the driver and date dimensions
We need a dashboard that shows earnings for drivers, including tips, which can arrive as late updates.
We can use incremental ETL to pull data from the rides and tips table and continuously update the fact table. In scenarios where tips are not available, we can do a left join and set the default value as 0. This is because tips can arrive later as late arriving updates. If we have tips for rides that are not available, we can simply update the tips in the fact table since they are late arriving updates.
If we have both tips and rides, we will update both the tips and the base fare in the fact table, and calculate the total earning. If we only have rides but no tips, we will default the tips value to 0, which can later be updated.
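The left-join-with-default logic described above can be illustrated in plain Python, independent of Spark (a toy sketch, not the notebook's actual code):

```python
def build_earning_facts(rides, tips):
    """Left-join rides with tips. A ride with no tip gets tip_amount 0,
    so the row can be updated later when the late-arriving tip shows up."""
    tips_by_ride = {t["ride_id"]: t["tip_amount"] for t in tips}
    facts = []
    for ride in rides:
        # Equivalent of COALESCE(tip_amount, 0) after a LEFT JOIN.
        tip_amount = tips_by_ride.get(ride["ride_id"], 0.0)
        facts.append({
            "ride_id": ride["ride_id"],
            "driver_id": ride["driver_id"],
            "fare": ride["fare"],
            "tip_amount": tip_amount,
            "total_amount": round(ride["fare"] + tip_amount, 2),
        })
    return facts
```

Every ride produces a fact row, with or without a tip, and the defaulted zeros are the rows the late-arriving updates will correct.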
Python Utility Helper class for Incremental Data Extraction
https://github.com/soumilshah1995/An-easy-to-use-Python-utility-class-for-accessing-incremental-data-from-Hudi-Data-Lakes
This utility is about 300 lines of Python code that helps me pull data incrementally with checkpoints.
Let's create an instance of the helper utility class.
Performing an incremental pull by calling the read() method
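The full utility lives in the repository linked above. Below is a stripped-down, hypothetical sketch of the checkpointing idea behind its read() method: the real class works against Hudi's commit timeline, while this toy version filters in-memory records by timestamp:

```python
import json
import os


class IncrementalPuller:
    """Toy sketch of checkpoint-based incremental extraction.
    Persists the last-seen timestamp so each run resumes where
    the previous one stopped instead of rescanning everything."""

    def __init__(self, table_name, checkpoint_dir="checkpoints"):
        os.makedirs(checkpoint_dir, exist_ok=True)
        self.checkpoint_file = os.path.join(checkpoint_dir, f"{table_name}.json")

    def _last_checkpoint(self):
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)["last_ts"]
        return ""  # empty string sorts before any ISO-8601 timestamp

    def read(self, records):
        """Return only records newer than the saved checkpoint,
        then advance the checkpoint to the newest record seen."""
        last_ts = self._last_checkpoint()
        fresh = [r for r in records if r["ts"] > last_ts]
        if fresh:
            newest = max(r["ts"] for r in fresh)
            with open(self.checkpoint_file, "w") as f:
                json.dump({"last_ts": newest}, f)
        return fresh
```

The second call to read() returns only what arrived since the first call, which is the behavior the screenshots in this section demonstrate against Hudi.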
Let's create a temp view in Spark
Let's prepare the Spark SQL query that builds the Spark DataFrame we can insert into the fact table.
This code joins the incremental rides data with the incremental tips data. If a tip is not present for a ride, it defaults the tip value to 0; when the late-arriving tip shows up later, we update the fact table.
The above code is written in Spark SQL, and it creates a DataFrame called "earning_fact_df" by selecting several columns from the "rides" table and joining it with the "tips" table on the "ride_id" column. The query uses the "LEFT JOIN" clause to include all the rows from the "rides" table, even if there are no matching rows in the "tips" table.
The query also uses the "COALESCE" function to replace any NULL values in the "tip_amount" column with 0. It then calculates the "total_amount" column by adding the "fare" and "tip_amount" columns.
Finally, the query joins the resulting DataFrame with the "date_dim" table on the "date_key" column to get the "date" column using the "to_date" function. The resulting DataFrame has columns for driver_id, ride_id, fare, tip_amount, total_amount, and earning_date_key.
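Based on that description, the query likely resembles the following reconstruction (column and table names are inferred from the text; the notebook's exact SQL may differ):

```python
# Reconstructed Spark SQL for the earnings fact: LEFT JOIN keeps every
# ride, COALESCE defaults missing tips to 0, and the date dimension
# supplies the earning_date_key.
EARNING_FACT_QUERY = """
SELECT
    r.driver_id,
    r.ride_id,
    r.fare,
    COALESCE(t.tip_amount, 0)          AS tip_amount,
    r.fare + COALESCE(t.tip_amount, 0) AS total_amount,
    d.date_key                         AS earning_date_key
FROM rides r
LEFT JOIN tips t
    ON r.ride_id = t.ride_id
JOIN date_dim d
    ON to_date(r.ride_date) = d.date
"""

# In the notebook this would be executed as:
#   earning_fact_df = spark.sql(EARNING_FACT_QUERY)
```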
Upsert the fact data into Hudi fact table
Say you receive a late-arriving tip, which is very common: riders often add a tip later, even the next day.
Late Arriving Tip
Let's see how we can handle this in an incremental fashion.
First, let's upsert this late-arriving tip into the tips Hudi table.
Calling the incremental puller
As you can see, we have resumed from the checkpoint and can now fetch just the newly added tip instead of performing a full table scan.
Perfect
Creating snapshots so I can perform an update on the fact table
Creating the Spark SQL to update the tip in the fact table
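One plausible shape for that update query (a sketch with illustrative view names such as earning_fact_snapshot and tips_incremental) joins the fact snapshot with the newly pulled tips and recomputes the total:

```python
# Sketch of the fact-table correction query: match the current fact
# rows against the incrementally pulled tips and recompute totals.
# (View names here are illustrative, not the notebook's exact names.)
UPDATE_FACT_QUERY = """
SELECT
    f.driver_id,
    f.ride_id,
    f.fare,
    t.tip_amount,
    f.fare + t.tip_amount AS total_amount,
    f.earning_date_key
FROM earning_fact_snapshot f
JOIN tips_incremental t
    ON f.ride_id = t.ride_id
"""
```

Because the fact table's record key is the ride, upserting this result replaces the old rows whose tip was defaulted to 0.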
Now upserting the new update into the Hudi fact table
The complete notebook can be found at:
https://soumilshah1995.blogspot.com/2023/04/efficiently-managing-ride-and-late.html
Final Result
The fact table captures the earnings of drivers based on their rides and tips. By joining the fact table with the "drivers" table, the business can identify the highest-earning drivers, their average rating, and the total number of rides. This information can help in identifying top-performing drivers, incentivizing them, and improving the overall driver experience.
The fact table also captures the earnings based on the date of the ride. By joining the fact table with the "date_dim" table, the business can identify trends in earnings based on the year, month, quarter, and weekday. This information can help in identifying the most profitable time periods for the business and optimizing the supply of drivers during those periods.
Moreover, the fact table captures the fare, tip, and total amounts for each ride. By analyzing these values, the business can identify trends in customer behavior, preferences, and satisfaction. This information can help in improving the quality of service and customer experience, and ultimately, increase the revenue for the business.
In summary, the Star Schema with the fact table "driver_earnings" and related dimension tables can provide valuable insights for the business to optimize driver supply, improve driver experience, identify profitable time periods, and improve the overall customer experience.
Special Thanks
We would like to express our sincere gratitude to Uber for their insightful article on Uber's Lakehouse Architecture, which we found to be informative and engaging. The article sheds light on the innovative approach Uber has taken to handle their vast amounts of data, providing valuable insights into the world of big data and the challenges that organizations face in managing it effectively.
We greatly appreciate Uber's willingness to share their knowledge and experience in this field, which will undoubtedly benefit professionals and organizations working in data management and analytics. We commend Uber for their dedication to staying at the forefront of technology and innovation, and for their commitment to providing high-quality services to their customers.
Once again, we would like to extend our special thanks to Uber for publishing such a valuable resource, and we look forward to learning more from their future contributions in the field.