Efficiently Managing Ride and Late-Arriving Tips Data with Incremental ETL Using Apache Hudi: A Step-by-Step Guide
Author:
Soumil Nitin Shah
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I have a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Lead Data Engineer, where I spend most of my time developing ingestion frameworks and creating microservices and scalable architecture on AWS. I have worked with massive amounts of data, including creating data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time streams via Kinesis and Firehose into Elasticsearch.
Divyansh Patel
I'm a highly skilled and motivated professional with a master’s degree in computer science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enable me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.
Project Overview
The project involves building a data lakehouse architecture using Apache Hudi that holds two tables: rides and tips. The rides table contains information about rides taken on the Uber app, while the tips table holds data on tips given by customers, which are treated as late-arriving updates. The main objective of the project is to demonstrate the power of Apache Hudi and incremental ETL by using checkpointing and pulling changes to update the fact table. The fact table holds business values such as driver earnings on a month-over-month or year-over-year basis, as well as the amount of tips drivers are receiving. It can also be joined with the date dimension to analyze trends in fares and tips over a month or year.
Introduction
In today's world, data is the new oil, and managing it efficiently is essential for any organization's growth. Uber, the world's largest ride-hailing platform, generates vast amounts of data every day, and managing it can be challenging. However, with the help of Lake House Architecture and Incremental ETL using Apache Hudi, Uber has been able to manage their data efficiently and provide near-real-time insights into their business operations.
Advantages of Incremental ETL
Traditional ETL processes involve extracting data from various sources, transforming it into a desired format, and loading it into a destination system, such as a data warehouse. However, this process can be time-consuming and resource-intensive, and it can lead to data inconsistencies. Incremental ETL, also known as delta ETL, is a more efficient approach that extracts and transforms only the new or changed data since the last ETL run. This approach has several advantages, including:

- Reduced processing time and compute cost, since only the delta is processed
- Fresher data in the destination, enabling near-real-time analytics
- Less load on source systems, since repeated full table scans are avoided
- Easier recovery, since checkpoints record exactly where the last run stopped
Uber's Case Study on Incremental ETL:
Uber has a massive amount of data generated from its ride-hailing platform, which requires a robust data management system. To manage their data, Uber adopted a Lake House Architecture and used incremental ETL to process their data in near-real-time. They used Apache Hudi to create an efficient data pipeline that processes only the new data generated since the last run.
For example, Uber used incremental ETL to process their Rides and Tips data. The Rides data records every ride taken on Uber's platform, including information about the driver, rider, route, fare, and other relevant data. The Tips data records every tip paid by riders to drivers. Uber used Hudi to ingest these data sources into their data lake and created a pipeline to process only the new data generated since the last run. This approach helped Uber to reduce the time required to process data and provided near-real-time insights into their business operations.
Video Guide
https://youtu.be/vimesCse7zE
Step-by-Step Guide
Step 1: Define Imports
Step 2: Define Spark Session
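The original post showed these steps as notebook screenshots. A minimal sketch of Steps 1 and 2 might look like the following (assuming PySpark; the Hudi bundle coordinates are an assumption and must match your Spark and Hudi versions):

```python
# Hudi-specific Spark settings. The bundle version below is an
# assumption -- adjust it to your Spark/Hudi installation.
HUDI_SPARK_CONFIG = {
    "spark.jars.packages": "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}


def create_spark_session(app_name="hudi-incremental-etl"):
    """Build a SparkSession with the Hudi settings above applied."""
    # Imported inside the function so this sketch loads even where
    # Spark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in HUDI_SPARK_CONFIG.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The Kryo serializer and the Hudi SQL extension are the settings the Hudi quickstart recommends for Spark writes.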
Step 3: Data generator
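Step 3's data generator can be sketched in plain Python. The schema below is hypothetical but uses the fields this guide relies on later (ride_id, driver_id, fare, and a ts precombine field):

```python
import random
import uuid
from datetime import datetime, timedelta


def generate_rides(n, seed=42):
    """Generate n synthetic ride records (hypothetical schema)."""
    rng = random.Random(seed)
    base = datetime(2023, 4, 1)
    rides = []
    for _ in range(n):
        ride_ts = base + timedelta(minutes=rng.randint(0, 60 * 24 * 30))
        rides.append({
            "ride_id": str(uuid.uuid4()),
            "driver_id": f"driver-{rng.randint(1, 50)}",
            "rider_id": f"rider-{rng.randint(1, 500)}",
            "fare": round(rng.uniform(5.0, 80.0), 2),
            "ride_date": ride_ts.strftime("%Y-%m-%d"),
            "ts": ride_ts.isoformat(),
        })
    return rides


def generate_tips(rides, tip_probability=0.4, seed=7):
    """Generate tips for a subset of rides. Many rides get no tip,
    mirroring the late-arriving-update scenario in this guide."""
    rng = random.Random(seed)
    tips = []
    for ride in rides:
        if rng.random() < tip_probability:
            tips.append({
                "tip_id": str(uuid.uuid4()),
                "ride_id": ride["ride_id"],
                "tip_amount": round(rng.uniform(1.0, 15.0), 2),
                "ts": ride["ts"],
            })
    return tips
```

In the notebook these records would be turned into Spark DataFrames before being written to Hudi.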
Step 4: Define Method to Perform Upsert into Hudi tables
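Step 4's upsert method might look like this. The hoodie.* keys are standard Hudi write options; the function name and signature are illustrative, not the notebook's exact code:

```python
def hudi_options(table_name, record_key, precombine_field, partition_field=""):
    """Build the standard write options Hudi expects for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def upsert_hudi_table(df, table_name, record_key, precombine_field, base_path):
    """Upsert a Spark DataFrame into a Hudi table under base_path."""
    options = hudi_options(table_name, record_key, precombine_field)
    (df.write.format("hudi")
       .options(**options)
       .mode("append")  # mode "append" + operation "upsert" = Hudi upsert
       .save(f"{base_path}/{table_name}"))
```

Records sharing a record key are deduplicated by the precombine field (here a timestamp), which is exactly what lets a late-arriving tip overwrite the earlier row.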
Step 5: Sample preview of the dataset
Sample preview of the rides dataset
Sample preview of the tips dataset
Sample preview of the driver dimension
Sample preview of the date dimension
Creating Hudi tables for tips, rides, and the driver and date dimensions
We need a dashboard that shows earnings for drivers, including tips, which can arrive as late updates.
We can use incremental ETL to pull data from the rides and tips table and continuously update the fact table. In scenarios where tips are not available, we can do a left join and set the default value as 0. This is because tips can arrive later as late arriving updates. If we have tips for rides that are not available, we can simply update the tips in the fact table since they are late arriving updates.
If we have both tips and rides, we will update both the tips and the base fare in the fact table, and calculate the total earning. If we only have rides but no tips, we will default the tips value to 0, which can later be updated.
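The left-join-with-default logic described above can be illustrated in plain Python, independent of Spark (a toy sketch, not the notebook's actual code):

```python
def build_earning_facts(rides, tips):
    """Left-join rides with tips. A ride with no tip gets tip_amount 0,
    so the row can be updated later when the late-arriving tip shows up."""
    tips_by_ride = {t["ride_id"]: t["tip_amount"] for t in tips}
    facts = []
    for ride in rides:
        # Equivalent of COALESCE(tip_amount, 0) after a LEFT JOIN.
        tip_amount = tips_by_ride.get(ride["ride_id"], 0.0)
        facts.append({
            "ride_id": ride["ride_id"],
            "driver_id": ride["driver_id"],
            "fare": ride["fare"],
            "tip_amount": tip_amount,
            "total_amount": round(ride["fare"] + tip_amount, 2),
        })
    return facts
```

Every ride produces a fact row, with or without a tip, and the defaulted zeros are the rows the late-arriving updates will correct.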
Python Utility Helper class for Incremental Data Extraction
https://github.com/soumilshah1995/An-easy-to-use-Python-utility-class-for-accessing-incremental-data-from-Hudi-Data-Lakes
This utility is about 300 lines of Python code that helps me pull data incrementally with checkpoints.
Let's create an instance of the helper utility class.
Performing an incremental pull by calling the read() method
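The full utility lives in the repository linked above. Below is a stripped-down, hypothetical sketch of the checkpointing idea behind its read() method: the real class works against Hudi's commit timeline, while this toy version filters in-memory records by timestamp:

```python
import json
import os


class IncrementalPuller:
    """Toy sketch of checkpoint-based incremental extraction.
    Persists the last-seen timestamp so each run resumes where
    the previous one stopped instead of rescanning everything."""

    def __init__(self, table_name, checkpoint_dir="checkpoints"):
        os.makedirs(checkpoint_dir, exist_ok=True)
        self.checkpoint_file = os.path.join(checkpoint_dir, f"{table_name}.json")

    def _last_checkpoint(self):
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)["last_ts"]
        return ""  # empty string sorts before any ISO-8601 timestamp

    def read(self, records):
        """Return only records newer than the saved checkpoint,
        then advance the checkpoint to the newest record seen."""
        last_ts = self._last_checkpoint()
        fresh = [r for r in records if r["ts"] > last_ts]
        if fresh:
            newest = max(r["ts"] for r in fresh)
            with open(self.checkpoint_file, "w") as f:
                json.dump({"last_ts": newest}, f)
        return fresh
```

The second call to read() returns only what arrived since the first call, which is the behavior the screenshots in this section demonstrate against Hudi.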
Let's create a temp view in Spark
Let's prepare the Spark SQL query that builds the Spark DataFrame we can insert into the fact table.
This code joins the incremental rides data with the incremental tips data. If a tip is not present for a ride, it defaults the tip value to 0; when the late-arriving tip shows up later, we update the fact table.
The above code is written in Spark SQL, and it creates a DataFrame called "earning_fact_df" by selecting several columns from the "rides" table and joining it with the "tips" table on the "ride_id" column. The query uses the "LEFT JOIN" clause to include all the rows from the "rides" table, even if there are no matching rows in the "tips" table.
The query also uses the "COALESCE" function to replace any NULL values in the "tip_amount" column with 0. It then calculates the "total_amount" column by adding the "fare" and "tip_amount" columns.
Finally, the query joins the resulting DataFrame with the "date_dim" table on the "date_key" column to get the "date" column using the "to_date" function. The resulting DataFrame has columns for driver_id, ride_id, fare, tip_amount, total_amount, and earning_date_key.
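Based on that description, the query likely resembles the following reconstruction (column and table names are inferred from the text; the notebook's exact SQL may differ):

```python
# Reconstructed Spark SQL for the earnings fact: LEFT JOIN keeps every
# ride, COALESCE defaults missing tips to 0, and the date dimension
# supplies the earning_date_key.
EARNING_FACT_QUERY = """
SELECT
    r.driver_id,
    r.ride_id,
    r.fare,
    COALESCE(t.tip_amount, 0)          AS tip_amount,
    r.fare + COALESCE(t.tip_amount, 0) AS total_amount,
    d.date_key                         AS earning_date_key
FROM rides r
LEFT JOIN tips t
    ON r.ride_id = t.ride_id
JOIN date_dim d
    ON to_date(r.ride_date) = d.date
"""

# In the notebook this would be executed as:
#   earning_fact_df = spark.sql(EARNING_FACT_QUERY)
```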
Upsert the fact data into Hudi fact table
Say you receive a late-arriving tip, which is very common: riders often add a tip later, even the next day.
Late Arriving Tip
Let's see how we can handle this in an incremental fashion.
First, let's upsert this late-arriving tip into the tips Hudi table.
Calling the incremental puller
As you can see, we have resumed from the checkpoint and can now fetch just the newly added tip instead of performing a full table scan.
Perfect
Creating snapshots so I can perform an update on the fact table
Creating the Spark SQL to update the tip in the fact table
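One plausible shape for that update query (a sketch with illustrative view names such as earning_fact_snapshot and tips_incremental) joins the fact snapshot with the newly pulled tips and recomputes the total:

```python
# Sketch of the fact-table correction query: match the current fact
# rows against the incrementally pulled tips and recompute totals.
# (View names here are illustrative, not the notebook's exact names.)
UPDATE_FACT_QUERY = """
SELECT
    f.driver_id,
    f.ride_id,
    f.fare,
    t.tip_amount,
    f.fare + t.tip_amount AS total_amount,
    f.earning_date_key
FROM earning_fact_snapshot f
JOIN tips_incremental t
    ON f.ride_id = t.ride_id
"""
```

Because the fact table's record key is the ride, upserting this result replaces the old rows whose tip was defaulted to 0.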
Now upserting the new update into the Hudi fact table
The complete notebook can be found at:
https://soumilshah1995.blogspot.com/2023/04/efficiently-managing-ride-and-late.html
Final Result
The fact table captures the earnings of drivers based on their rides and tips. By joining the fact table with the "drivers" table, the business can identify the highest-earning drivers, their average rating, and the total number of rides. This information can help in identifying top-performing drivers, incentivizing them, and improving the overall driver experience.
The fact table also captures the earnings based on the date of the ride. By joining the fact table with the "date_dim" table, the business can identify trends in earnings based on the year, month, quarter, and weekday. This information can help in identifying the most profitable time periods for the business and optimizing the supply of drivers during those periods.
Moreover, the fact table captures the fare, tip, and total amounts for each ride. By analyzing these values, the business can identify trends in customer behavior, preferences, and satisfaction. This information can help in improving the quality of service and customer experience, and ultimately, increase the revenue for the business.
In summary, the Star Schema with the fact table "driver_earnings" and related dimension tables can provide valuable insights for the business to optimize driver supply, improve driver experience, identify profitable time periods, and improve the overall customer experience.
Special Thanks
We would like to express our sincere gratitude to Uber for their insightful article on Uber's Lakehouse Architecture, which we found to be informative and engaging. The article sheds light on the innovative approach Uber has taken to handle their vast amounts of data, providing valuable insights into the world of big data and the challenges that organizations face in managing it effectively.
We greatly appreciate Uber's willingness to share their knowledge and experience in this field, which will undoubtedly benefit professionals and organizations working in data management and analytics. We commend Uber for their dedication to staying at the forefront of technology and innovation, and for their commitment to providing high-quality services to their customers.
Once again, we would like to extend our special thanks to Uber for publishing such a valuable resource, and we look forward to learning more from their future contributions in the field.