Step-by-Step Guide to Incrementally Pulling Data from JDBC with Python and PySpark

As data volumes increase and the need for real-time insights becomes more pressing, businesses are turning to incremental data loading to improve their ETL processes. By only extracting the new data since the last data pull, companies can reduce processing times and improve their data accuracy.

In this blog post, we'll walk through a step-by-step guide to incrementally pulling data from a JDBC source using Python and PySpark. We'll also cover how to leverage the source table's primary key to pull only new data and avoid a full table scan. Additionally, we'll touch on how to use the updated_time column in combination with the primary key to capture updates as well.

Step 1: Set Up the Environment

First, let's spin up Postgres:

docker-compose up --build

This will start a Postgres database on your local machine.
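For reference, a minimal docker-compose.yml that brings up such a database might look like the sketch below. The image tag, credentials, and port mapping here are assumptions, not the exact file from the repository.

# docker-compose.yml (illustrative sketch; image tag, credentials, and port are assumptions)
version: "3.8"
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"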

Step 2: Create a Table and Populate It with Fake Data

Run the Python file:

python ingest.py

The Python file can be found here:

https://github.com/soumilshah1995/Step-by-Step-Guide-to-Incrementally-Pulling-Data-from-JDBC-with-Python-and-PySpark/blob/main/ingest.py


We have inserted 100 records into the sales table.
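As a rough sketch, an ingest script of this kind might create the table and insert fake rows as follows. The connection settings, table schema, and use of the Faker library are assumptions; see ingest.py in the linked repository for the actual script.

# ingest_sketch.py -- illustrative fake-data loader; connection settings,
# schema, and column names are assumptions, not the exact repository code.
import psycopg2
from faker import Faker

fake = Faker()

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres", user="postgres", password="postgres",
)
cur = conn.cursor()

# Auto-incrementing primary key plus an updated_time column for later use.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        id           SERIAL PRIMARY KEY,
        customer     TEXT,
        amount       NUMERIC(10, 2),
        updated_time TIMESTAMP DEFAULT NOW()
    )
""")

# Insert 100 fake sales records.
for _ in range(100):
    cur.execute(
        "INSERT INTO sales (customer, amount) VALUES (%s, %s)",
        (fake.name(), fake.pydecimal(left_digits=4, right_digits=2, positive=True)),
    )

conn.commit()
cur.close()
conn.close()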



Step 3: Run the PySpark Template That Pulls Incremental Data


Code Explanation

In summary, this code connects to a PostgreSQL database through PySpark's JDBC connector and extracts data from a specified table incrementally. The main function reads a checkpoint file to get the maximum primary key value from the previous extraction and constructs an incremental query using that value, so only rows inserted since the last run are pulled from the database. After extraction, the new maximum primary key value is computed from the returned data and written back to the checkpoint file for use in the next run. Finally, the extracted incremental data is displayed on the console.
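A minimal sketch of that flow is shown below. The checkpoint file name, table name, and connection properties are assumptions; the actual code lives in template.py in the linked repository.

# template_sketch.py -- illustrative checkpoint-based incremental extraction.
# File paths, table name, and connection properties are assumptions.
import os
from pyspark.sql import SparkSession

CHECKPOINT_FILE = "checkpoint.txt"

def read_checkpoint():
    # Return the max primary key from the previous run, or 0 on the first run.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    return 0

def write_checkpoint(max_id):
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(max_id))

def main():
    spark = (
        SparkSession.builder
        .appName("incremental-jdbc-pull")
        .config("spark.jars.packages", "org.postgresql:postgresql:42.5.4")
        .getOrCreate()
    )

    last_max_id = read_checkpoint()

    # Push the filter down to Postgres so only new rows cross the network.
    incremental_query = f"(SELECT * FROM sales WHERE id > {last_max_id}) AS t"

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/postgres")
        .option("dbtable", incremental_query)
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    # Advance the checkpoint only if new rows were actually pulled.
    if df.count() > 0:
        new_max_id = df.agg({"id": "max"}).collect()[0][0]
        write_checkpoint(new_max_id)

    df.show()

if __name__ == "__main__":
    main()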

Output of running template.py:


Running the Template Again

Running the template again without inserting new data should return no new records, since the checkpoint already holds the latest primary key.


Next, we run ingest.py again to add more data, then run the template to verify that only the new data is pulled; it should pull everything with an ID greater than 100.




Advantages of Incremental Extraction

There are several advantages to performing incremental extraction, including:

  1. Reduced network traffic: Incremental extraction only retrieves the data that has changed since the last extraction, which reduces the amount of data transferred over the network.
  2. Reduced workload on the database: Incremental extraction reduces the workload on the database by only retrieving the data that has changed.
  3. Faster processing time: With less data to extract and transfer, each ETL run completes more quickly.


TIP

Combining the auto-incrementing primary key with a record's updated date is an effective way to identify both newly inserted and updated records in a database. By comparing the maximum primary key and the maximum updated date from the previous extraction against the current database records, you can determine which rows have been inserted or updated since the last run and extract only those, as sketched below.
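As an illustrative sketch, the incremental predicate could combine both columns as shown here. The column names (id, updated_time) and the checkpoint variables are assumptions carried over from the sketches above, not the exact code from the repository.

# Sketch: catch both new inserts (id) and updates to existing rows (updated_time).
# last_max_id and last_max_updated_time come from the previous run's checkpoint.
incremental_query = f"""(
    SELECT * FROM sales
    WHERE id > {last_max_id}
       OR updated_time > '{last_max_updated_time}'
) AS t"""

After each run, both the new maximum id and the new maximum updated_time would be written back to the checkpoint so the next extraction resumes from the right position on both columns.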

I am also working on a follow-up blog post on how to capture updated records using the updated_at column in combination with the primary key.

Conclusion

In conclusion, pulling data from JDBC using Python and PySpark can be a daunting task, especially when dealing with large datasets. However, by following the step-by-step guide outlined in this blog, users can easily and efficiently incrementally pull data from JDBC with minimal effort. By utilizing PySpark's powerful features, such as filtering and aggregation, users can extract only the necessary data and improve performance. Additionally, by incorporating a combination of primary key and last updated date, users can not only pull new data but also updates to existing records. This technique can help users stay up-to-date with their data sources and make informed decisions based on the most current information available. Overall, the ability to incrementally pull data from JDBC with Python and PySpark is an invaluable skill for any data professional, and with the right tools and techniques, it can be a straightforward and efficient process.
