Unlocking Incremental Data in PySpark: Extracting from JDBC Sources without Debezium or AWS DMS with CDC
Video Based Tutorials
Authors
Soumil Nitin Shah
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach people about data science, machine learning, Elasticsearch, and AWS. I work as a Lead Data Engineer, where I spend most of my time developing ingestion frameworks and building microservices and scalable architectures on AWS. I have worked with massive amounts of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning and the right file formats and compression. I have also developed and worked on streaming applications that ingest real-time data via Kinesis and Firehose into Elasticsearch.
Divyansh Patel
I'm a highly skilled and motivated professional with a Master's degree in Computer Science and extensive experience in Data Engineering and AWS Cloud Engineering. I'm currently working with the renowned industry expert Soumil Shah and thrive on tackling complex problems and delivering innovative solutions. My passion for problem-solving and commitment to excellence enable me to make a positive impact on any project or team I work with. I look forward to connecting and collaborating with like-minded professionals.
Introduction
Data is the lifeblood of any organization, and the ability to extract and process it efficiently is crucial for making informed business decisions. But extracting data from databases can be a time-consuming and resource-intensive task, especially when dealing with large datasets. Fortunately, PySpark provides a powerful toolset for extracting, processing, and analyzing data efficiently, making it an ideal choice for many data extraction tasks.
In this article, we'll explore how PySpark can be used to extract incremental data from JDBC sources without the need for Debezium or AWS DMS. We'll discuss the advantages of using primary keys (PK) and updated_at columns to extract updated and newly inserted data, and how this approach can be used to pull data from any source database using JDBC.
What is Incremental Data Processing?
Incremental data processing is a technique for processing only the data that has changed since the last time it was processed, rather than processing the entire dataset every time. This approach can significantly reduce processing time and resource usage, especially when dealing with large datasets.
The primary advantages of incremental data processing are efficiency and cost-effectiveness. By processing only the changed data, you can save time and resources, and reduce the amount of data that needs to be stored and processed. This approach can be particularly beneficial when dealing with large datasets that are frequently updated, such as social media feeds, financial transactions, or sensor data.
Hands-on Labs
Step 1: Spin up a Postgres Database using Docker Compose
docker-compose up --build
This will start a Postgres database on your local machine.
Step 2: Create a Table and Populate It with Fake Data
Run the Python file:
python ingest.py
The Python file can be found here:
https://github.com/soumilshah1995/Unlocking-Incremental-Data-in-PySpark-Extracting-from-JDBC-Sources-without-Debezium-or-AWS-DMS-with/blob/main/ingest.py
The Python script creates a table called sales in the public schema.
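The actual script lives in the repository linked above; the sketch below only illustrates the general shape of such an ingest script. The connection details, column layout, and the use of psycopg2 and Faker are assumptions for a local Docker Postgres, not necessarily what the repository uses.

```python
# Hypothetical sketch of an ingest script: creates a sales table and fills it
# with fake rows. Connection details and schema are assumptions.
import psycopg2
from faker import Faker

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="postgres",   # assumed credentials
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS public.sales (
        id          SERIAL PRIMARY KEY,
        customer    TEXT,
        amount      NUMERIC(10, 2),
        created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
""")

fake = Faker()
for _ in range(100):  # the article inserts 100 records
    cur.execute(
        "INSERT INTO public.sales (customer, amount) VALUES (%s, %s)",
        (fake.name(), fake.pydecimal(left_digits=4, right_digits=2, positive=True)),
    )

cur.close()
conn.close()
```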
Next, we will create a trigger so that whenever a record is updated, its updated_at column is automatically updated as well.
Creating the Trigger
This code is creating a PostgreSQL function and trigger that updates the "updated_at" column of a table called "sales" every time a row is updated.
The function "update_sales_updated_at()" takes no arguments and returns a "TRIGGER" object. The function sets the "updated_at" column of the "NEW" row to the current timestamp using the "CURRENT_TIMESTAMP" function and returns the "NEW" row.
The trigger "update_sales_updated_at_trigger" is created using the "CREATE TRIGGER" statement and is set to execute the "update_sales_updated_at()" function before every update on the "public.sales" table, for each row being updated.
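Reconstructed from the description above, the function and trigger might look like the following sketch, executed here from Python via psycopg2 (the connection settings are assumptions; only the function and trigger names come from the article):

```python
# Creates a function and a BEFORE UPDATE trigger so updated_at is refreshed
# automatically whenever a row in public.sales is modified.
import psycopg2

TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION update_sales_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = CURRENT_TIMESTAMP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS update_sales_updated_at_trigger ON public.sales;

CREATE TRIGGER update_sales_updated_at_trigger
BEFORE UPDATE ON public.sales
FOR EACH ROW
EXECUTE FUNCTION update_sales_updated_at();
"""

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")  # assumed credentials
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(TRIGGER_SQL)
conn.close()
```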
We have inserted 100 records into the sales table.
Step 3: Run the PySpark Template That Pulls Incremental Data
I will explain the entire code logic in detail at the end.
Now, if I run the template again, I expect no data to be returned, since nothing has changed.
Now let's update a record and see if the template can capture it, as shown below.
Let's run the template again to see if we can capture this new change.
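A single update issued against Postgres is enough to exercise the trigger and the incremental pull. The statement below is only an illustrative example (again via psycopg2, with assumed credentials), not the exact update used in the original walkthrough:

```python
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")  # assumed credentials
conn.autocommit = True
with conn.cursor() as cur:
    # The BEFORE UPDATE trigger bumps updated_at, so the next incremental
    # run should pick this row up even though its id is not new.
    cur.execute("UPDATE public.sales SET amount = amount + 1 WHERE id = 1")
conn.close()
```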
Deep Dive into the Code and Logic
We define the imports
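A minimal set of imports for such a template might look like this (a sketch; the original template's imports may differ):

```python
# Minimal imports for the incremental-pull template (illustrative only).
import os
import json

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```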
We declare the settings
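The settings are essentially the JDBC connection details plus the columns used for incremental tracking. The values and key names below are placeholders, not the original template's configuration:

```python
# Placeholder settings; adjust to your own database and table.
settings = {
    "jdbc_url": "jdbc:postgresql://localhost:5432/postgres",
    "jdbc_driver": "org.postgresql.Driver",
    "user": "postgres",
    "password": "postgres",
    "table": "public.sales",
    "primary_key": "id",                      # monotonically increasing PK
    "updated_at_column": "updated_at",        # maintained by the trigger
    "checkpoint_path": "./checkpoint.json",   # where max id / date are persisted
}
```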
Code Logic:
If a checkpoint does not exist, the script assumes the user is running the template for the first time and pulls all the data at once. On subsequent runs, we want to pull only incremental data: we load the most recent maximum id and updated date into variables, and if checkpoints exist we set first_time_read to False, indicating that a previous run has already been recorded.
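In code, that decision might look like the sketch below, assuming the checkpoint is a small JSON file holding the last processed id and timestamp (a hypothetical format, reusing the imports and settings sketched above):

```python
# Load the checkpoint if it exists; otherwise treat this as a first-time (full) read.
first_time_read = True
last_max_id = None
last_updated_at = None

if os.path.exists(settings["checkpoint_path"]):
    with open(settings["checkpoint_path"]) as f:
        checkpoint_data = json.load(f)
    last_max_id = checkpoint_data.get("max_id")
    last_updated_at = checkpoint_data.get("max_updated_at")
    first_time_read = False
```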
These are two helper classes that hold parameters such as the max ID and updated date, along with other process information in flags, as shown in the figure.
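The figure is not reproduced here, but conceptually the two helpers could be sketched as simple dataclasses, one for the checkpoint values and one for run-time flags (the names are illustrative, not the original ones):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Checkpoint:
    """Holds the high-water marks from the previous run."""
    max_id: Optional[int] = None
    max_updated_at: Optional[str] = None

@dataclass
class RunFlags:
    """Holds process information, such as whether this is the first read."""
    first_time_read: bool = True
```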
Main Logic
This is the main logic that was explained in the flow charts above.
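As a rough sketch of that logic (not the original template's code, and reusing the imports, settings, and helper classes assumed above): on the first run the whole table is read; on later runs only rows with a larger primary key (new inserts) or a newer updated_at (updates, thanks to the trigger) are read, and the checkpoint is then refreshed with the new maxima.

```python
def incremental_pull(spark, settings, checkpoint, flags):
    """Pull either the full table (first run) or only new/updated rows."""
    if flags.first_time_read:
        query = f"SELECT * FROM {settings['table']}"
    else:
        # New inserts have a larger primary key; updated rows have a newer
        # updated_at maintained by the trigger created earlier.
        query = (
            f"SELECT * FROM {settings['table']} "
            f"WHERE {settings['primary_key']} > {checkpoint.max_id} "
            f"OR {settings['updated_at_column']} > '{checkpoint.max_updated_at}'"
        )

    df = (
        spark.read.format("jdbc")
        .option("url", settings["jdbc_url"])
        .option("driver", settings["jdbc_driver"])
        .option("user", settings["user"])
        .option("password", settings["password"])
        .option("query", query)
        .load()
    )

    if df.count() > 0:
        # Persist the new high-water marks for the next run.
        row = df.agg(
            F.max(settings["primary_key"]).alias("max_id"),
            F.max(settings["updated_at_column"]).alias("max_updated_at"),
        ).collect()[0]
        with open(settings["checkpoint_path"], "w") as f:
            json.dump({"max_id": row["max_id"],
                       "max_updated_at": str(row["max_updated_at"])}, f)
    return df
```

To run a sketch like this, the SparkSession needs the Postgres JDBC driver on its classpath (for example via spark.jars.packages with the org.postgresql:postgresql artifact); the query is pushed down to Postgres, so only the new and updated rows travel over JDBC.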
This technique, along with its template, can recognize new inserts and updates while incrementally retrieving data. It should be noted, however, that deletes are not supported through this method. If delete capture is necessary, using DMS or Debezium is recommended; if it is not a requirement, this option can be a cost-effective and faster alternative to performing full table scans.
Conclusion
In conclusion, pulling data from JDBC using Python and PySpark can be a daunting task, especially when dealing with large datasets. However, by following the step-by-step guide outlined in this blog, users can easily and efficiently incrementally pull data from JDBC with minimal effort. By utilizing PySpark's powerful features, such as filtering and aggregation, users can extract only the necessary data and improve performance. Additionally, by incorporating a combination of primary key and last updated date, users can not only pull new data but also updates to existing records. This technique can help users stay up-to-date with their data sources and make informed decisions based on the most current information available. Overall, the ability to incrementally pull data from JDBC with Python and PySpark is an invaluable skill for any data professional, and with the right tools and techniques, it can be a straightforward and efficient process.