Data validation class library for Scala/Spark and PySpark

In 2019, I created a Scala-based class library of rules for technical data validation in Spark. What makes this library different? It is metadata-driven: the rules are specified as JSON in a database or a file. The application reads the rules from that input, calls the rule factory, and applies the resulting rules to a Spark DataFrame. Upon execution, each cell is marked True or False depending on its value and the rule applied.
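To make the idea concrete, here is a minimal PySpark-flavoured sketch of the metadata-driven approach. The JSON shape, the rule types, and the build_rule / apply_rules helpers are illustrative assumptions for this post, not the library's actual API:

```python
import json
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def build_rule(rule: dict):
    # Illustrative rule factory: maps a rule type from the JSON metadata
    # to a boolean Column expression (the rule types shown are examples only).
    col = F.col(rule["column"])
    if rule["type"] == "not_null":
        return col.isNotNull()
    if rule["type"] == "in_set":
        return col.isin(rule["values"])
    if rule["type"] == "matches":
        return col.rlike(rule["pattern"])
    raise ValueError(f"Unknown rule type: {rule['type']}")

def apply_rules(df: DataFrame, rules_json: str) -> DataFrame:
    # Each rule adds a True/False flag column recording whether the cell passed.
    for rule in json.loads(rules_json):
        flag = f"{rule['column']}_{rule['type']}_ok"
        df = df.withColumn(flag, build_rule(rule))
    return df
```

A rule file as simple as [{"column": "age", "type": "not_null"}] would then add an age_not_null_ok flag to every row of the DataFrame.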

Then in 2020, I started working on a cloud migration accelerator where data validation was to be offered as a service. The accelerator has many modules and is microservice-based. As that framework is Python-based, I implemented the library as PySpark classes, keeping the same philosophy as the Scala library. For easier management, I even packaged the Python classes into a wheel.
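For reference, a wheel can be produced with a minimal setuptools configuration along these lines (the package name and metadata below are placeholders, not the accelerator's actual ones):

```python
# setup.py - minimal packaging sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name="dq_rules",            # hypothetical distribution name
    version="0.1.0",
    packages=find_packages(),   # picks up the PySpark rule classes
    install_requires=[],        # PySpark is usually provided by the cluster, so it is not pinned here
)
```

Running python -m build (or the older python setup.py bdist_wheel) then produces the .whl that can be attached to a cluster or installed into a virtual environment.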

That left me with two versions of the same library - one in Scala and one in Python. To reduce the maintenance headache, I wanted to follow the approach used by so many of the libraries I rely on: publish a single Spark JAR and provide a Python interface for it.

And I did exactly that a couple of days ago. I now have a single library (a Scala JAR) that I can use in Scala/Spark projects as well as Python/Spark (PySpark) projects. The PySpark side is a thin wrapper over the Scala library that handles the conversions between Python and Scala.
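As an illustration of what such a wrapper looks like, here is a minimal sketch that calls a Scala entry point through PySpark's JVM gateway. The class and method names (com.example.dq.RuleEngine.applyRules) are assumptions for the example, and the JAR is expected to be on the driver classpath (for instance via --jars):

```python
from pyspark.sql import DataFrame, SparkSession

def apply_rules_via_jar(spark: SparkSession, df: DataFrame, rules_json: str) -> DataFrame:
    # _jvm is PySpark's py4j gateway into the JVM; com.example.dq.RuleEngine
    # is a hypothetical Scala object exposed by the JAR, not the real API.
    jvm = spark._jvm
    jdf = jvm.com.example.dq.RuleEngine.applyRules(df._jdf, rules_json)
    # Wrap the returned Java DataFrame back into a PySpark DataFrame.
    # (On Spark versions before 3.3, pass a SQLContext instead of the session.)
    return DataFrame(jdf, spark)
```

The wrapper's job is essentially this kind of plumbing: pass the rule metadata across the boundary, invoke the Scala implementation once, and re-wrap the result so Python callers never see the JVM objects.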

#datavalidation #spark #scala #python #library #pyspark #dqus
