Data validation class library for Scala/Spark and PySpark

In 2019, I created a Scala-based class library of rules for technical data validation in Spark. What makes this library different? It is metadata-driven: the rules are specified as JSON in a database or a file. The application reads the rules from that input, calls the rule factory, and applies the resulting rules to a Spark DataFrame. Upon execution, each cell is marked True or False depending on its value and the rule applied.
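To make the idea concrete, here is a minimal PySpark-flavoured sketch of the metadata-driven approach. The JSON shape, the rule types, and the build_rule / apply_rules helpers are illustrative assumptions for this post, not the library's actual API:

```python
import json
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def build_rule(rule: dict):
    # Illustrative rule factory: maps a rule type from the JSON metadata
    # to a boolean Column expression (the rule types shown are examples only).
    col = F.col(rule["column"])
    if rule["type"] == "not_null":
        return col.isNotNull()
    if rule["type"] == "in_set":
        return col.isin(rule["values"])
    if rule["type"] == "matches":
        return col.rlike(rule["pattern"])
    raise ValueError(f"Unknown rule type: {rule['type']}")

def apply_rules(df: DataFrame, rules_json: str) -> DataFrame:
    # Each rule adds a True/False flag column recording whether the cell passed.
    for rule in json.loads(rules_json):
        flag = f"{rule['column']}_{rule['type']}_ok"
        df = df.withColumn(flag, build_rule(rule))
    return df
```

A rule file as simple as [{"column": "age", "type": "not_null"}] would then add an age_not_null_ok flag to every row of the DataFrame.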

Then in 2020, I started working on a cloud migration accelerator where data validation was to be offered as a service. The accelerator has many modules and is microservice-based. As that framework is Python-based, I implemented the library as PySpark classes, keeping the same philosophy as the Scala library. For easier management, I even packaged the Python classes into a wheel.
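For reference, a wheel can be produced with a minimal setuptools configuration along these lines (the package name and metadata below are placeholders, not the accelerator's actual ones):

```python
# setup.py - minimal packaging sketch; name and version are placeholders
from setuptools import setup, find_packages

setup(
    name="dq_rules",            # hypothetical distribution name
    version="0.1.0",
    packages=find_packages(),   # picks up the PySpark rule classes
    install_requires=[],        # PySpark is usually provided by the cluster, so it is not pinned here
)
```

Running python -m build (or the older python setup.py bdist_wheel) then produces the .whl that can be attached to a cluster or installed into a virtual environment.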

That left me with two versions of the same library - one in Scala and one in Python. To reduce the maintenance headache, I wanted to follow the approach used by so many of the libraries I rely on: publish a single Spark JAR and provide a Python interface for it.

And I did exactly that a couple of days ago. I now have a single library (a Scala JAR) that I can use in Scala/Spark projects as well as Python/Spark (PySpark) projects. The PySpark side is a thin wrapper over the Scala library that handles the conversions between Python and Scala.
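As an illustration of what such a wrapper looks like, here is a minimal sketch that calls a Scala entry point through PySpark's JVM gateway. The class and method names (com.example.dq.RuleEngine.applyRules) are assumptions for the example, and the JAR is expected to be on the driver classpath (for instance via --jars):

```python
from pyspark.sql import DataFrame, SparkSession

def apply_rules_via_jar(spark: SparkSession, df: DataFrame, rules_json: str) -> DataFrame:
    # _jvm is PySpark's py4j gateway into the JVM; com.example.dq.RuleEngine
    # is a hypothetical Scala object exposed by the JAR, not the real API.
    jvm = spark._jvm
    jdf = jvm.com.example.dq.RuleEngine.applyRules(df._jdf, rules_json)
    # Wrap the returned Java DataFrame back into a PySpark DataFrame.
    # (On Spark versions before 3.3, pass a SQLContext instead of the session.)
    return DataFrame(jdf, spark)
```

The wrapper's job is essentially this kind of plumbing: pass the rule metadata across the boundary, invoke the Scala implementation once, and re-wrap the result so Python callers never see the JVM objects.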

#datavalidation #spark #scala #python #library #pyspark #dqus
