Data validation class library for Scala/Spark and PySpark
In 2019, I created a Scala-based class library of technical data validation rules for Spark. What is different about the library? It is metadata-based: the rules are specified as JSON in a database or a file. The application reads the rules from that input, calls the rule factory and applies the resulting rules to the Spark DataFrame. Upon execution, each cell is marked True or False depending on its value and the rule applied.
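To make the idea concrete, here is a minimal PySpark-flavoured sketch of metadata-driven validation. The rule names (`not_null`, `in_range`), the column names and the small factory function are illustrative assumptions, not the library's actual API.

```python
import json
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative rule metadata; the real library reads this from a file or a database.
rules_json = """
[
  {"rule": "not_null", "column": "customer_id"},
  {"rule": "in_range", "column": "age", "min": 0, "max": 120}
]
"""

def build_rule(spec):
    """Tiny stand-in for a rule factory: maps a rule name to a boolean Column expression."""
    if spec["rule"] == "not_null":
        return F.col(spec["column"]).isNotNull()
    if spec["rule"] == "in_range":
        return F.col(spec["column"]).between(spec["min"], spec["max"])
    raise ValueError(f"Unknown rule: {spec['rule']}")

def apply_rules(df: DataFrame, specs) -> DataFrame:
    """Adds one True/False flag column per rule, e.g. customer_id_not_null."""
    for spec in specs:
        df = df.withColumn(f"{spec['column']}_{spec['rule']}", build_rule(spec))
    return df

df = spark.createDataFrame([(1, 34), (None, 150)], ["customer_id", "age"])
apply_rules(df, json.loads(rules_json)).show()
```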
Then in 2020, I started working on a cloud migration accelerator where data validation was to be offered as a service. The accelerator has many modules and is microservice-based. As the framework is Python-based, I implemented the library as PySpark classes, keeping the same philosophy as the Scala library. For easier management, I even packaged the Python classes into a wheel.
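Packaging the classes as a wheel needs little more than a standard setuptools file. A minimal sketch; the package name and version below are placeholders, not the accelerator's real ones.

```python
# setup.py - minimal packaging sketch (placeholder name and version)
from setuptools import setup, find_packages

setup(
    name="data-validation-rules",
    version="0.1.0",
    packages=find_packages(),
)

# Build the wheel with:  python -m build --wheel
# (or the older form:    python setup.py bdist_wheel)
# The resulting .whl under dist/ can then be pip-installed wherever the job runs.
```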
Now I had two versions of the same library - one for Scala and one for Python. To reduce the maintenance headache, I wanted to follow the approach used by so many libraries I rely on - publish a Spark JAR and provide a Python interface on top of it.
And a couple of days ago, I did exactly that. I now have a single library (a Scala JAR) that I can use in Scala/Spark projects as well as Python/Spark (PySpark) projects. As you would expect, the PySpark library is a thin wrapper over the Scala library that handles the conversions between Scala and Python.
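The wrapper side follows the usual Py4J pattern: reach into `spark._jvm`, hand over the Java DataFrame that backs the Python one, and wrap the result back into a PySpark DataFrame. A minimal sketch, assuming a hypothetical Scala class `com.example.validation.RuleEngine` with an `applyRules` method; the library's real class and method names will differ.

```python
from pyspark.sql import SparkSession, DataFrame

class RuleEngineWrapper:
    """Python-side wrapper around a Scala validation engine (hypothetical names)."""

    def __init__(self, spark: SparkSession):
        self._spark = spark
        # Py4J handle to the Scala class; requires the JAR on the Spark classpath,
        # e.g. spark-submit --jars data-validation-rules.jar
        self._engine = spark._jvm.com.example.validation.RuleEngine()

    def apply_rules(self, df: DataFrame, rules_json: str) -> DataFrame:
        # Pass the underlying Java DataFrame to the Scala side ...
        jdf = self._engine.applyRules(df._jdf, rules_json)
        # ... and wrap the returned Java DataFrame back into a Python one.
        # Older PySpark expects an SQLContext here; newer versions accept the session.
        ctx = getattr(self._spark, "_wrapped", self._spark)
        return DataFrame(jdf, ctx)
```

The only Spark-specific plumbing is `df._jdf` and the `DataFrame(jdf, ...)` constructor; the rule parsing and validation logic stay in the Scala JAR, so there is one implementation to maintain.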