The Conundrum of Packaging

Though my example comes from Python, the point holds for any programming environment.

For over a year, I worked on an accelerator that enables the movement of data from legacy environments to cloud environments (Azure, AWS and Google).

I was dealing with the data validation, data profiling and data reconciliation functionality. When we looked closely at these modules, we realized that they remain unchanged no matter which cloud platform they are implemented on.

Primarily with this intent - one library to rule them all - we decided to develop the functionality using Spark (Python/Spark to be precise). While we were working on specific platforms at any given point in time, the expectation was that we support as many data stores as possible, as there is always a desire to showcase the wide suitability of our accelerator.

To support multiple data stores, we adopted a class-based approach covering AWS S3, ADLS Gen2, GCS, SQL Server, Teradata, Oracle and Snowflake, to name a few (actually, I named many :-) ). We implemented a proper class hierarchy as well as a factory pattern.
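To make the idea concrete, here is a minimal sketch of a class hierarchy with a factory. It is not our actual code: the names DataStore, S3Store, GCSStore, SQLServerStore, create_store and read_options are hypothetical, and the option dictionaries are only illustrative.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical base class: the interface every data store connector shares."""

    def __init__(self, location: str):
        self.location = location

    @abstractmethod
    def read_options(self) -> dict:
        """Return illustrative reader settings for this store."""


class S3Store(DataStore):
    def read_options(self) -> dict:
        return {"format": "parquet", "path": f"s3a://{self.location}"}


class GCSStore(DataStore):
    def read_options(self) -> dict:
        return {"format": "parquet", "path": f"gs://{self.location}"}


class SQLServerStore(DataStore):
    def read_options(self) -> dict:
        return {"format": "jdbc", "dbtable": self.location}


# Registry used by the factory; supporting a new store means adding one entry.
_REGISTRY = {"s3": S3Store, "gcs": GCSStore, "sqlserver": SQLServerStore}


def create_store(kind: str, location: str) -> DataStore:
    """Factory: map a store name to the matching DataStore subclass."""
    try:
        store_cls = _REGISTRY[kind.lower()]
    except KeyError:
        raise ValueError(f"Unsupported data store: {kind!r}") from None
    return store_cls(location)


store = create_store("gcs", "my-bucket/sales")
print(type(store).__name__, store.read_options())
```

The calling code (validation, profiling, reconciliation) only ever talks to create_store and the DataStore interface, which is why the same modules can stay unchanged across platforms.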

To use such a structure in Python, the usual approach is to place the files either in the same directory or in a sub-directory structure. While we did use a sub-directory structure that follows the namespace, our bigger problem was that the same class library was used by all of the functional elements mentioned above (data validation, data profiling and data reconciliation). Copying the directory structure into each would have left us with copies of the same code in multiple places, making it difficult to maintain in the long run.

So we decided to package all the classes into a Python wheel. Using this mechanism, we create a package that can be installed into the Python environment using the pip command. To use the classes, we only need the appropriate import statements.

While the wheel mechanism makes it easy to deploy the code, it has a downside, which shows up when we have to identify issues with the code. If the issue lies in a class or function inside the wheel package, it is harder to trace into that code than when the source sits in the project's own directory.
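For what it is worth, a wheel normally installs plain .py source files into the environment's site-packages directory, so the code can usually at least be located and opened there. A minimal sketch of finding that location follows; it uses the standard library's json module purely as a stand-in for the packaged library, and source_location is a hypothetical helper name.

```python
import inspect
import json  # stand-in for a module installed from a wheel


def source_location(obj):
    """Return (file path, first line number) of the source that defines obj."""
    path = inspect.getsourcefile(obj)
    _, first_line = inspect.getsourcelines(obj)
    return path, first_line


path, line = source_location(json.dumps)
print(f"json.dumps is defined in {path}, starting at line {line}")
```

Whether a debugger will actually step into that file depends on the tool and its settings, and on whether the package ships its source at all, so this is a starting point rather than a cure.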

This very point became a matter of argument when we implemented the functionality for a customer. I was asked why I had implemented it as a wheel package.

I was also asked why I had packaged the AWS S3 functionality when the implementation was for Google Cloud. Here too, people forgot that during the build cycle we wanted to support many data stores.

And that presents us with a Hamlet-like situation - to package or not to package. That is the question.

#python #package #wheel
