Build Your Own! A corporate Python library for Cloud and Data Engineering (Part 1/2)
Introduction
Imagine a fast-paced tech company experiencing exponential growth: headcount expands, clientele grows, and the project portfolio flourishes. This rapid growth can easily outpace the evolution of internal processes, leading to inefficiencies and duplicated efforts among developers; independently and unknowingly, they may find themselves reinventing the wheel. The result is wasted time, resources, and a patchwork of inconsistent code, hindering efficiency and complicating maintenance, thus increasing the risk of bugs and errors.
This fragmentation arises from a lack of centralized knowledge sharing and collaboration, a challenge we experienced firsthand in our Cloud and Data Engineering Team. To address this, Riccardo Rubini, PhD and I embarked on a journey to establish a shared codebase of reusable components and processes. This involved building a Python library, enforcing coding standards, and setting up a continuous integration and continuous delivery (CI/CD) pipeline in GitLab.
In this series of hands-on articles, we will delve into the process of:
- building a shared Python library of reusable components;
- enforcing coding standards and quality checks;
- setting up a continuous integration and continuous delivery (CI/CD) pipeline in GitLab to build and distribute the package.
By following these steps, we aim to effectively eliminate redundant code, promote consistency, and foster a more efficient and collaborative environment.
Find the mentioned snippets of code at: Build Your Own! A corporate Python library for Cloud and Data Engineering (LinkedIn Article) ($3686285) · Snippets · GitLab
Project requirements and context
There's a lot of routine in our work: connecting to an SFTP server, interacting with the APIs of known platforms (e.g., Salesforce, Airship) and cloud SDKs, or transforming data for ingestion. Initially, we considered a shared Git repository as a space for pushing code snippets, but we fell short of making this a habit. Code ended up scattered across GitHub, GitLab, Azure DevOps, and even our personal repos. And let's not even delve into the state of documentation - there was none. As mentioned earlier, our processes lagged behind, and given our fast-paced environment, onboarding new hires to this system became an afterthought, with some never truly getting acquainted. Slowly but surely, we found ourselves back at square one.
While there's room for creativity, and despite the occasional flair for chaos by some who use Java, C#, or Node.js (hello, that's me), our Data and Cloud Engineering Team predominantly writes code in Python, which marked our starting point. For the Git repository, we chose GitLab due to its open-source nature, its CI/CD features, and our history with it.
This time, however, we took a step further: we created a single repository with a CI/CD pipeline, enforced stringent coding rules, and established disciplined branch management. Finally, to encourage widespread use across projects and clouds, we created a distributable package. With a growing team, maintaining quality demands structure; free rein was no longer an option.
Having said that, believe me when I say we could have been even more stringent. I personally toyed with the idea of using VS Code with the Dev Containers extension to make things smoother than smooth. However, not everyone's on the VS Code train (crazy, I know!), and I don't think forcing an IDE is going to win hearts, so we're not diving into evil territory just yet.
Solution overview
As we wish for this library to foster both intuitive use and seamless collaboration, ultimately creating a more efficient and enjoyable development experience, we strategically designed the structure of the packages inside.
For intuitive navigation, we placed domain-specific packages under the root com.mycompany. For example, from com.mycompany.math import calculator allows developers to access relevant methods such as add or subtract. To achieve effective governance and collaboration, we realized that services built by Product teams deserved dedicated packages within the library (e.g., com.mycompany.geointelligence.geocoder), where those teams can review (or code!) the methods that interact with such services.
This is what we believe will work for us in the long run, so feel free to customize your library's structure to better suit your company's needs.
System setup
Concerning the operating system, we assume that you are using Windows; however, it is entirely possible to work on Unix (or Windows Subsystem for Linux) with minor adjustments.
As for our local environment, this is the shopping list:
Once installed, make sure that commands such as python are available in your PATH.
Then, for our cloud environments, we will need:
Finally, an essential aspect of our project lies in effective dependency management and packaging. Initially, we relied on the conventional pip and requirements.txt, alongside tools such as build, setuptools, and twine. However, we've since transitioned to using Poetry. While we may not need its full capabilities presently, we anticipate its value will become more apparent as our library and community grow; ultimately, it's your choice.
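For reference, here is a minimal sketch of what the Poetry section of pyproject.toml could look like for our layout. The name, versions, and dependencies are illustrative; the packages entry is what tells Poetry to pick up the com namespace from src/main:

[tool.poetry]
name = "my-test-libs"
version = "0.1.0"
description = "Shared Cloud and Data Engineering utilities"
authors = ["Cloud & Data Engineering Team"]
# include the "com" namespace found under src/main in the built package
packages = [{ include = "com", from = "src/main" }]

[tool.poetry.dependencies]
python = "^3.10"
requests = "^2.31"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"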
Project structure
We did some research on industry best practices to avoid rookie mistakes. Py-Pkgs-Cookiecutter was of great help in building the skeleton, not to mention Python Packages by Tomas Beuzen & Tiffany Timbers and Packaging Python. Now, when I open the library code in VS Code, I'm presented with this hierarchy:
mycompany-python-lib/
├── .gitignore
├── .gitlab-ci.yml
├── .pre-commit-config.yaml
├── .pypirc
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── poetry.lock
├── pyproject.toml
├── README.md
├── requirements.txt
├── dist/
│   ├── mycompany_python_lib-0.1.0-py3-none-any.whl
│   └── mycompany_python_lib-0.1.0.tar.gz
├── docs/
│   ├── changelog.md
│   ├── conduct.md
│   ├── conf.py
│   ├── contributing.md
│   ├── index.md
│   ├── make.bat
│   ├── Makefile
│   └── requirements.txt
├── env/
│   └── ...
├── scripts/
│   └── ...
└── src/
    ├── data/
    │   └── ...
    ├── main/
    │   └── com/
    │       └── mycompany/
    │           ├── math/
    │           │   ├── calculator.py
    │           │   ├── ...
    │           │   └── __init__.py
    │           ├── ...
    │           └── __init__.py
    └── tests/
        └── com/
            └── mycompany/
                ├── __init__.py
                ├── math/
                │   └── test_calculator.py
                └── ...
There's a lot to unpack here, so let's dive in. Starting with the files in the root:
- .gitignore lists the paths Git must not track;
- .gitlab-ci.yml defines the GitLab CI/CD pipeline;
- .pre-commit-config.yaml configures the pre-commit hooks;
- .pypirc holds the package index settings used for publishing;
- CHANGELOG.md, CONTRIBUTING.md, LICENSE, and README.md are the usual project documentation;
- pyproject.toml and poetry.lock are Poetry's project configuration and dependency lock file;
- requirements.txt is a pip-style dependency list.
Moving on to the folders:
- dist/ holds the built distributions (wheel and sdist);
- docs/ contains the documentation sources;
- env/ is the local virtual environment;
- scripts/ collects utility scripts;
- src/ hosts the test data (data/), the source code (main/), and the test suite (tests/).
Before delving into the code, let's clarify what lies within the src/main directory.
com.mycompany is a Namespace Package. While currently serving as the only entry point, its adoption enables future extensions to categorize objects under distinct identifiers. This flexibility would also simplify the distribution of sub-packages and modules across various independent packages - if you wanted to - thereby enhancing modularity and scalability. In the provided example, math represents a root Package, housing the calculator.py Module. We could also have sub-packages, such as a binary sub-package of math (com.mycompany.math.binary) with modules of its own. We use this strategy, for example, with the gcp package, which stores the sub-packages storage and bigquery, among others. Again, the organization of code can be tailored to suit your preferences.
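To make this concrete, here is how the layout above translates into imports. gcp.storage comes from our actual tree, while binary.converter is a hypothetical sub-package module used purely for illustration:

# a module inside the math root package
from com.mycompany.math import calculator

# a module inside a sub-package (hypothetical example)
from com.mycompany.math.binary import converter

# a cloud-specific sub-package from our tree
from com.mycompany.gcp import storage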
Development framework
For the purpose of this article, let's pretend that our newest peer created a module calculator.py with a simple sum method, then pushed it directly to main:
def my_first_method(a,b):
    together = a+b
    return a + b + 42
I have a lot of issues with this code:
1. it was pushed directly to main, bypassing any review;
2. it has no unit tests, which would have caught that stray + 42 (this is not a sum);
3. it lacks a docstring;
4. the style is off: a vague name, an unused together variable, inconsistent spacing;
5. it has no type hints.
Since our objective is to uphold standards that guarantee quality and prevent people from pushing code written right after a wild night at a rave, let's see what we can do to make sure this module never finds its way into the main branch.
Establishing coding guidelines
As part of our design process, we carefully identified the pain points we needed to address, determining how and where to tackle them and assessing their significance. Some issues were resolved through programming, while others, such as #1, required effective branch management strategies (see Part 2/2).
Let's start with unit testing. While comprehensive tests covering edge cases and expected behavior are essential, we recognized that enforcing them on every push wouldn't be ideal, and implementing a custom check for their existence felt unnecessary as well. We believe that the branching strategy, the code review process, and the CI/CD pipeline will do the trick. This three-pronged approach ensures untested code doesn't reach the main branch, as untested sections can be flagged by code reviewers and require adding tests before merging. The CI/CD pipeline can also play a role, potentially failing builds without sufficient test coverage. So, as long as there's a test suite, issue #2 is taken care of.
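As a reference point, src/tests/com/mycompany/math/test_calculator.py could contain something as simple as the following pytest sketch (it assumes the corrected sum function shown later in this article):

# test_calculator.py - a minimal pytest sketch for the calculator module
from com.mycompany.math import calculator


def test_sum() -> None:
    assert calculator.sum(2, 3) == 5
    assert calculator.sum(-1, 1) == 0
    assert calculator.sum(0, 0) == 0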
What we can automate instead is the enforcement of coding style. Fellow programmers will relate to this struggle: single quotes or double quotes? Tabs or spaces? CamelCase or snake_case? What's the maximum allowable line length? Well, wonder no more, because Ruff checks and formats the code to comply with the PEP 8 (Style Guide for Python Code) and PEP 257 (Python Docstring Conventions) standards, significantly enhancing documentation practices as well. Ruff can be configured to include or exclude specific rules, and we have decided to enable a check ensuring that the codebase adheres to Google-style documentation standards.
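As an illustration, a Ruff configuration along these lines in pyproject.toml enables the pycodestyle, Pyflakes, and pydocstyle rule sets and pins the docstring convention to Google style; the exact rule selection and line length are ours to tune:

[tool.ruff]
line-length = 88

[tool.ruff.lint]
# E/W = pycodestyle, F = Pyflakes, D = pydocstyle (docstring checks)
select = ["E", "W", "F", "D"]

[tool.ruff.lint.pydocstyle]
convention = "google"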
So, if I run ruff check on that single sample module, Ruff will provide feedback on how to solve issues #3 and #4, and it will help with issue #5 as well. With ruff check --fix and ruff format we can apply the safe fixes automatically, while the others will require manual review.
We have another valuable tool at our disposal for addressing issue #5: MyPy. Despite Python's inherent dynamic typing, we can take advantage of type hints, allowing MyPy to check the correctness of our code while doubling as documentation.
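A quick sketch of what MyPy buys us: once a signature is annotated, a call with the wrong types is flagged at check time rather than at runtime (the exact error wording may differ across MyPy versions):

def sum(a: int, b: int) -> int:
    return a + b


# mypy flags the call below:
# error: Argument 1 to "sum" has incompatible type "str"; expected "int"
result = sum("2", "50")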
By following the feedback provided by these tools, we can easily arrive at code that looks something like this:
def sum(a: int, b: int) -> int:
    """Computes the sum of two numbers.

    Args:
        a (int): The first number.
        b (int): The second number.

    Returns:
        int: The sum of `a` and `b`.
    """
    return a + b
This is code we are happy with: it works, it's tested, the style is consistent, and it's documented, which makes for a good development experience.
Finally, another pertinent concern that frequently arises is security. Certain routine tasks, such as API calls or SFTP connections, can pose risks if not handled correctly. It's unrealistic to assume that developers will be familiar with every potential threat, which is why we've added Bandit as a supplementary tool for identifying vulnerabilities. Consider this snippet:
import requests

url = "https://metadata.google.internal/computeMet..."
headers = {"Metadata-Flavor": "Google"}
response = requests.get(url, headers=headers)
response.raise_for_status()
While the code functions as expected, running Bandit alerts us to the absence of the timeout parameter - another instance highlighting an area for enhancement within our codebase.
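The remediation is small: passing an explicit timeout satisfies the check. A minimal sketch, with an illustrative value:

# an explicit timeout (in seconds; the value is illustrative) prevents the request from hanging forever
response = requests.get(url, headers=headers, timeout=10)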
However, unlike Ruff, we decided that Bandit warnings do not have to halt developers from committing their code. What we do, instead, is retain this feedback as it offers valuable insights into areas for refinement in subsequent releases. Part 2/2 will delve further into this aspect.
With this baseline to build upon, the only question now is: how do we enforce adherence to these standards?
Git Hooks to the rescue!
Git Hooks are customizable scripts that Git executes on the client side before or after specific actions, such as commit or push. They allow developers to automate tasks or enforce policies at various points in the Git workflow, and we are going to use them to enforce a run of both Ruff and MyPy before each commit. Rather than writing our own, however, we opted for the well-known pre-commit solution that comes with a good list of supported hooks right out of the box.
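A .pre-commit-config.yaml in this spirit wires up both tools via their official pre-commit repositories; the revs below are placeholders to pin to whatever versions you actually use:

# .pre-commit-config.yaml - a minimal sketch; pin revs to your chosen versions
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.0  # placeholder
    hooks:
      - id: ruff
      - id: ruff-format
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0  # placeholder
    hooks:
      - id: mypy

After that, a one-off pre-commit install registers the hooks in the local clone.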
Here is what happens in practice: after modifying a file, I attempt to commit it, and the Git hooks execute. In my case, MyPy flagged warnings, so the commit is unsuccessful and will continue to fail until I address them. Once these warnings are resolved, the code will be deemed suitable for pushing to either the feature or the development branch.
It's crucial not to go overboard with Git hooks. They should serve as tools that support developers, not drive them to frustration and reluctance to contribute to the codebase. Balance is key, and you will need to find your own sweet spot; otherwise, peers might be tempted to bypass the hooks' validation simply by running git commit --no-verify, or worse, skip the CI/CD altogether by writing [skip ci] in the commit message.
Build and deploy to Google Cloud Artifact Registry
In Part 2/2, we'll delve into the specific shell commands required for deployment. However, let's first establish the necessary setup for local development.
Firstly, authenticate with Application Default Credentials or any other suitable method (e.g., Service Accounts).
> gcloud auth application-default login
Next, set up the Google Cloud Artifact Registry. Take note of the project ID, chosen deployment location, and repository name, then execute the commands below to create the repository.
> $repository_location="europe-west8"
> $repository_name="my-python-libs"
> $project_id="my-gcp-project"
> $repository_url="https://$repository_location-python.pkg.dev/$project_id/$repository_name/"
> gcloud artifacts repositories create $repository_name --repository-format=python --location=$repository_location --description="My Python Repo"
> gcloud config set artifacts/repository $repository_name
> gcloud config set artifacts/location $repository_location
Generate your .pypirc file.
> gcloud artifacts print-settings python --project=$project_id --repository=$repository_name --location=$repository_location
Configure your Poetry target repository, then build and deploy!
> poetry self add keyrings.google-artifactregistry-auth
> poetry config repositories.gcp $repository_url
> poetry publish --build --repository gcp
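To verify the publish worked, you can install the package from the registry in a scratch environment; the keyring plugin for pip (keyrings.google-artifactregistry-auth) handles authentication here as well:

> pip install keyrings.google-artifactregistry-auth
> pip install --extra-index-url https://europe-west8-python.pkg.dev/my-gcp-project/my-python-libs/simple/ my-test-libs==0.1.0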
How to use
Once deployed, the library can be imported almost like any other library on PyPI. For instance, create a Google Cloud Function, then append the following lines to your requirements.txt file (my-test-libs is the name of the library specified in your pyproject.toml):
--extra-index-url https://europe-west8-python.pkg.dev/my-gcp-project/my-python-libs/simple/
my-test-libs==0.1.0
You can copy-paste this from the .pypirc file we generated earlier.
Make sure you or the Service Account have the roles needed to interact with the Artifact Registry. Then, within your main.py file, simply import and use the module as follows:
from com.mycompany.math import calculator


def my_cloud_function(request):
    # some code here
    print(calculator.sum(2, 50))
    # http resp here
That's it. We did it!
Conclusions
During a code review with a team member, I discovered by accident that one of our larger, well-structured clients had implemented a similar approach, with CI/CD practices and pre-commit hooks. While their adoption doesn't guarantee our success, it provides valuable external validation of our internal decision-making: we identified the need for improved quality control and a streamlined development workflow, and seeing a similar approach implemented by a respected organization reinforces the soundness of our reasoning. Plus, this realization brings us a sense of pride, as it indicates that our team's forward-thinking approach aligns with industry standards.
I believe the current library's scope will become somewhat broad, and it might be beneficial to consider splitting it into distinct packages for separate releases. This could leverage namespace packages or involve creating separate Git repositories with the same structure. I see this resulting in at least four packages in the short term: one for Google Cloud, one for Azure, one for AWS, and one for cross-cutting applications (e.g., ETL, data manipulation, data integration).
Speaking of areas for refinement, I need to mention the versioning of the library. Presently, we manage a manual X.Y.Z scheme, where X signifies major releases, Y denotes minor updates, and Z represents patches. Transitioning to automated versioning tools like semantic-release is definitely on the roadmap.
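For context, tools like semantic-release typically derive the version bump from conventional commit messages; a sketch of how that mapping usually works (the scopes and messages are illustrative):

fix(math): handle overflow in sum        -> patch release (0.1.0 -> 0.1.1)
feat(gcp): add BigQuery load helper      -> minor release (0.1.0 -> 0.2.0)
feat!: drop support for Python 3.8       -> major release (0.1.0 -> 1.0.0)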
Anyway, this initial milestone has been an enriching experience. The collaborative efforts with Riccardo have paid off, and now we're aiming to elevate this project into an official product with defined responsibilities and dedicated code maintainers, establishing a standard that will seamlessly integrate into our daily workflow.
Cheers!
Link to Part 2/2 → https://www.dhirubhai.net/pulse/build-your-own-corporate-python-library-cloud-data-part-rubini-f930f