Build Your Own! A corporate Python library for Cloud and Data Engineering (Part 1/2)

Introduction

Imagine a fast-paced tech company experiencing exponential growth: headcount expands, clientele grows, and the project portfolio flourishes. This rapid growth can easily outpace the evolution of internal processes, leading to duplicated effort among developers; independently and unknowingly, they may find themselves reinventing the wheel. The result is wasted time and resources and a patchwork of inconsistent code that complicates maintenance and increases the risk of bugs and errors.

This fragmentation arises from a lack of centralized knowledge sharing and collaboration, a challenge we experienced firsthand in our Cloud and Data Engineering Team. To address this, Riccardo Rubini, PhD and I embarked on a journey to establish a shared codebase of reusable components and processes. This involved building a Python library, enforcing coding standards, and setting up a continuous integration and continuous delivery (CI/CD) pipeline in GitLab.


In this series of hands-on articles, we will delve into the process of:

  • Building a Python library from scratch and enforcing standards (Part 1/2)
  • Implementing a CI/CD pipeline in GitLab (Part 2/2) - LINK

By following these steps, we aim to effectively eliminate redundant code, promote consistency, and foster a more efficient and collaborative environment.

You can find the code snippets mentioned throughout this article in the GitLab Snippets collection "Build Your Own! A corporate Python library for Cloud and Data Engineering (LinkedIn Article)" ($3686285).

Project requirements and context

There's a lot of routine in our work: connecting to an SFTP server, interacting with known platforms' APIs (e.g., Salesforce, Airship) and cloud SDKs, or transforming data for ingestion. Initially, we considered a shared Git repository as a space for pushing code snippets, but we fell short of making this a habit. Code ended up scattered across GitHub, GitLab, Azure DevOps, and even our personal repos. And let's not even delve into the state of documentation - there was none. As mentioned earlier, our processes lagged behind, and given our fast-paced environment, onboarding new hires to this system became an afterthought, with some never truly getting acquainted. Slowly but surely, we found ourselves back at square one.

While there's room for creativity, and despite the occasional flair for chaos by some who use Java, C#, or Node.js (hello, that's me), our Data and Cloud Engineering Team predominantly writes code in Python, which made it our natural starting point. For the Git repository, we chose GitLab due to its open-source nature, its CI/CD features, and our history with it.

This time, however, we took it a step further: we created a single repository with a CI/CD pipeline, enforced stringent coding rules, and established disciplined branch management. Finally, to encourage widespread use across projects and clouds by our peers, we created a distributable package. With a growing team, maintaining quality demands structure; free rein was no longer an option.

Having said that, believe me when I say we could have been even more stringent. I personally toyed with the idea of using VS Code with the Dev Containers extension to make things smoother than smooth. However, not everyone's on the VS Code train (crazy, I know!), and I don't think forcing an IDE on anyone is going to win hearts, so we're not diving into evil territory just yet.

Solution overview

Since we want this library to foster both intuitive use and seamless collaboration, ultimately creating a more efficient and enjoyable development experience, we deliberately designed the structure of the packages inside.

For intuitive navigation, we placed domain-specific packages under the root com.mycompany. For example, from com.mycompany.math import calculator allows developers to access relevant methods such as add or subtract. To achieve effective governance and collaboration, we realized that services built by Product teams deserved dedicated packages within the library (e.g., com.mycompany.geointelligence.geocoder), where those teams can review (or write!) the methods that interact with their services.

This is what we believe will work for us in the long run, so feel free to customize your library's structure to better suit your company's needs.

System setup

Concerning the operating system, we assume that you are using Windows; however, it is entirely possible to work on Unix (or the Windows Subsystem for Linux) with minor adjustments.

As for our local environment, this is the shopping list:

  • Python. We used Python 3.11, but any currently supported version will be fine. We will list the dependencies we need later in this article.
  • An IDE, such as VS Code. We also suggest installing some useful extensions, such as Python for VS Code, Git Graph and/or GitLens, a Markdown linter, and the Google Cloud Code extension.
  • Git.
  • Google Cloud CLI.

Once everything is installed, make sure commands such as python are available in your PATH.

Then, for our cloud environments, we will need:

  • A GitLab account with a Git repository.
  • A Google Cloud project (the free tier will do).

Finally, an essential aspect of our project lies in effective dependency management and packaging. Initially, we relied on the conventional pip and requirements.txt, alongside tools such as build, setuptools, and twine. However, we've since transitioned to using Poetry. While we may not need its full capabilities presently, we anticipate its value will become more apparent as our library and community grow; ultimately, it's your choice.

Project structure

We did some research on industry best practices in order to avoid rookie mistakes. Py-Pkgs-Cookiecutter was a great help in building the skeleton, not to mention Python Packages by Tomas Beuzen & Tiffany Timbers and Packaging Python. Now, when I open the library code in VS Code, I'm presented with this hierarchy:

mycompany-python-lib/
├── .gitignore
├── .gitlab-ci.yml
├── .pre-commit-config.yaml
├── .pypirc
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── poetry.lock
├── pyproject.toml
├── README.md
├── requirements.txt
├── dist/
│   ├── mycompany_python_lib-0.1.0-py3-none-any.whl
│   └── mycompany_python_lib-0.1.0.tar.gz
├── docs/
│   ├── changelog.md
│   ├── conduct.md
│   ├── conf.py
│   ├── contributing.md
│   ├── index.md
│   ├── make.bat
│   ├── Makefile
│   └── requirements.txt
├── env/
│   └── ...
├── scripts/
│   └── ...
└── src/
    ├── data/
    │   └── ...
    ├── main/
    │   └── com/
    │       └── mycompany/
    │           ├── math/
    │           │   ├── calculator.py
    │           │   ├── ... 
    │           │   └── __init__.py
    │           └── ...
    │               ├── ...
    │               └── __init__.py
    └── tests/
        └── com/
            └── mycompany/
                ├── __init__.py
                ├── math/
                │   └── test_calculator.py
                └── ...
                    └── ...        

There's a lot to unpack here, so let's dive in.

  • .gitignore specifies intentionally untracked files that Git should ignore. It helps ensure that sensitive or unnecessary files are not included in version control.
  • .gitlab-ci.yml contains the configuration for GitLab CI/CD pipelines. It defines the stages, jobs, and commands to be executed during the continuous integration and deployment process on GitLab. We will create this file in Part 2/2.
  • .pre-commit-config.yaml configures pre-commit hooks for the project. Pre-commit hooks are scripts that run before or after some git commands to enforce coding standards, perform static analysis, or execute other checks. (find a sample in the shared snippets)
  • .pypirc is used to configure credentials for uploading packages to Python package repositories such as PyPI; in our case, the Google Cloud Artifact Registry.
  • CHANGELOG.md stores a log of changes made to the project, organized by version. It provides a record of updates, bug fixes, and new features for users and developers.
  • CONTRIBUTING.md outlines guidelines and instructions for contributing to the project. It typically covers topics such as how to report issues, how to propose new features, and coding standards for contributions.
  • LICENSE is a file that contains the software license under which the project is distributed. It specifies the terms and conditions under which users are allowed to use, modify, and distribute the software.
  • poetry.lock is generated by the Poetry dependency manager and locks the dependencies to specific versions. It ensures that all installations of the project use the same versions of dependencies, providing reproducible builds.
  • pyproject.toml contains project metadata and configuration options for Poetry and other tools. It specifies dependencies, project metadata (such as name and version), and other settings. (find a sample in the shared snippets)
  • README.md serves as the main documentation and introduction to the project. It typically contains information about the project, how to install it, how to use it, and other relevant details.
  • requirements.txt lists the project dependencies in a format compatible with pip. It specifies the Python packages required to run the project, along with their versions. Since we are using Poetry for dependencies, this file is only needed to install Poetry itself.

Moving to folders,

  • The dist/ directory holds source distributions (sdist) and built distribution files (wheel), both created automatically by running poetry build unless the project configurations dictate otherwise.
  • The docs/ directory contains essential documentation alongside the necessary scripts for Sphinx. The usage of Sphinx isn't covered within this article, but you can refer to the getting-started tutorial available in the official documentation.
  • The folder env/ holds the Python virtual environment. This practice is essential for preventing dependency conflicts.
  • The directory scripts/ contains various scripts related to project automation or utility tasks, as well as custom scripts to run with pre-commit hooks.
  • The src/main/ directory is dedicated to storing the main codebase of the project. On the other end, the src/tests/ directory is designated for storing the test code corresponding to the main code. This separation ensures a clear distinction between the actual project code and its associated test suite.

Before delving into the code, let's clarify what lies within the src/main directory.

com.mycompany is a Namespace Package. While currently serving as the only entry point, its adoption enables future extensions to categorize objects under distinct identifiers. This flexibility would also simplify the distribution of sub-packages and modules across various independent packages - if you wanted to - thereby enhancing modularity and scalability. In the provided example, math represents a root Package, housing the calculator.py Module. We could also have sub-packages such as com.mycompany.math.binary (holding, say, a module.py), where binary is a sub-package of math. We use this strategy, for example, with the gcp package that stores the sub-packages storage and bigquery, among others. Again, the organization of code can be tailored to suit your preferences.
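As a quick, hedged illustration of how this layout translates into imports (the package names come from the examples above; sum() is the corrected method shown later in this article):

# 'com' and 'com.mycompany' are namespace packages, while 'math', 'gcp', etc.
# are regular packages with their own __init__.py.
from com.mycompany.math import calculator          # root package + module
from com.mycompany.gcp import bigquery, storage    # sub-packages of the gcp package

result = calculator.sum(2, 50)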

Development framework

For the purpose of this article, let's pretend that our newest peer created a module calculator.py with a simple sum method, then pushed it directly to main:

def my_first_method(a,b):
  together   = a+b
  return a + b + 42        

I have a lot of issues with this code:

  1. They pushed directly to main...
  2. The sum operation is incorrect and should have been caught by a test suite - but there isn't one.
  3. Unused variables clutter the code.
  4. The formatting is inconsistent.
  5. The lack of documentation hampers readability and maintainability.
  6. They pushed directly to main!!

Since our objective is to uphold standards that guarantee quality and prevent people from pushing code written right after a wild night at a rave, let's see what we can do to make sure that this module does not find its way into the main branch.

Establishing coding guidelines

As part of our design process, we carefully identified the pain points we needed to address, determining how and where to tackle them and assessing their significance. Some issues were resolved through programming, while others, such as #1, required effective branch management strategies (see Part 2/2).

Let's start with unit testing. While comprehensive tests covering edge cases and expected behavior are essential, we recognized that enforcing them on every push wouldn't be ideal, and implementing a custom check for their existence felt unnecessary as well. We believe that the branching strategy, the code review process, and the CI/CD pipeline will do the trick. This three-pronged approach ensures untested code doesn't reach the main branch, as untested sections can be flagged by code reviewers and require adding tests before merging. The CI/CD pipeline can also play a role, potentially failing builds without sufficient test coverage. So, as long as there's a test suite, issue #2 is taken care of.
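For reference, a minimal test suite for the calculator module could look like the sketch below, living under src/tests/com/mycompany/math/test_calculator.py. It is only a sketch: it assumes the corrected sum() method shown later in this article and pytest as the test runner (any runner that discovers test_*.py files would work just as well).

# test_calculator.py - a minimal pytest-style sketch for the calculator module.
from com.mycompany.math import calculator


def test_sum_of_two_positive_integers():
  assert calculator.sum(2, 50) == 52


def test_sum_with_negative_numbers():
  assert calculator.sum(-3, 1) == -2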

What we can automate instead is the enforcement of coding style. Fellow programmers will relate to this struggle: single quotes or double quotes? Tabs or spaces? CamelCase or snake_case? What's the maximum allowable line length? Well, wonder no more, because Ruff checks and formats the code to comply with the PEP 8 (Style Guide for Python Code) and PEP 257 (Docstring Conventions) standards, which significantly improves documentation practices as well. Ruff can also be configured to include or exclude specific rules, and we decided to enable a check to ensure that the codebase adheres to Google-style docstring standards.

So, if I run ruff check on that single sample module, Ruff will provide feedback on how to solve issues #3 and #4, and it helps with issue #5 too. Running ruff check --fix and ruff format applies the safe fixes automatically, while the remaining findings require manual review.

We have another valuable tool at our disposal for addressing issue #5: MyPy. Despite Python's inherent dynamic typing, we can take advantage of type hints, thereby allowing MyPy to check the correctness of our code while, at the same time, providing documentation.
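As a small, hedged illustration of what this buys us (assuming the annotated sum() method shown just below), MyPy can reject calls that a purely dynamic codebase would only catch at runtime:

from com.mycompany.math import calculator

total: int = calculator.sum(2, 50)  # OK: matches the int annotations
# calculator.sum("2", 50)           # MyPy reports roughly: argument 1 has
#                                   # incompatible type "str"; expected "int"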

By following the feedback provided by these tools, we should easily get to code that looks something like this:

def sum(a: int, b: int) -> int:
  """Computes the sum of two numbers.

  Args:
    a (int): The first number.
    b (int): The second number.

  Returns:
    int: The sum of `a` and `b`.
  """
  return a + b        

This is code we are happy with: it works, it's tested, the style is consistent, and it's documented, all of which makes for a good development experience.

Finally, another pertinent concern that frequently arises is security. Certain routine tasks, such as API calls or SFTP connections, can pose risks if not handled correctly. It's unrealistic to assume that developers will be familiar with every potential threat, which is why we've added Bandit as a supplementary tool for identifying vulnerabilities. Consider this snippet:

url = "https://metadata.google.internal/computeMet..."
headers = {"Metadata-Flavor": "Google"}
response = requests.get(url, headers=headers)
response.raise_for_status()        

While the code functions as expected, running Bandit alerts us to the absence of the timeout parameter, another example of the kind of improvement area these tools surface in our codebase.
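A possible remediation is simply passing an explicit timeout; the 10-second value below is an arbitrary choice, and the URL is the same truncated one from the snippet above:

import requests

url = "https://metadata.google.internal/computeMet..."
headers = {"Metadata-Flavor": "Google"}
# An explicit timeout keeps the call from hanging indefinitely and satisfies Bandit.
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()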

However, unlike Ruff's, we decided that Bandit's warnings do not have to block developers from committing their code. Instead, we retain this feedback, as it offers valuable insight into areas for refinement in subsequent releases. Part 2/2 will delve further into this aspect.

With this baseline to build upon, the only question now is: how do we enforce adherence to these standards?

Git Hooks to the rescue!

Git Hooks are customizable scripts that Git executes on the client side before or after specific actions, such as commit or push. They allow developers to automate tasks or enforce policies at various points in the Git workflow, and we are going to use them to run both Ruff and MyPy before each commit. Rather than writing our own, however, we opted for the well-known pre-commit framework, which comes with a good list of supported hooks right out of the box.

Look at the example below: after making modifications to a file, I proceeded to commit it, prompting Git hooks to execute:

  • My pre-commit file includes formatting nuances, such as ensuring there's a newline at the end of the file. These formatting fixes are applied automatically.
  • Similarly, Ruff enforces standards by formatting the file accordingly.
  • Lastly, MyPy notifies me about an attempt to call the sum() method without passing the expected inputs.

Hence, this commit is unsuccessful and will continue to fail until I address the warnings flagged by MyPy. Once these warnings are resolved, the code will be deemed suitable for pushing to either the feature or development branch.
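For context, the offending call that MyPy flagged was along these lines (a hypothetical reconstruction, not the actual diff):

from com.mycompany.math import calculator

# Missing the second argument: MyPy reports something like
# 'Missing positional argument "b" in call to "sum"', so the pre-commit run fails.
calculator.sum(2)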

It's crucial to avoid going overboard with Git hooks. They should serve as tools that support developers, not drive them to the point of frustration and reluctance to contribute to the codebase. Balance is key, and you will need to find your own sweet spot; otherwise, peers might be tempted to bypass the hooks' validation simply by running git commit --no-verify or, worse, skip the CI/CD altogether by writing [skip ci] in the commit message.

Build and deploy to Google Cloud Artifact Registry

In Part 2/2, we'll delve into the specific shell commands required for deployment. However, let's first establish the necessary setup for local development.

Firstly, authenticate with Application Default Credentials or any other suitable method (e.g., Service Accounts).

> gcloud auth application-default login        
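If you want a quick sanity check from Python (assuming the google-auth package is installed in your environment), the default credentials should now resolve without errors:

# Optional: verify that Application Default Credentials are discoverable.
import google.auth

credentials, project_id = google.auth.default()
print(f"Authenticated; default project: {project_id}")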

Next, set up the Google Cloud Artifact Registry. Take note of the project ID, chosen deployment location, and repository name, then execute the commands to create the repository.

> $repository_location="europe-west8"
> $repository_name="my-python-libs"
> $project_id="my-gcp-project"
> $repository_url="https://$repository_location-python.pkg.dev/$project_id/$repository_name/"
> gcloud artifacts repositories create $repository_name --repository-format=python --location=$repository_location --description="My Python Repo"
> gcloud config set artifacts/repository $repository_name
> gcloud config set artifacts/location $repository_location        

Generate your .pypirc file.

> gcloud artifacts print-settings python --project=$project_id --repository=$repository_name --location=$repository_location        

Configure your Poetry target repository, then build and deploy!

> poetry self add keyrings.google-artifactregistry-auth
> poetry config repositories.gcp $repository_url
> poetry publish --build --repository gcp        

How to use

Upon deployment, the library can be imported much like any other library on PyPI. For instance, create a Google Cloud Function, then append the following lines to your requirements.txt file (my-test-libs is the name of the library specified in your pyproject.toml):

--extra-index-url https://europe-west8-python.pkg.dev/my-gcp-project/my-python-libs/simple/

my-test-libs==0.1.0        

You can copy-paste this from the .pypirc file we generated earlier.

Make sure you or the Service Account have the roles needed to interact with the Artifact Registry. Then, within your main.py file, simply import and use the module as follows:

from com.mycompany.math import calculator

def my_cloud_function(request):
  # some code here
  print(calculator.sum(2, 50))
  # http resp here
  return "OK", 200

That's it. We did it!

Conclusions

During a code review with a team member, I discovered by accident that one of our larger, well-structured clients had implemented a similar approach with CI/CD practices and pre-commit hooks, which makes for an interesting parallel to our own efforts. Their adoption doesn't guarantee our success, but it does provide valuable external validation that resonates with our internal decision-making. We identified the need for improved quality control and a streamlined development workflow, and seeing a similar approach implemented by a respected organization reinforces the soundness of our reasoning. Plus, this realization brings us a sense of pride, as it indicates that our team's forward-thinking approach aligns with industry standards.

I believe the current library's scope will become somewhat broad, and it might be beneficial to consider splitting it into distinct packages with separate releases. This could leverage namespace packages or involve creating separate Git repositories with the same structure. I see this resulting in at least four packages in the short term: one for Google Cloud, one for Azure, one for AWS, and one for cross-cutting applications (e.g., ETL, data manipulation, data integration).

Speaking of areas for refinement, I need to mention the versioning of the library. Presently, we manage a manual X.Y.Z scheme, where X signifies major releases, Y denotes minor updates, and Z represents patches. Transitioning to automated versioning tools like semantic-release is definitely on the roadmap.

Anyway, this initial milestone has been an enriching experience. The collaborative efforts with Riccardo have paid off, and now we're aiming to elevate this project into an official product with defined responsibilities and dedicated code maintainers, establishing a standard that will seamlessly integrate into our daily workflow.

Cheers!
