Marvelous MLOps #21: CI/CD for MLOps on GitLab (part 1)
Code your way to your first CI pipeline
This article explains the need for CI pipelines as a part of CI/CD practices. First I’ll share my thoughts on why they are so useful and what their added value is. Then I’ll show you how to build your first simple CI pipeline using GitLab.
Why I love CI so much, and why you should too
CI pipelines are a key step, and usually the first step, in your automated deployment. And holy popcorn Batman, are they great! Once you go CI, you’ll never go back. You will ask yourself “how could I ever work without this?”. Let me break down the advantages for you:
1. Consistency: Automation ensures consistent execution, avoiding human errors.
2. Speed: Automation speeds up the ML lifecycle. Automated pipelines for preprocessing, model (re)training, testing, and deployment are much faster than manual processes. Eventually this will leave more time and headspace for the creative solutions that add value.
3. Scalability: Automation makes it easier to scale processes up or down as needed, without significant manual effort. You also want to build well-designed pipelines where you can just adjust the configuration parameters and voilà, your ML runs at scale!
4. Version Control: Automated pipelines integrate with version control systems, making it easier to track changes, collaborate with team members, and roll back to previous versions. You can connect your pipelines to all kinds of version control conditions: on a push, on a certain branch, on a certain merge request, etc. Customization here is only limited by your creativity, the sky’s the limit!
5. Reproducibility: Automated pipelines record all the steps and parameters used during model deployment. Together with a data snapshot, this makes it possible to recreate the exact same model. This can be crucial for fixing problems and in some cases auditing.
6. Testing, Quality Assurance and Validation: Automated pipelines can include comprehensive testing and validation steps. This helps catch issues early in the development process, ensuring high quality. You can write your own tests or use existing tooling, for example in the form of pre-commit hooks. Check out my article on pre-commit hooks.
Are you convinced yet?
Okay chill Tom Cruise, no need to shout. Let’s check out the code.
The code
GitLab CI/CD is a powerful tool that allows you to build your own customised CI/CD pipelines. In this part, we’ll be building a CI pipeline. I will walk you through a simple GitLab CI configuration file for a Python project, focusing on its various stages and jobs. A GitLab CI configuration file usually lives in the root of your repository as .gitlab-ci.yml. It is a YAML file that defines your GitLab pipeline. For more information please see the documentation on GitLab CI/CD.
In this ML project repository I’ve included a GitLab CI configuration file. Its mere presence will create and trigger a CI pipeline on every code push (since I haven’t defined any other conditional statements for triggering). You can find the full CI configuration file here or check out its full contents at the bottom of this article.
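To give you an idea of what such conditional statements could look like, here is a minimal sketch. It is not part of the repository's configuration; it only illustrates GitLab's workflow rules with two predefined variables, and which conditions make sense is up to you.

```yaml
# Hypothetical sketch, not taken from the project repository.
# Without a workflow block, a pipeline is created on every push.
workflow:
  rules:
    # Create a pipeline for merge request events.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Create a pipeline for pushes to the default branch.
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    # Anything that matches no rule does not start a pipeline.
```

With a block like this, a push to a feature branch without a merge request would no longer start a pipeline.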
I do not want you to look at the actual ML Python code too much! That is why it is just one preprocessing function with its unit test. We will build a full project repo with all the bells and whistles in a future article. For now we’ll just focus on the CI.
Let me explain the CI configuration file step by step, alternating snippets of code with explanations. Mind you, all these snippets should actually be concatenated together in one YAML file! You can find the full file at the end of the article or in the repository.
The start of our CI configuration
image: python:3.11
stages:
  - test
  - package
  - docker
services:
  - docker:20.10.17-dind
The start of the file defines the building blocks that we are going to use in our pipeline. There are some optional ones and some mandatory ones. Every pipeline runs in a container! The rest is up to you and your use case.
image: python:3.11: This line specifies the base Docker image to be used for the CI/CD pipeline’s environment. In this case, it is a Linux image (the standard on GitLab) with Python 3.11 installed. This will be our runtime environment.
stages: This section defines the different stages of the pipeline. We have three stages: test, package, and docker. Each stage represents a phase in the development and deployment process. There are some different conventions for this, but please organise it in a way that works for you and your teams! As we say in Dutch “it’s your party!”.
services: Here, you can specify any additional services needed during the CI/CD process. In this case, we will be using Docker as a service with version 20.10.17-dind (Docker in Docker). This allows us to build Docker images within our CI/CD pipeline, which we will want to do in the last job of the Docker stage.
The stages of the pipeline run in sequence, in the order listed under stages. Jobs within the same stage can run in parallel if runners are available, but we are not going to get into that for now. By default, if a job fails, the jobs in later stages will not start and the pipeline is reported as “failed”. If all jobs are successful, the pipeline will have “passed”.
Note that before each job that requires pip I like to upgrade pip. Upgrading pip is important because it ensures that you have access to the latest features and bug fixes. Additionally, upgrading pip can help you avoid compatibility issues with other packages and dependencies. This will make your pipeline more robust!
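To make the ordering concrete, here is a minimal sketch with two made-up jobs. The job names and the echo commands are assumptions for illustration only, not the jobs from the repository. It relies on the stages list defined at the start of the file.

```yaml
# Hypothetical jobs, only to illustrate how stages are ordered.
first-job:
  stage: test
  script:
    - echo "I run in the test stage"

second-job:
  stage: package
  script:
    # By default this job starts only after every job in the
    # test stage has passed.
    - echo "I run in the package stage"
```

If first-job exits with a non-zero status, second-job does not start and the pipeline is reported as failed.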
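As a rough sketch of what that looks like in a job: the job name, the requirements.txt file and the pytest command below are assumptions for illustration, so adapt them to your own project.

```yaml
# Hypothetical job, shown only to illustrate the pip upgrade step.
example-test-job:
  stage: test
  before_script:
    # Upgrade pip first, then install the project dependencies.
    - python -m pip install --upgrade pip
    # Assumes a requirements.txt that includes pytest.
    - pip install -r requirements.txt
  script:
    - pytest
```

Calling pip through python -m pip makes sure the upgrade applies to the same interpreter that runs the rest of the job.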
Now, let’s dive into the individual jobs within each stage.