Pipeline speedup
Often when you want to run some tests you first need to install the required dependencies: Python packages, Ruby gems, npm packages, and so on. Nowadays tests are executed automatically for pull (merge) requests in various CI systems, and dependency pulling is repeated again and again for every test run, which takes time. We cannot do much about it in cloud-based CI systems such as Travis CI, CircleCI, or GitHub Actions, but there is some room for improvement in self-hosted GitLab CI, Jenkins, and others.
Cache
The first idea that comes to mind is to enable caching, and it looks reasonable: why do we need to redownload the same files again and again? Wouldn't it be better to store them in some nearby storage? The answer is yes, but there are nuances. First of all, we should decide how our cache mechanism will be implemented. I would highlight the following options: a caching proxy, a cache in the local filesystem, and a combination of the two.
Caching proxy
Using a proxy makes sense if your organization consumes a significant amount of traffic from package repositories. A caching proxy reduces the amount of incoming traffic and speeds up downloads. Besides, a local caching proxy can play a very important role as a store for proprietary packages. There are several caching proxies available; I have worked with Sonatype Nexus and devpi.
Most of the time I work with Python, so the examples below use its packaging infrastructure. Python's standard package manager is pip, and to make it download packages from a caching proxy you have to specify an index URL in one of the following ways:
- pip.conf:

```ini
[global]
index-url = https://example.com/pypi/packages
```
- environment variable: PIP_INDEX_URL=https://example.com/pypi/packages
- command line argument: pip install -i https://example.com/pypi/packages
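For completeness, the same setting can also be managed from the shell with pip's config subcommand (available since pip 10); a small sketch using the example proxy URL from above:

```sh
# Persist the index URL into the user-level pip.conf
pip config set global.index-url https://example.com/pypi/packages

# Verify what pip will actually use
pip config get global.index-url

# Or override the index for a single install without touching any config
PIP_INDEX_URL=https://example.com/pypi/packages pip install requests
```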
Local cache
pip, gem, yum, and others don't download packages from the index URL right away. First they check whether the requested packages are already stored locally in predefined directories of the filesystem. Various package managers use different cache directories; pip on Linux stores downloads in $HOME/.cache/pip by default. And again, you can change that path via pip.conf, the environment variable PIP_CACHE_DIR, or the command line argument --cache-dir.
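If you want to see this cache in action locally, pip ships a cache subcommand (pip >= 20.1); a quick sketch, where /mnt/cache/pip is just an arbitrary example path:

```sh
# Show where pip keeps its cache on this machine
pip cache dir          # e.g. /home/user/.cache/pip

# List the wheels pip has already stored locally
pip cache list

# Point a single install at a different cache location
pip install --cache-dir /mnt/cache/pip requests
```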
Cache in pipelines
That was a sort of preamble; now we have the knowledge required to start optimizing a pipeline. Let's lay out the prerequisites:
- you use GitLab CI;
- the GitLab runner is configured to use the Kubernetes executor;
- the pipeline has a stage that installs Python packages:
```yaml
test:
  image: python
  stage: tests
  before_script:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
  script:
    - pytest
```
- there is a caching proxy that has the index URL https://example.com/pypi/packages;
- you have a PVC with shared access (see the sketch right after this list).
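"Shared access" here means an access mode of ReadWriteMany, so that concurrent builder pods can mount the volume simultaneously. A minimal sketch of such a claim, assuming your cluster has a storage class that supports this mode (the nfs-client name and the 10Gi size are assumptions; adjust to your cluster):

```sh
# Hypothetical PVC manifest; the storage class must support ReadWriteMany
# (e.g. an NFS-backed provisioner) for pods to share the volume
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: name_of_pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # assumption: adjust to your cluster
  resources:
    requests:
      storage: 10Gi
EOF
```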
First of all, let's specify the caching proxy index URL in .gitlab-ci.yml:
```yaml
test:
  variables:
    PIP_INDEX_URL: https://example.com/pypi/packages
```
Then we need to modify the GitLab Runner config file to mount the PVC into the builder pods:
```toml
[[runners]]
  executor = "kubernetes"
  # ...
  cache_dir = "/mnt/cache"
  [runners.kubernetes]
    # ...
    [runners.kubernetes.volumes]
      # ...
      [[runners.kubernetes.volumes.pvc]]
        name = "name_of_pvc"
        mount_path = "/mnt/cache"
```
It's a rather unobvious trick: according to the official documentation, the cache_dir setting applies only to the Shell, Docker, and SSH executors, and with Kubernetes you are supposed to use an S3 backend for storing the cache. I found this trick in the Kubernetes cache support issue.
And the final step is to modify the pipeline stage config:
```yaml
test:
  stage: tests
  image: python
  cache:
    # that's the place where we tell GitLab to treat this directory as a cache
    # and properly handle its contents
    paths:
      - .cache/pip
  variables:
    # Change pip's cache directory to be inside the project directory,
    # since we can only cache local items.
    PIP_CACHE_DIR: $CI_PROJECT_DIR/.cache/pip
  before_script:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
  script:
    - pytest
```
Now in the pipeline log you should see something like this:
```
Restoring cache
Checking cache for default...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
Downloading artifacts
Running before_script and script
$ pip install -U pip setuptools
Collecting pip
  # we don't download the package but use the one already downloaded in a prior run
  Using cached https://example.com/pypi/packages/pip-20.1.1-py2.py3-none-any.whl
...
Saving cache
Creating cache default...
.cache/pip: found 1505 matching files
No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally.
Created cache
```
In the end, we get the following package flow: a build first looks for packages in the local cache on the mounted PVC, falls back to the caching proxy on a miss, and only goes out to the upstream package index when the proxy doesn't have the package either.
In some cases, enabling the cache can speed up pipeline execution several times over, because pip caches not only downloaded packages but also the wheels it builds from source.
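You can observe the effect locally with the same project; a rough sketch (the timings depend entirely on your dependencies):

```sh
# First run: everything is downloaded, and sdists are compiled into wheels
time pip install -r requirements.txt

# Simulate a fresh pipeline run: remove the packages but keep pip's cache
pip uninstall -y -r requirements.txt

# Second run: pip reports "Using cached ..." and skips downloads and builds
time pip install -r requirements.txt
```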