Pipeline speedup
Often when you want to run some tests you first need to install the required dependencies: Python packages, Ruby gems, npm packages, and so on. Nowadays tests are executed automatically for pull (merge) requests in various CI systems, and dependency pulling is repeated again and again for every test run, which takes time. We cannot do much about it in cloud-based CI systems such as Travis CI, CircleCI, or GitHub Actions, but there is some room for improvement in self-hosted GitLab CI, Jenkins, and others.
Cache
The first idea that comes to mind is to enable caching, and it looks reasonable: why do we need to redownload the same files again and again? Wouldn't it be better to store them in some nearby storage? The answer is yes, but there are nuances. First of all, we should decide how our cache mechanism will be implemented. I would highlight the following options: a caching proxy, a cache in the local filesystem, and a combination of the two.
Caching proxy
Using a proxy makes sense if your organization consumes a significant amount of traffic from package repositories. A caching proxy reduces the amount of incoming traffic and speeds up downloads. Besides, a local caching proxy can play a very important role as a store for proprietary packages. There are several caching proxies available; I have worked with Sonatype Nexus and devpi.
Most of the time I work with Python, so the examples below use its packaging infrastructure. Python's standard package manager is pip, and to make it download packages from a caching proxy you have to specify an index URL in one of the following ways:
- pip.conf:

```ini
[global]
index-url = https://example.com/pypi/packages
```
- environment variable: PIP_INDEX_URL=https://example.com/pypi/packages
- command line argument: pip install -i https://example.com/pypi/packages
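For completeness, the same setting can also be managed from the shell with pip's config subcommand (available since pip 10); a small sketch using the example proxy URL from above:

```sh
# Persist the index URL into the user-level pip.conf
pip config set global.index-url https://example.com/pypi/packages

# Verify what pip will actually use
pip config get global.index-url

# Or override the index for a single install without touching any config
PIP_INDEX_URL=https://example.com/pypi/packages pip install requests
```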
Local cache
pip, gem, yum, and others don't download packages from the index URL right away. First they check whether the requested packages are already stored locally in predefined directories of the filesystem. Various package managers use different cache directories; pip on Linux stores downloads in $HOME/.cache/pip by default. And again, you can change that path via pip.conf, the environment variable PIP_CACHE_DIR, or the command line argument --cache-dir.
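If you want to see this cache in action locally, pip ships a cache subcommand (pip >= 20.1); a quick sketch, where /mnt/cache/pip is just an arbitrary example path:

```sh
# Show where pip keeps its cache on this machine
pip cache dir          # e.g. /home/user/.cache/pip

# List the wheels pip has already stored locally
pip cache list

# Point a single install at a different cache location
pip install --cache-dir /mnt/cache/pip requests
```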
Cache in pipelines
That was a sort of preamble; now we have the knowledge required to start optimizing a pipeline. Let's lay out the prerequisites:
- you use GitLab CI;
- the GitLab runner is configured to use the Kubernetes executor;
- the pipeline has a stage that installs Python packages:
```yaml
test:
  image: python
  stage: tests
  before_script:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
  script:
    - pytest
```
- there is a caching proxy that has the index URL https://example.com/pypi/packages;
- you have a PVC with shared access (see the sketch right after this list).
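"Shared access" here means an access mode of ReadWriteMany, so that concurrent builder pods can mount the volume simultaneously. A minimal sketch of such a claim, assuming your cluster has a storage class that supports this mode (the nfs-client name and the 10Gi size are assumptions; adjust to your cluster):

```sh
# Hypothetical PVC manifest; the storage class must support ReadWriteMany
# (e.g. an NFS-backed provisioner) for pods to share the volume
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: name_of_pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # assumption: adjust to your cluster
  resources:
    requests:
      storage: 10Gi
EOF
```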
First of all, let's specify the caching proxy index URL in .gitlab-ci.yml:
```yaml
test:
  variables:
    PIP_INDEX_URL: https://example.com/pypi/packages
```
Then we need to modify the GitLab Runner config file to mount the PVC into the builder pods:
```toml
[[runners]]
  executor = "kubernetes"
  # ...
  cache_dir = "/mnt/cache"
  [runners.kubernetes]
    # ...
    [runners.kubernetes.volumes]
      # ...
      [[runners.kubernetes.volumes.pvc]]
        name = "name_of_pvc"
        mount_path = "/mnt/cache"
```
It's a rather unobvious trick: according to the official documentation, the cache_dir setting applies only to the Shell, Docker, and SSH executors, and with Kubernetes you are supposed to use an S3 backend for storing the cache. I found this trick in the Kubernetes cache support issue.
And the final step is to modify the pipeline stage config:
```yaml
test:
  stage: tests
  image: python
  cache:
    # that's the place where we tell GitLab to treat this directory as a cache
    # and properly handle its contents
    paths:
      - .cache/pip
  variables:
    # Change pip's cache directory to be inside the project directory,
    # since we can only cache local items.
    PIP_CACHE_DIR: $CI_PROJECT_DIR/.cache/pip
  before_script:
    - pip install -U pip setuptools
    - pip install -r requirements.txt
  script:
    - pytest
```
Now in the pipeline log you should see something like this:
```
Restoring cache
Checking cache for default...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
Downloading artifacts
Running before_script and script
$ pip install -U pip setuptools
Collecting pip
  # we don't download the package but use the one already downloaded in a prior run
  Using cached https://example.com/pypi/packages/pip-20.1.1-py2.py3-none-any.whl
...
Saving cache
Creating cache default...
.cache/pip: found 1505 matching files
No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally.
Created cache
```
In the end, we get the following package flow: a build first looks for packages in the local cache on the mounted PVC, falls back to the caching proxy on a miss, and only goes out to the upstream package index when the proxy doesn't have the package either.
In some cases, enabling the cache can speed up pipeline execution several times over, because pip caches not only downloaded packages but also the wheels it builds from source.
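You can observe the effect locally with the same project; a rough sketch (the timings depend entirely on your dependencies):

```sh
# First run: everything is downloaded, and sdists are compiled into wheels
time pip install -r requirements.txt

# Simulate a fresh pipeline run: remove the packages but keep pip's cache
pip uninstall -y -r requirements.txt

# Second run: pip reports "Using cached ..." and skips downloads and builds
time pip install -r requirements.txt
```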