Why are open bioinformatics pipelines so important for genomic surveillance?
The GSU's bioinformatics pipelines transform raw genomic data into useful information for public health.

At the Genomic Surveillance Unit, we believe that making our science as accessible as possible is the best way to maximise its positive impact. That is why we are working to make our bioinformatics pipelines as open and reproducible as possible.

What is a pipeline?

Raw genomic sequence data goes in one end of a pipeline; out the other end comes analysed data that gives useful scientific answers.

The raw data files are produced by sequencing machines, often large and expensive, which read the DNA base pairs. But the pipelines analysing this data are built from computer code and can be run on a laptop or in the cloud. That said, processing large amounts of complex data through those pipelines, as we do at the Wellcome Sanger Institute, may require more powerful machines, and sometimes many of them!

Running NovaSeq 6000 sequencing machines at the Wellcome Sanger Institute. (Credit: Greg Moss / Wellcome Sanger Institute)

A series of software algorithms takes the raw input data through several steps that transform it into something more usable. Some steps ‘clean’ the data, for example by filtering it or translating it into a useful format. A whole series of steps might tackle a larger problem, such as determining the species to which the genetic data belongs. There will also be steps that ensure quality, like checking that enough of the right data has been generated to answer the question. At the end of a pipeline, there may be a step to bring several outputs together in a summarised or visual format, providing clarity for end users.
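
To make the idea of chained steps concrete, here is a minimal Python sketch of a toy pipeline with a cleaning step, a filtering step, a quality check and a final summary. It is an illustration only: the step names, thresholds and example reads are invented here, and the GSU's real pipelines are written in Nextflow and do far more.

```python
from typing import Callable, Dict, List

Reads = List[str]  # in this toy example a "read" is just a string of bases


def clean(reads: Reads) -> Reads:
    """Cleaning step: strip whitespace and normalise bases to upper case."""
    return [read.strip().upper() for read in reads]


def filter_short(reads: Reads, min_length: int = 5) -> Reads:
    """Filtering step: drop reads shorter than a (hypothetical) threshold."""
    return [read for read in reads if len(read) >= min_length]


def quality_check(reads: Reads, min_reads: int = 2) -> Reads:
    """Quality step: stop early if too little data survived filtering."""
    if len(reads) < min_reads:
        raise ValueError(f"only {len(reads)} reads passed filtering; need {min_reads}")
    return reads


def summarise(reads: Reads) -> Dict[str, float]:
    """Final step: bring the results together in a small summary."""
    total_bases = sum(len(read) for read in reads)
    return {"n_reads": len(reads), "mean_length": total_bases / len(reads)}


# The pipeline is simply the steps applied in order.
STEPS: List[Callable[[Reads], Reads]] = [clean, filter_short, quality_check]


def run_pipeline(reads: Reads) -> Dict[str, float]:
    for step in STEPS:
        reads = step(reads)
    return summarise(reads)


if __name__ == "__main__":
    print(run_pipeline([" acgtac ", "ac", "GGGTTTAA"]))
```

Because each step takes reads in and hands reads out, a step can be swapped or reordered without touching the others, which is the property a workflow manager builds on.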

Which GSU pipelines are open?

The GSU currently has three pipelines that are fully open to external users, with more in development, all accessible via GitHub. These pipelines help interpret the genomic data provided by our partners in the MalariaGEN community. They each use the workflow management system Nextflow, essentially a programming language used by bioinformaticians to build large and scalable workflows.

Two pipelines analyse the malaria parasite Plasmodium falciparum. The first pipeline processes amplicon sequencing data, which covers small, targeted segments of the genome and can give specific information about known mutations, for example those causing drug resistance. The second pipeline processes whole-genome sequence data, using a set of reference points across the genome to answer a broader range of questions about a sample.

Blood samples in the lab containing malaria parasites.
Bioinformatics pipelines ultimately unlock the secrets kept within these blood samples, such as which malaria parasites are present and whether they are drug resistant. (Credit: Greg Moss / Wellcome Sanger Institute)

A third pipeline analyses two malaria-transmitting mosquito vector species, Anopheles gambiae and Anopheles funestus, reading single nucleotide polymorphisms (SNPs) to provide intelligence on insecticide resistance. Adapting this pipeline into Nextflow was one of several enhancements that have allowed it to handle both species and, so far, process over 18,000 samples.

It’s worth noting that the GSU is not the only group opening up its pipelines. Even within the MalariaGEN community there are other open pipelines, developed by the Broad Institute of MIT and Harvard.

How do we make our pipelines open?

Making bioinformatics pipelines truly open is not just about being transparent, but also being helpful. Many people publish open-source code that anyone can read, but users also need to be able to understand how the code works.

Ideally we want these pipelines to be used in public health settings worldwide, to give quick and easy answers about a recently collected pathogen sample. The ultimate aim is for anyone to be able to install the pipeline on their machine and reproduce the exact same outputs. Here are some ways we achieve this accessibility at the GSU.

File format options: Choosing where to start

To run any bioinformatics pipeline, you first need the raw genomic data. This is generated by sequencing machines, and stored in files.

The file format for raw data most familiar to bioinformaticians is FASTQ. FASTQ files are large: they include the genome sequence data itself (lists of As, Cs, Gs and Ts); the quality, or confidence of each individual letter in the sequence; plus other information about the sequencing machine. The large size is not a problem for many people, who may only be studying a handful of samples.
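
As a rough illustration of what a FASTQ file holds, the Python sketch below parses a single record. It assumes the common four-line layout (header, sequence, separator, quality string) and the Phred+33 quality encoding; the function name and the example read are made up for this article.

```python
from typing import Dict, List, Union


def parse_fastq_record(lines: List[str]) -> Dict[str, Union[str, List[int]]]:
    """Parse one four-line FASTQ record into its header, sequence and scores."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record is expected to have exactly four lines")
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("record does not look like FASTQ")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must be the same length")
    # Phred+33: each quality character encodes a score as (ASCII code - 33).
    scores = [ord(char) - 33 for char in quality]
    return {"header": header[1:], "sequence": sequence, "scores": scores}


if __name__ == "__main__":
    record = parse_fastq_record(["@example_read", "ACGT", "+", "II5!"])
    print(record["sequence"], record["scores"])
```

A higher score means greater confidence in that letter, so in the example read the last base (score 0) is the least trustworthy.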

However, for storing vast amounts of sequencing data, as Sanger does, a more space-efficient format is often used: CRAM (Compressed Reference-oriented Alignment Map) uses “reference-based compression” to significantly reduce the amount of disk space taken up by raw sequencing data. It is the native file format of large sequence database archives, such as those at the European Bioinformatics Institute.

Other formats for raw data also exist. For example, data sent to the Sanger Institute by our MalariaGEN partners based in other countries is often received as BCL (Binary Base Call) files, the native format of the Illumina sequencing machines and an even more raw form of the data.

One of the key features of pipelines designed at the GSU is that they can accept data in any of the above formats as a starting point, which makes them usable by a broader range of users.
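
One simple way to picture multiple entry points is a lookup from file type to the first stage a sample needs. The Python sketch below is purely hypothetical: the stage names and the extension-based rule are assumptions made for illustration, not a description of how the GSU pipelines choose their starting point.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the stage where a sample enters.
ENTRY_POINTS = {
    ".bcl": "convert_base_calls",   # rawest form, needs the most processing
    ".cram": "unpack_cram",         # compressed, reference-based format
    ".fastq": "process_reads",      # already in the familiar read format
    ".fq": "process_reads",
}


def choose_entry_point(filename: str) -> str:
    """Return the (hypothetical) first stage for a given input file."""
    suffixes = [suffix.lower() for suffix in Path(filename).suffixes]
    # Ignore a trailing .gz so that compressed FASTQ files are recognised too.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in ENTRY_POINTS:
        raise ValueError(f"unsupported input format: {filename}")
    return ENTRY_POINTS[suffixes[-1]]


if __name__ == "__main__":
    for name in ["sample_1.cram", "sample_2.fastq.gz", "lane_1.bcl"]:
        print(name, "->", choose_entry_point(name))
```

Whatever the starting format, every sample ends up on the same downstream path, so users do not need to convert their data before they begin.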

Section of the diagrammatic overview of the GSU's Vector Variant Calling Pipeline, showing several entry points for different file types, including CRAM and FASTQ files.

Documentation: How to use the pipeline

We try to be as helpful as possible by including several different types of accompanying documentation with our pipelines.

A quick start guide gets users underway setting up the pipeline on their own machine. We also include comments in the code itself, so users can quickly understand specific parts of the pipeline without having to trawl through reams of information. Graphical representations (such as those pictured here) are also a quick way to bring new users up to speed.

Finally, each pipeline includes a test data set. This allows new users to do a quick run with some data that we know works well with the pipeline, testing their installation and the code base, and hopefully reproducing identical outputs. If they can’t, then something needs to be tweaked.
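
A check like this can be automated by comparing checksums of the files a test run produces against known-good values. The Python sketch below shows one way to do that with SHA-256; the function names and the idea of a dictionary of expected checksums are assumptions for illustration, not the GSU's actual test set-up.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_mismatches(output_dir: str, expected: Dict[str, str]) -> List[str]:
    """List the output files that are missing or differ from what is expected.

    `expected` maps a file name to the checksum of its known-good version.
    """
    mismatches = []
    for name, expected_checksum in sorted(expected.items()):
        path = Path(output_dir) / name
        if not path.is_file() or file_checksum(path) != expected_checksum:
            mismatches.append(name)
    return mismatches
```

An empty list means the run reproduced the reference outputs exactly. A byte-for-byte comparison is strict: outputs that embed details such as timestamps may differ even when the science is identical, so a real check might compare selected fields instead.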

Containers: Making things modular

A big part of making pipelines completely open is providing the environment in which you can run them. In coding, containers are self-contained computing environments, which bundle not just the code needed to do the analysis but also the supporting software and settings that code needs to run well. By creating containers, pipelines become much more accessible for external users, who no longer get bogged down readjusting the settings of their local machine to achieve the same outputs.

Containers also make pipelines modular. This means particular sections of the pipeline, for example the bit which detects resistance to an antimalarial drug, can be removed and the rest of the pipeline will still work. The modular section can be tweaked or have a component added, then slotted back into the pipeline.

This helps achieve what developers refer to as portability, being able to take useful sections of code and deploy them elsewhere. It also makes scaling up the GSU’s pipelines more straightforward because we can take the relevant sections and, for example, adjust them to run a higher quantity of samples.

A diagram of the GSU's malaria parasite amplicon pipeline, showing various containers depicted by orange boxes.

Why are open pipelines so important?

None of these steps for making truly open pipelines are especially groundbreaking: give people an entry point they understand, explain how your pipeline works, make sure the code is easy to move around. But it all takes extra time and effort. So why bother?

Making pipelines better

The most obvious benefit of open pipelines is that many people can collaborate on them. That makes the pipelines stronger in a number of ways.

First, quality control: having many eyes checking your work means bugs get spotted and fixed much more quickly.

Second, innovation and further development speed up. Different researchers can take your pipeline, or sections of it, and add their own code and ideas, allowing it to perform its functions more effectively or do new things entirely.

Third, long-term sustainability is more likely because an open pipeline is owned by the community rather than just its creator. If the original owner neglects the pipeline, others can step in to manage it and drive development forward.

A hackathon event, held jointly by MalariaGEN and the Pan-African Mosquito Control Association, where bioinformaticians could troubleshoot and share ideas in malaria vector genomic analysis. (Credit: PAMCA)

Making pipelines fairer

Making pipelines as accessible as possible means their inner workings are not controlled solely by large scientific institutions, a situation that has previously led to such pipelines being described as “black boxes”.

This is particularly important for creating a more equitable relationship between different regions of the world, making bioinformatics pipelines accessible to users with more limited resources. In the context of infectious diseases, it is often the countries with fewer resources where pathogens pose the greatest threat.

Making pipelines with global impact

Bringing all this together, open pipelines allow the genomic surveillance of infectious diseases to have genuine global impact. Not only are many more minds able to collaborate in the same space, but that space can include users with firsthand experience of the diseases themselves.

What’s more, making these somewhat daunting modern technologies as easy to pick up as possible encourages their adoption in industries beyond academia, for example in public health. This increases the chances of achieving real-world results: spotting the next ‘disease X’, identifying mutations of existing diseases early, and ultimately saving lives.

All the GSU's open bioinformatics pipelines can be found on GitHub. For more guidance on building bioinformatics pipelines for public health, see the best practices document from the Public Health Alliance for Genomic Epidemiology (PHA4GE).

