Cloud-native is the future of bioinformatics applications
It was clear from the London Bioinformatics Frontiers Conference, hosted last week at The Francis Crick Institute, that the next frontier in bioinformatics is the move towards cloud-native applications.
Cloud-native applications are container-based applications and environments that run independently of the underlying hardware platform, are more resilient to failure and are managed on elastic infrastructure. By abstracting the underlying compute, storage and networking primitives, cloud-native deployments do not rely on manual infrastructure resource allocation. Instead, the orchestrator handles resource allocation automatically, according to quotas set by the operators.
The technologies required to go cloud-native are already here: Docker and Singularity are the leading container technologies; Kubernetes automates container orchestration; and sophisticated workflow management systems, like Nextflow, streamline pipeline creation and make pipelines portable and scalable. Last but not least, the cloud has matured sufficiently that over half of the speakers presented diagrams of their organisation's analysis deployment stack on AWS, and a few even mentioned private cloud deployments using OpenStack.
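To make this concrete, here is a minimal, hypothetical sketch of how these pieces can fit together in a Nextflow configuration file: pipeline tasks run in Docker containers as pods on a Kubernetes cluster, and resources are requested declaratively rather than allocated by hand. The container image, namespace and storage claim names are illustrative placeholders, not a setup presented at the conference.

```groovy
// nextflow.config (sketch): a declarative, orchestrator-managed deployment.
// All names below are illustrative assumptions.

process {
    executor  = 'k8s'                                       // run each task as a Kubernetes pod
    container = 'quay.io/biocontainers/fastqc:0.11.8--1'    // tasks execute inside a Docker image
    cpus      = 2                                           // per-task requests; the scheduler
    memory    = '4 GB'                                      // decides where each pod actually runs
}

k8s {
    namespace        = 'bioinformatics'    // quota-scoped namespace set by the operators
    storageClaimName = 'nf-workdir-pvc'    // shared persistent volume for the work directory
    storageMountPath = '/workspace'
}
```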
So what are the big challenges to going fully cloud-native? And even in organisations with early cloud deployments, why isn't cloud-native used by everyone?
The answer to this question, as identified during discussions at the conference, is that alongside moving data into the cloud, the containerisation of complex workflows and the standardisation of these applications remain major hurdles.
Containerisation is the first step for going cloud-native
Lightweight containers, such as Docker and Singularity, allow researchers to package, distribute and run pipelines in an isolated and self-contained manner across a wide variety of cloud and compute platforms. As such, they streamline software integration.
Containerising pipelines and tools is the standard required for reproducibility and portability, as we established when we published the Nextflow paper in Nature Biotechnology over two years ago. Although individual researchers at the forefront of bioinformatics and omics data analysis have embraced containerisation, as evidenced by the talks during the conference, many organisations still do not routinely containerise their workflows and have not yet adopted sophisticated high-level workflow management systems such as Nextflow. Instead, they are still operating with ad-hoc scripts that are far from being containerised, standardised or even properly version-controlled with Git. The lack of both containerisation and workflow management systems in some organisations became even more apparent during the London Bioinformatics Frontiers Hackathon.
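For readers unfamiliar with what workflow containerisation looks like in practice, below is a minimal, illustrative Nextflow (DSL1) process that runs a single tool inside a container. The BioContainers image and input path are assumptions made for the example, not part of any pipeline shown at the conference.

```groovy
// main.nf (sketch): one containerised pipeline step; image and paths are illustrative.
params.reads = 'data/*.fastq.gz'

Channel.fromPath(params.reads).set { reads_ch }

process fastqc {
    container 'quay.io/biocontainers/fastqc:0.11.8--1'   // same image on a laptop, HPC or the cloud

    input:
    file reads from reads_ch

    output:
    file '*_fastqc.zip' into fastqc_results_ch

    script:
    """
    fastqc ${reads}
    """
}
```

Launched with, for example, `nextflow run main.nf -with-docker` (or `-with-singularity`), Nextflow pulls the declared image and runs the step in isolation, which is what makes the same pipeline portable across environments.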
It is therefore safe to assume that, for the majority of organisations, the biggest hurdle to going cloud-native is being able to quickly containerise the wide variety of bioinformatics tools and applications and stitch them together on the fly into reproducible bioinformatics pipelines.
Reducing the time spent creating workflows is a must
Although workflow management systems, such as Nextflow, have massively advanced the field of bioinformatics by providing easy scaling, parallelisation and portability, issues arise when workflows become overly complex and burdensome to build and maintain. Specifically, these systems require users to build pipelines that bring together many different tools, run as separate processes in a fixed logical order, which results in monolithic pipelines. Monolithic pipelines represent a real challenge in bioinformatics today: altering these large pipelines to introduce new processing steps or tools requires bioinformaticians to copy pipelines and modify them with ad-hoc changes. Besides demanding a high degree of specialisation, this wastes a lot of time and severely compromises modularity, scalability and interoperability.
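The fragment below, a purely illustrative DSL1 sketch rather than any pipeline presented at the conference, shows why such pipelines resist change: the channel plumbing between steps is hard-wired, so inserting a new step between trimming and assembly means editing both processes and rewiring the channels by hand.

```groovy
// Illustrative monolithic plumbing: each output channel is wired directly
// into the next process, so new steps cannot be dropped in without rewiring.
params.reads = 'data/*.fastq.gz'

Channel.fromPath(params.reads).set { raw_reads_ch }

process trim {
    input:
    file reads from raw_reads_ch

    output:
    file 'trimmed.fastq.gz' into trimmed_ch   // consumed directly by `assemble`

    script:
    """
    trimmomatic SE ${reads} trimmed.fastq.gz SLIDINGWINDOW:4:20
    """
}

process assemble {
    input:
    file trimmed from trimmed_ch              // inserting a read-correction step here
                                              // means rewiring this channel by hand
    output:
    file 'contigs.fasta' into assembly_ch

    script:
    """
    spades.py -s ${trimmed} -o spades_out
    cp spades_out/contigs.fasta contigs.fasta
    """
}
```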
FlowCraft, mentioned in over half of the talks, represents a paradigm shift from building entire pipelines to building modular pipeline components, much like Lego pieces, which FlowCraft can then automatically stitch together into dynamic workflows on the fly. Once each modular, containerised component is built and validated, it can be seamlessly reused and integrated into other workflows. This component-driven development, along with a powerful and flexible orchestration engine, allows users to assemble highly diverse Nextflow-based pipelines suited to specific needs and biological questions.
A great use case of FlowCraft in action is the INNUENDO Platform, a cross-sectional platform for the management, analysis and sharing of bacterial genomic data in routine surveillance and outbreak investigation of food-borne pathogens. INNUENDO uses FlowCraft to build pipelines according to the available protocols. In particular, the platform lets users simply select and assemble a predefined set of analytical modules through an intuitive, user-friendly interface, as depicted below. FlowCraft then combines the dockerised modules, the Lego pieces, on the fly into different Nextflow pipelines that are run with the provided inputs. INNUENDO also uses the FlowCraft web interface to deliver pipeline process inspection and report visualisations.
INNUENDO platform: On-the-fly specification of pipeline components by users, based on FlowCraft
Standardisation of workflows fast-tracks going cloud-native
Standardisation is still a recognised challenge in bioinformatics, especially in the context of cloud-native applications, as bioinformatics analysis pipelines are often designed for on-premises execution. Importantly, workflows need to be standardised to ensure unified parallelism, portability, reproducibility and scalability. By adopting out-of-the-box standardised pipelines, the industry can overcome this challenge and more readily embrace cloud-native applications.
Community-driven initiatives such as nf-core, also mentioned in over half of the talks, are leading the way for Nextflow-based workflow standardisation by ensuring, through a peer-review process, that best practices are followed in the creation and development of bioinformatics pipelines. This ensures that standardised pipelines are portable, optimised, documented and easy to use out-of-the-box.
DRAGEN (Dynamic Read Analysis for GENomics), a non-containerised, accelerated cloud-native implementation of BWA/GATK-based secondary analysis that can be accessed and run from the Illumina BaseSpace platform, the AWS Marketplace and the Lifebit CloudOS marketplace, is another example of a standardised pipeline used out-of-the-box by the industry. Specifically, DRAGEN addresses the lengthy compute times and massive data volumes involved in the secondary analysis of NGS data. In terms of speed, for instance, DRAGEN can complete the secondary analysis of a whole human genome at 30x coverage in 25 minutes in the cloud, compared to close to 15 hours with a traditional CPU-based system. When running secondary analysis in the cloud, DRAGEN provides the same speed and accuracy as running on-premises, while also delivering the flexibility and scalability of the cloud.
A final word
In conclusion, I personally found these insights invaluable and a confirmation that our efforts at Lifebit to build a solid team of experts in, and direct contributors to, all the aforementioned technologies revolutionising bioinformatics and omics data analysis are well-founded. They also reinforce our strong belief that building CloudOS, the equivalent of GitHub for data analysis, can add immense value to the community.
As the focus of this conference was on data analysis workflow deployment, we only skimmed the surface of data management and of moving data into the cloud. But stay tuned for the next London Bioinformatics Frontiers Conference, as I am sure the organising team is already on it.