Developing a reproducible, scalable, and shareable pipeline for alternative splicing analysis using Nextflow
nf-core/rnasplice is a bioinformatics pipeline for alternative splicing analysis of RNA sequencing data


Benjamin Southgate is a senior bioinformatician at Zifo with a background in clinical medicine and cancer genomics. He led the Nextflow development team which included James Ashmore, Valentino Ruggieri, Claire Prince, Keerthana Bhaskaran, Asma Ali, and Lathika Mohan. In this article, Benjamin will describe the team’s experience using Nextflow and the nf-core framework to develop a best-practice pipeline for alternative splicing analysis.


Alternative splicing refers to the process by which different combinations of exons within a pre-mRNA transcript are selected and joined together, resulting in the production of multiple mRNA isoforms from a single gene. Studying this process is important because it plays a significant role in biological processes like tissue development and response to environmental cues, and its dysregulation has been linked to a range of neurological and muscular disorders, as well as cancer. Nonetheless, the analysis of alternative splicing events using RNA-seq data can be quite challenging, particularly when faced with multiple analysis packages and the absence of a best-practice pipeline.

Alternative splicing is a process during gene expression that allows a single gene to code for multiple proteins. https://en.wikipedia.org/wiki/Alternative_splicing
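
To make the idea of "different combinations of exons" concrete, here is a toy sketch in Python. It assumes a hypothetical gene with four exons, two of which (E2 and E3) may each be kept or skipped independently. Real splicing is far more constrained and varied than this, and the function name and exon labels are purely illustrative.

```python
from itertools import product


def enumerate_isoforms(exons, optional):
    """Return every exon combination obtained by keeping or skipping each optional exon.

    `exons` is the ordered list of exons in the gene; `optional` is the set of
    exons that may be skipped. Exons not in `optional` are always retained.
    """
    optional_in_order = [e for e in exons if e in optional]
    isoforms = []
    # Each optional exon is independently kept (True) or skipped (False).
    for choices in product([True, False], repeat=len(optional_in_order)):
        keep = dict(zip(optional_in_order, choices))
        isoforms.append([e for e in exons if keep.get(e, True)])
    return isoforms


# Hypothetical gene: E1 and E4 are always included, E2 and E3 can be skipped.
gene = ["E1", "E2", "E3", "E4"]
isoforms = enumerate_isoforms(gene, optional={"E2", "E3"})
for isoform in isoforms:
    print("-".join(isoform))
```

Even this small example yields four distinct isoforms from one gene, which hints at why detecting and comparing splicing events across samples quickly becomes a non-trivial analysis problem.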

Excitingly, bioinformatics has experienced a surge of vibrant and collaborative initiatives aimed at establishing standardized protocols for pipeline development. This collective endeavour has converged in the form of the nf-core initiative, a devoted community focused on curating and creating reproducible, standardized pipelines using the domain-specific language Nextflow. Given the pressing demand for an alternative splicing pipeline, we felt the optimal choice for development would be Nextflow and the nf-core framework. In this blog post, I will delve into the valuable lessons the team learned and the challenges we encountered throughout the development journey.

Getting Started Is Easier with Friends

Developing a pipeline with Nextflow had the distinct advantage of receiving support and resources from the nf-core community. This community, consisting of bioinformatics researchers and developers, created a helpful and inclusive environment that fostered collaboration and made it easier to get started. The nf-core toolkit includes a starting template, a suite of publicly available and heavily tested modules and subworkflows, and thorough documentation and best practices, all of which proved invaluable during development.

The nf-core framework for community-curated bioinformatics pipelines. https://doi.org/10.1038/s41587-020-0439-x

The well-established structure and conventions of nf-core pipelines helped the team to efficiently organize and structure the pipeline, ensuring reproducibility and ease of use for the wider research and developer community. The availability of already developed modules, including some of the most popular bioinformatics tools such as FastQC, STAR, and Salmon, made it seamless to add common tools to the workflow. The active and supportive nf-core community also provided timely feedback, suggestions, and bug fixes through the nf-core Slack channel, which greatly aided in refining the pipeline. Overall, the nf-core community and tools played a crucial role in the successful initiation and development, making it a valuable asset in our bioinformatics work.

Ups, Downs, and In-Betweens

Throughout the development journey, the advantages of using Nextflow for building the pipeline became increasingly evident to the team. A Nextflow pipeline, following the nf-core framework, offers a seamless solution for tackling the myriad challenges that often arise during pipeline development: managing missing dependencies, allocating adequate resources, and continuous integration testing.

Nextflow and nf-core also greatly enhance reproducibility and transferability, simplifying the process of sharing and publishing pipelines. With the guidelines and tools developed by the nf-core community, the development of the pipeline became a streamlined, efficient, and scalable endeavour. The pipeline's modular structure, coupled with the flexibility of Nextflow, allowed us to effortlessly incorporate various tools and methodologies for alternative splicing analysis, tailoring the pipeline to suit our specific research requirements. Furthermore, the implementation of containerization through Docker and Singularity ensured the reproducibility and portability of the pipeline across diverse computing environments.

The development process was not without its share of difficulties. One notable challenge we encountered was grappling with the language itself. Despite its inherent power and flexibility as a workflow management system, mastering the syntax and grasping the concepts proved to be somewhat of a steep learning curve for the team. Understanding process definition, channel management, operator usage, and data input/output required careful attention to detail. Furthermore, troubleshooting and accessing up-to-date documentation for Nextflow posed its own set of challenges, given the ever-evolving nature of the language. We’d often have to dive into the nf-core Slack channel to clarify language features. The aspect that caused us the most trouble was the interaction between Nextflow and the Groovy language. Although the documentation suggests that they are interchangeable, in reality, it was often unclear when Groovy methods could be used, and the error messages were not always intuitive. However, with perseverance and support from the Nextflow community, we were able to overcome these challenges and successfully implement the envisioned pipeline.

RNASplice – A Quick Look

The rnasplice pipeline performs a number of steps essential for conducting alternative splicing analysis. These include quality control, trimming, alignment, quantification, differential exon usage, differential transcript usage, and event-based splicing detection. Moreover, it generates coverage tracks and a comprehensive array of plots to thoroughly examine and scrutinize the differential results.

The pipeline is designed to be reproducible, scalable, and adaptable to different analysis methods and can be run on cloud platforms like AWS Batch or through Nextflow Tower for efficient and automated execution. The final development schematic that we produced can be seen below.

Schematic of nf-core/rnasplice bioinformatics pipeline for alternative splicing analysis of RNA sequencing data

Before we started developing the pipeline, we had one big question: which differential splicing software should we include? To find the answer, we conducted a literature review on available software and how well they performed in benchmarking studies. We also asked the nf-core community for their input, and we got a lot of helpful recommendations. After some discussion, we decided it would be best to include software across a variety of analysis methods. These methods covered differential exon/transcript usage and expression, plus differential event-based splicing. By including all of these methods, we hope to give users a wide range of analysis options that meet their needs and preferences.

In the development of the pipeline, a crucial factor we considered was its ability to accommodate diverse types of input data. We recognized that users might have already completed the fundamental pre-processing of their RNA-seq data and wanted to avoid duplicating those steps. Consequently, we ensured that the pipeline could be initiated at various intermediate stages, allowing users to start from FASTQ files, genome or transcriptome BAM files, or even existing Salmon quantification output.
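
The idea of multiple entry points can be sketched in a few lines of Python. This is only an illustration of the design, not the pipeline's actual Nextflow code: the enum values, step names, and the mapping from input type to first required step are all hypothetical simplifications, and downstream analyses are left out entirely.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical labels for the kinds of input a user might supply."""
    FASTQ = "fastq"
    GENOME_BAM = "genome_bam"
    TRANSCRIPTOME_BAM = "transcriptome_bam"
    SALMON_RESULTS = "salmon_results"


# Upstream steps in the order they are listed in the pipeline overview.
UPSTREAM_STEPS = ["quality_control", "trimming", "alignment", "quantification"]

# Assumed mapping: the first upstream step still required for each input type.
# None means all upstream steps have already been done by the user.
FIRST_STEP_NEEDED = {
    Source.FASTQ: "quality_control",
    Source.GENOME_BAM: "quantification",
    Source.TRANSCRIPTOME_BAM: "quantification",
    Source.SALMON_RESULTS: None,
}


def steps_to_run(source):
    """Return the upstream steps that still need to run for a given input type."""
    first = FIRST_STEP_NEEDED[source]
    if first is None:
        return []
    return UPSTREAM_STEPS[UPSTREAM_STEPS.index(first):]


for source in Source:
    print(source.value, "->", steps_to_run(source))
```

Keeping the decision in a single lookup table means that supporting a new input type is a matter of adding one entry rather than threading conditionals through every step, which is roughly the motivation for offering distinct starting points.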

Throughout the development process, we actively engaged with the nf-core community to gather feedback on our progress. One valuable recommendation we received was to incorporate visualizations into the pipeline. When it comes to analyzing alternative splicing, a picture is worth more than words, as it facilitates a better understanding of the biological results. As a result, we expanded the functionality to include plotting of differential results. To achieve this, we leveraged the capabilities of the DEXSeq, edgeR, and MISO software, which enabled us to generate informative sashimi plots, among other visualizations.

In conclusion, the adoption of the nf-core framework greatly facilitated the development, reproducibility, and scalability of the rnasplice pipeline. The combination of Nextflow's flexibility, containerization, collaborative community, and additional features provided by Nextflow Tower, has made the rnasplice pipeline a powerful tool for researchers in the field of bioinformatics and RNA-seq analysis. We hope that this pipeline will contribute to advancing the understanding of alternative splicing regulation and its implications in various biological processes and diseases.

If you're interested in learning more about the rnasplice pipeline, you can find the pipeline code and documentation on the nf-core website. Furthermore, for customers seeking to leverage our bioinformatics expertise in creating their own Nextflow pipeline, feel free to contact us directly at [email protected]. We look forward to hearing from you!

Acknowledgements

Thank you to all the nf-core community who helped us over the course of this development project. In particular, we would like to thank Harshil Patel, Maxime Garcia, Phil Ewels, and Gisela Gabernet for their technical advice and guidance.

