Bioinformatics Workflow of RNA-Seq

Yang H.

Senior Scientist at CD Genomics

Published: January 20, 2020

RNA sequencing (RNA-seq) has a wide range of applications, and no single pipeline is optimal for all cases. This article reviews the major steps in RNA-seq data analysis: quality control, read alignment, quantification of gene and transcript levels, differential gene expression, functional profiling, and advanced analysis. Each step is discussed in turn in the sections that follow.

Quality control of raw reads

Quality control of RNA-seq raw reads consists of the analysis of sequence quality, GC content, adaptor content, overrepresented k-mers, and duplicated reads, with the aim of detecting sequencing errors, contamination, and PCR artifacts. Read quality decreases towards the 3’ end of reads, so low-quality bases at that end should be removed to improve mappability. Beyond the quality of the raw data, quality control also covers read alignment (read uniformity and GC content), quantification (3’ bias, biotypes, and low counts), and reproducibility (correlation, principal component analysis, and batch effects).
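The 3’ trimming idea can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding and a cutoff of Q20, neither of which is specified in this article, and the function names are hypothetical. Dedicated trimming tools typically use more elaborate rules (for example sliding windows), so this is only an illustration of the principle.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove trailing (3' end) bases whose Phred score is below min_quality.

    Trimming stops at the first base, scanning from the 3' end,
    that meets the threshold; bases upstream of it are kept as-is.
    """
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIII###")
print(trimmed_seq, trimmed_qual)
```

In this toy read the last three bases fall below the cutoff and are dropped, leaving the seven high-quality bases.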

Table 1. The tools for quality control of RNA-seq raw reads.

Read alignment

There are generally three strategies for read alignment: genome mapping, transcriptome mapping, and de novo assembly. Whether the reference is a genome or a transcriptome, reads may map uniquely or be assigned to multiple positions in the reference; the latter are referred to as multi-mapped reads or multireads. Genomic multireads are generally due to repetitive sequences or shared domains of paralogous genes. Transcriptome multi-mapping arises more often because of gene isoforms. Transcript identification and quantification are therefore important challenges for alternatively expressed genes. When a reference is not available, RNA-seq reads are assembled de novo using packages such as SOAPdenovo-Trans, Oases, Trans-ABySS, or Trinity. Paired-end (PE), strand-specific, and long reads are preferred since they are more informative. Emerging long-read technologies, such as PacBio SMRT sequencing and Nanopore sequencing, can generate full-length transcripts for most genes.
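The distinction between uniquely mapped reads and multireads can be sketched as follows. This is a toy illustration, not the output format of any particular aligner: it assumes single-end data with one `(read_id, reference_name, position)` tuple per reported alignment, and the read names and coordinates are made up.

```python
from collections import Counter


def classify_reads(alignments):
    """Split read IDs into uniquely mapped reads and multireads.

    alignments: iterable of (read_id, reference_name, position) tuples,
    one tuple per reported alignment (assumed layout).
    """
    hits = Counter(read_id for read_id, _, _ in alignments)
    unique = sorted(read for read, n in hits.items() if n == 1)
    multi = sorted(read for read, n in hits.items() if n > 1)
    return unique, multi


# Hypothetical alignments: read2 is reported at two loci, e.g. a repeat.
toy_alignments = [
    ("read1", "chr1", 100),
    ("read2", "chr1", 500),
    ("read2", "chr7", 920),
    ("read3", "chr2", 42),
]
unique_reads, multireads = classify_reads(toy_alignments)
print("unique:", unique_reads)
print("multireads:", multireads)
```

With paired-end data each mate would produce its own record, so the counting would need to be done per fragment rather than per record.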

Figure 2. Three basic strategies for RNA-seq read mapping (Conesa et al. 2016). Abbreviations: GFF, General Feature Format; GTF, gene transfer format; RSEM, RNA-seq by Expectation Maximization.

Table 2. The comparison of genome-based and de novo assembly strategies for RNA-seq analysis.

Table 3. The public sources of RNA-seq data.

Transcript quantification

Transcript quantification can be used to estimate gene and transcript expression levels.

Table 4. The common tools for transcript quantification.

Differential expression testing

Differential expression testing evaluates whether a gene is differentially expressed in one condition compared with the other(s). Samples must be normalized before they can be compared. RPKM and TPM normalize away the most important factor, sequencing depth. TMM, DESeq, and UpperQuartile can ignore highly variable and/or highly expressed features. Other factors that interfere with intra-sample comparisons include transcript length, positional biases in coverage, average fragment size, and GC content, which can be normalized with tools such as DESeq, edgeR, baySeq, and NOISeq. Batch effects may still be present after normalization; they can be minimized by appropriate experimental design or removed with methods such as ComBat or ARSyN.
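As an illustration of depth normalization, the standard RPKM and TPM definitions can be written in a few lines of Python. This is a sketch on made-up counts and transcript lengths (in bases); it assumes at least one non-zero count and is not a substitute for the packages named above.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (length * total_reads)
            for c, length in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale."""
    rates = [c / length for c, length in zip(counts, lengths)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]


# Hypothetical sample: three transcripts with read counts and lengths.
toy_counts = [100, 200, 700]
toy_lengths = [1000, 2000, 7000]

rpkm_values = rpkm(toy_counts, toy_lengths)
tpm_values = tpm(toy_counts, toy_lengths)
print(rpkm_values)
print(tpm_values)
```

One practical difference is that TPM values always sum to one million within a sample, whereas the sum of RPKM values varies from sample to sample.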

Table 5. The normalization tools for differential expression testing.

Alternative splicing analysis

Alternative splicing (AS) is a post-transcriptional process that generates different transcripts from the same gene, and it is vital in the response to environmental stimuli because it produces diverse protein products. Multiple bioinformatics tools have been developed to detect AS from experimental data. Ding et al. (2017) compared these detection tools using RNA-seq data; the results are shown in Table 7. They found that TopHat and its downstream tool, FineSplice, are the fastest tools, whereas PASTA is the slowest. AltEventFinder detects the highest number of junctions, and RSR detects the lowest. Other tools, such as TopHat, are likely to report false-positive junctions. Of the two tools that detect differentially spliced isoforms, rMATS is faster than rSeqDiff but detects fewer differentially spliced isoforms.

Table 7. Detected AS types or differentially spliced isoforms of these tools (Ding et al. 2017).

Visualization

There are many bioinformatics tools for the visualization of RNA-seq data. These include genome browsers, such as ReadXplorer, the UCSC browser, Integrative Genomics Viewer (IGV), Genome Maps, and Savant; tools specifically designed for RNA-seq data, such as RNAseqViewer; and packages for differential gene expression analysis that also support visualization, such as DESeq2 and DEXSeq in Bioconductor. Packages such as CummeRbund and Sashimi plots have also been developed exclusively for visualization.

Functional profiling

The last step in a standard transcriptomics study is generally the characterization of the molecular functions or pathways in which differentially expressed genes are involved. Gene Ontology, Bioconductor, DAVID, and Babelomics contain annotation data for most model species, which can be used for functional annotation. Novel protein-coding transcripts can be functionally annotated using orthology with the help of databases such as SwissProt, Pfam, and InterPro. Gene Ontology (GO) allows for some exchangeability of functional information across orthologs. Blast2GO is a popular tool that allows massive annotation of complete transcriptomes against a variety of databases and controlled vocabularies. The Rfam database contains most well-characterized RNA families and can be used for functional annotation of long non-coding RNAs.
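One common way to ask whether a functional category is over-represented among differentially expressed genes is a one-sided hypergeometric test. The article does not name a specific statistical method, so the sketch below is an assumption about how such a test might look, with a hypothetical function name and made-up gene counts.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: genes in the background set
    n_annotated:  background genes annotated with the category
    n_selected:   differentially expressed genes
    n_overlap:    differentially expressed genes with the annotation
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 background genes, 4 carry the annotation,
# 3 genes are differentially expressed and 2 of those are annotated.
p_value = enrichment_pvalue(10, 4, 3, 2)
print(p_value)
```

In a real analysis many categories are tested at once, so the resulting p-values would also need a multiple-testing correction.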
