Overview of Bulk RNA-Seq on T-Bioinfo Server

Sonalika Ray

Bioinformatics Research Consultant | IPA Certified | PhD Scholar

Published: October 26, 2023

RNA-Seq technology provides insights into how cells and tissues function by measuring levels of gene expression. Since all normal cells within an organism possess the same genome, differences in cell identity and function are determined by gene expression. Bulk RNA-Seq (whole transcriptome sequencing) experiments produce a view of gene expression across an entire sample. They do not differentiate among cell types within the sample; rather, they give an averaged view of gene expression within a whole organ or tissue type. This method has been instrumental in the development of many single-cell RNA sequencing methods.

Bulk RNA-Seq data analysis consists of the following key steps:

Quality check and preprocessing of raw sequence reads: PCR Clean & Trimmomatic
Mapping reads to a reference genome or transcriptome: Bowtie-2
Counting reads mapped to individual genes or transcripts: RSemExpTable
Identification of differential expression: DeSeq2

Now let's look at an example Bulk RNA-Seq pipeline on the T-Bioinfo Server (https://server.t-bio.info/pipelines/3931418?time=1653977840) and learn about the types of input files that should be uploaded, the parameters chosen to run the pipeline, the processing pipeline, and finally what the output files look like.

Input files required for processing the pipeline

The Bulk RNA-Seq pipeline accepts the following optional parameters and input file types.

Preliminary Parameters

The following parameter options are available:

Type of Reference Genome: RModelGenomeGTF, RGenome, RGenomeGTF
Format of NGS Data: fastQ, fasta, SAM_BAM, SimulateFQ, SimulateFA
Single or Pair end reads: SE, PE
Organism: To select the reference genome

File Upload

Input files for the pipeline can be provided in any of the following ways:

Upload “txt” files directly under the “Upload Files” option, or
Upload files from a “Link”, or
Upload the “NCBI Run Table” file, or
Upload a “.txt” or “.svl” file to bulk import URLs

Group Selection

Next, we need to arrange the samples according to the specific group each one belongs to. In the demo pipeline used as an example here, we will put the file under one group and proceed.

Contrasts

A contrast is used to represent categorical independent variables (factors) in modeling. In particular, it is used to recode a factor into a set of "contrast variables". We will select all contrasts and then submit all changes to run the pipeline.
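As a rough illustration of recoding a factor into contrast variables, the sketch below applies treatment (dummy) coding, one common contrast scheme, to a list of group labels. The function name and the group labels are hypothetical, and the server may use a different coding scheme internally.

```python
def treatment_contrasts(labels, reference):
    """Recode a categorical factor into 0/1 contrast variables.

    Each non-reference level gets one column; samples in the
    reference level are represented by all zeros.
    """
    levels = sorted(set(labels))
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in labels")
    other = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if lab == lvl else 0 for lvl in other] for lab in labels]
    return other, rows


# Hypothetical two-group design with "control" as the reference level.
columns, matrix = treatment_contrasts(
    ["control", "treated", "control", "treated"], reference="control"
)
print(columns)
print(matrix)
```

With more than two groups, each non-reference group gets its own column, so every column compares one group against the reference.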

Steps for Processing Pipeline

RNA-Seq analysis on the T-Bioinfo Server (Generating Gene Expression Table from raw reads)

To run the pipeline, we follow this workflow:

Start > PCR Clean > Trimmomatic > Bowtie-2 > RSemExpTable > DeSeq2

Let's now understand the functionality of each step in the pipeline.

Start: RNA-Seq analysis pipeline starts with a job called “Start” that compiles user selected data input options into a series of tags and generates the correct pipeline options, reducing the number of possible algorithms to the ones that can handle the input data.

Based on the highlighted buttons, now you can create your pipeline using the graphical interface. By right-clicking the selected button, you will be able to deselect it.

Some buttons will open a parameters dialog box. After selecting all the desired options, select the “end” button to give the pipeline a name, upload data and run the pipeline.

Trimmomatic: The Trimmomatic algorithm removes technical adapter sequences from raw sequencing reads. The Trimmomatic pre-processing step is usually performed to ensure better quality of alignment of reads on the reference genome.
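To make the idea of adapter removal concrete, here is a minimal sketch that trims an adapter from the 3' end of a read using exact string matching. It is not Trimmomatic's actual algorithm, which tolerates mismatches and offers further trimming steps; the function name, the `min_overlap` parameter, and the adapter sequence are assumptions chosen for illustration.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter from the 3' end of a read (exact matches only).

    First looks for the full adapter anywhere in the read, then for a
    prefix of the adapter that overlaps the end of the read.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Check for a partial adapter at the read end, longest overlap first.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if k > 0 and read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Hypothetical adapter sequence used only for this example.
ADAPTER = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAG", ADAPTER))
```

Requiring a minimum overlap avoids trimming reads whose last base or two match the adapter start by chance.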

PCR Clean: The PCR Clean module removes all duplicate reads from raw sequencing data. The presence of duplicate reads from polymerase chain reaction (PCR) amplification can distort estimates of gene expression levels, so the duplicated reads should be eliminated prior to processing the data. Input formats for the module are fastQ or fastA raw sequencing reads. After cleaning PCR duplicates, the output is given in the same format as the input (fastQ or fastA). For more info on PCR amplification please see: https://www.ncbi.nlm.nih.gov/probe/docs/techpcr/
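The idea of removing duplicate reads can be sketched as follows, assuming that two reads count as duplicates when their sequences are identical. How the PCR Clean module itself defines a duplicate is not specified here, so the function name and the record layout are illustrative assumptions.

```python
def remove_duplicate_reads(records):
    """Keep the first occurrence of each read sequence.

    `records` is an iterable of (header, sequence, quality) tuples,
    one per FASTQ record. The input order is preserved.
    """
    seen = set()
    unique = []
    for header, sequence, quality in records:
        if sequence in seen:
            continue  # an identical sequence was already kept
        seen.add(sequence)
        unique.append((header, sequence, quality))
    return unique


# Hypothetical reads: the first and third share the same sequence.
reads = [
    ("@read1", "ACGTACGT", "IIIIIIII"),
    ("@read2", "TTGGCCAA", "IIIIIIII"),
    ("@read3", "ACGTACGT", "HHHHHHHH"),
]
deduplicated = remove_duplicate_reads(reads)
print([header for header, _, _ in deduplicated])
```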

Bowtie2: Bowtie2 is a fast alignment algorithm that is based on the “seed” (or k-mer) approach. “Seed” substrings from the read and their reverse complements are extracted and aligned to the reference in an ungapped fashion. Then their positions on the reference are recorded and they are extended into full alignments using SIMD-accelerated dynamic programming.

For more info please see: Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. doi:10.1038/nmeth.1923.
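The seed step described above can be sketched in a few lines. This toy version extracts seeds from a read and its reverse complement and locates exact matches with plain string search; it stands in for, and is much slower than, the indexed search Bowtie2 actually performs, and it omits the extension step. All function names and the `seed_len` and `interval` defaults are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def extract_seeds(read, seed_len, interval):
    """Return (offset, seed) pairs taken every `interval` bases."""
    return [
        (i, read[i:i + seed_len])
        for i in range(0, len(read) - seed_len + 1, interval)
    ]


def find_seed_hits(read, reference, seed_len=4, interval=2):
    """Return (strand, read_offset, reference_position) for exact seed matches."""
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset, seed in extract_seeds(seq, seed_len, interval):
            start = reference.find(seed)
            while start != -1:
                hits.append((strand, offset, start))
                start = reference.find(seed, start + 1)
    return hits


hits = find_seed_hits("AGGCTC", "TTAGGCTCAA")
print(hits)
```

Hits whose reference position minus read offset agree point to the same candidate alignment start, which is where an extension step would begin.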

RSem: RSEM is an algorithm that quantifies transcript abundances based on the alignment file (SAM file). RSEM produces two files: gene expression and isoform expression. The expression values are given in FPKM (Fragments Per Kilobase of transcript per Million mapped reads); the Rsem Expression Type parameter lets us choose among “FPKM”, “expected_count”, and “TPM”. For more info please see: Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. doi:10.1186/1471-2105-12-323.

DeSeq2: A common difficulty in the analysis of NGS data is the strong variance of LFC (logarithmic fold change) estimates for genes with low read count. DESeq2 overcomes this issue by shrinking LFC estimates toward zero in a manner such that shrinkage is stronger when the available information for a gene is low, which may be because counts are low, dispersion is high or there are few degrees of freedom.
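The effect of shrinkage can be illustrated with a deliberately simplified model: if an LFC estimate is treated as normally distributed with a known standard error and given a zero-centered normal prior, its posterior mean is the raw estimate multiplied by a weight between 0 and 1. This is only a sketch of the general idea and not DESeq2's actual estimation procedure; the function name and the `prior_sd` value are assumptions.

```python
def shrink_lfc(lfc, standard_error, prior_sd=1.0):
    """Posterior mean of an LFC under a zero-centered normal prior.

    The estimate is pulled toward zero, and the pull is stronger when
    the standard error is large relative to the prior standard deviation.
    """
    prior_var = prior_sd ** 2
    weight = prior_var / (prior_var + standard_error ** 2)
    return lfc * weight


# Same raw LFC, different amounts of information about the gene.
well_measured = shrink_lfc(2.0, standard_error=0.1)
poorly_measured = shrink_lfc(2.0, standard_error=1.0)
print(well_measured, poorly_measured)
```

A gene with low counts or high dispersion has a large standard error, so its estimate moves much closer to zero than that of a well-measured gene with the same raw LFC.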

Output Files: Obtained when pipeline processing is complete

After the pipeline has finished processing, you will obtain a list of output files that can be downloaded for statistical analysis and biological interpretation. The output also includes data visualizations that help reveal meaningful patterns or significant results.

Interactive dashboard - RNA-Seq Analysis on T-Bioinfo Server

PDX_1_expression_isoforms.txt: This file lists the gene expression values for isoforms of each gene in all the samples.

FastQC.zip: FastQC provides a simple way to carry out quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report that summarizes the results, giving a quick impression of whether there are any problems with the data.
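One of the quantities behind such a quality report is the per-base Phred quality score stored in each FASTQ record. The sketch below decodes a quality string and averages it, assuming the common Phred+33 encoding; it covers only this one aspect of what FastQC reports, and the function names are illustrative.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality, offset=33):
    """Average Phred score of one read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred_scores("II5!"))
print(mean_quality("IIII"))
```

A Phred score of 20 corresponds to a 1 in 100 chance of an incorrect base call, and a score of 30 to 1 in 1000.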

PDX_1_expression_genes_DeSeq2_All.txt: DESeq2 is a tool for differential gene expression analysis of RNA-Seq data. This file contains the expression values of each gene in all the samples, together with the results of the statistical tests used to identify differentially expressed genes and their statistical significance. The most important values are:

P-Value: The probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference.
P-Adj Value: The p-value adjusted for multiple testing. Because thousands of genes are tested at once, each raw p-value is adjusted so that the rate of false positives is controlled across the whole set of related tests.
Log2FoldChange: How much the expression of the gene or transcript appears to have changed between the comparison and control groups, reported on a logarithmic scale to base 2.
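The adjusted p-value and the log2 fold change can be illustrated with a short sketch. It uses the Benjamini-Hochberg procedure, a widely used multiple-testing adjustment, and a plain ratio of group means with a pseudocount. DESeq2 estimates fold changes from a fitted model instead of a simple ratio, so the function names, the pseudocount, and the example values are assumptions for illustration only.

```python
import math


def log2_fold_change(mean_treatment, mean_control, pseudocount=1.0):
    """Log2 ratio of two group means; the pseudocount avoids log of zero."""
    return math.log2((mean_treatment + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(log2_fold_change(7.0, 1.0))
```

A log2 fold change of 1 means a doubling of expression and -1 a halving, which makes up- and down-regulation symmetric around zero.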

PDX_1_expression_genes_FPKM_DeSeq2_All.txt: This file contains the differentially expressed gene ids (along with their tested statistical values) with respect to FPKM (fragments per kilobase of exon per million mapped fragments) counts.

PDX_1_expression_genes_not_filtered_DeSeq2_All.txt (with 0 values): This is an unfiltered file that contains all gene expression and statistical values, including zero values (missing values).

PDX_1_expression_isoforms_DeSeq2_All.txt: This file yields the differentially expressed gene ids (along with their tested statistical values) for each isoform in the dataset.

The following are some of the interesting plots produced by running the Bulk RNA-Seq pipeline. Let's now try to understand what each plot signifies:

Volcano Plot: Shows statistical significance (p-value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.

LFC Plot: Visualizes the differences between measurements taken in two samples, with log fold change on the Y axis and the log of normalized counts on the X axis.

MA Plot: An application of a Bland–Altman plot for visual representation of genomic data. It visualizes the differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
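The coordinates behind these plots are simple transformations, sketched below under the usual conventions: M is the difference of the log2 values, A is their average, and a volcano plot places log2 fold change on the X axis against the negative log10 of the p-value on the Y axis. The function names and example values are assumptions.

```python
import math


def ma_values(x, y):
    """Return (M, A) for two positive expression values.

    M is the log2 ratio x / y; A is the mean of the two log2 values.
    """
    m = math.log2(x) - math.log2(y)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return m, a


def volcano_point(log2_fold_change, p_value):
    """Return the (x, y) position of one gene on a volcano plot."""
    return log2_fold_change, -math.log10(p_value)


print(ma_values(8.0, 2.0))
print(volcano_point(2.0, 0.001))
```

Taking the negative log10 of the p-value places the most significant genes at the top of the volcano plot, so genes of interest appear in the upper left and upper right corners.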

To learn more about each section and get practical hands-on experience, get started with the “Transcriptomics” coursework on the OmicsLogic Learn Portal.

Link to the Course: https://learn.omicslogic.com/courses/course/course-5-transcriptomics

You can also enroll in the mentor-guided "Transcriptomic Data Analysis" Training Program: https://learn.omicslogic.com/programs/transcriptomics-for-biomedical-research

OmicsLogic offers research services in the field of bioinformatics to support biologists and bioinformaticians with tasks such as data preprocessing, exploratory data analysis, and downstream analysis. If you require assistance with your data and are interested in our research services, please don't hesitate to contact us at [email protected].
