Overview of Bulk RNA-Seq on T-Bioinfo Server
Omicslogic Transcriptomics - Bulk & Single Cell Data Analysis ~ Statistical Analysis ~ Machine Learning

RNA-Seq technology provides insights into how cells and tissues function by measuring levels of gene expression. Since all normal cells within an organism possess the same genome, differences in cell identity and function are determined by gene expression. Bulk RNA-Seq (whole transcriptome sequencing) experiments produce a view of gene expression across an entire sample. They do not differentiate among cell types within the sample; rather, they give a view of gene expression within a whole organ or tissue type. This method has been instrumental in the development of many single-cell RNA sequencing methods.


Bulk RNA-Seq data analysis consists of the following key steps:

  • Quality check and preprocessing of raw sequence reads: PCR Clean & Trimmomatic
  • Mapping reads to a reference genome or transcriptome: Bowtie-2
  • Counting reads mapped to individual genes or transcripts: RSemExpTable
  • Identification of differential expression: DeSeq2

Now let's look at an example Bulk RNA-Seq pipeline on the T-Bioinfo Server: https://server.t-bio.info/pipelines/3931418?time=1653977840. We will cover the types of input files that should be uploaded, the parameters chosen to run the pipeline, the processing steps, and finally what the output files look like.


Input files required for processing the pipeline

The Bulk RNA-Seq pipeline accepts the following parameter options and types of input files.

Preliminary Parameters

The following options are available for each parameter:

  • Type of Reference Genome: RModelGenomeGTF, RGenome, RGenomeGTF
  • Format of NGS Data: fastQ, fasta, SAM_BAM, SimulateFQ, SimulateFA
  • Single-end or paired-end reads: SE, PE
  • Organism: selects the reference genome
  • File Upload

Input files can be provided to the pipeline in any of the following ways:

  • “txt” files can be uploaded directly under the “Upload Files” option, or
  • Files can be uploaded from a “Link”, or
  • An “NCBI Run Table” file can be uploaded, or
  • A “.txt” or “.svl” file can be uploaded to bulk import URLs

  • Group Selection

Next, we need to assign the samples to the specific groups they belong to. In the demo pipeline used as the example here, we put the file under one group and proceed.

  • Contrasts

A contrast is used to represent categorical independent variables (factors) in modeling. In particular, it recodes a factor into a set of "contrast variables". Here we select all contrasts and then submit all changes to run the pipeline.
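To make the idea of recoding a factor into contrast variables concrete, here is a minimal sketch using treatment (dummy) coding. The group labels and the function name `treatment_contrasts` are hypothetical and only serve the illustration; the coding scheme the server applies internally is not described in this article.

```python
def treatment_contrasts(levels, reference):
    """Recode a categorical factor into 0/1 contrast variables.

    Each non-reference level gets its own indicator column; samples in the
    reference group are 0 in every column.
    """
    others = sorted(set(levels) - {reference})
    return {
        f"{level}_vs_{reference}": [1 if x == level else 0 for x in levels]
        for level in others
    }


# Hypothetical group labels for four samples.
groups = ["control", "control", "treated", "treated"]
contrasts = treatment_contrasts(groups, reference="control")
print(contrasts)
```

Each resulting column compares one group against the reference group, which is the kind of comparison a differential expression test is later run on.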


Steps for Processing Pipeline

RNA-Seq analysis on the T-Bioinfo Server (Generating Gene Expression Table from raw reads)

The pipeline follows this workflow:

Start > PCR Clean > Trimmomatic > Bowtie-2 > RSemExpTable > DeSeq2

Let's now look at what each step in the pipeline does.

Start: The RNA-Seq analysis pipeline starts with a job called “Start” that compiles the user-selected data input options into a series of tags and generates the correct pipeline options, reducing the number of possible algorithms to the ones that can handle the input data.

You can now create your pipeline in the graphical interface by selecting from the highlighted buttons. Right-clicking a selected button deselects it.

Some buttons will open a parameters dialog box. After selecting all the desired options, select the “end” button to give the pipeline a name, upload data and run the pipeline.

Trimmomatic: The Trimmomatic algorithm removes technical adapter sequences from raw sequencing reads. This pre-processing step is usually performed to improve the quality of read alignment to the reference genome.
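The idea of adapter removal can be sketched in a few lines. This is a simplified illustration, not Trimmomatic's actual algorithm: it uses exact matching only, whereas real trimmers tolerate mismatches and also consider base quality. The function name `trim_adapter`, the `min_overlap` parameter, and the adapter sequence used in the example are assumptions made for this sketch.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter, or a prefix of it hanging off the read end.

    Exact matching only; this is an illustration, not a production trimmer.
    """
    # Case 1: the full adapter occurs in the read, so cut at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits at the 3' end of the read.
    # Try the longest possible overlap first.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read


# Example adapter sequence, assumed here purely for illustration.
adapter = "AGATCGGAAGAGC"
print(trim_adapter("ACGTACGTAGATCGGAAGAGC", adapter))
print(trim_adapter("ACGTACGTAGATC", adapter))
```

Both example reads reduce to the same insert sequence: once because the whole adapter is present, once because only its first five bases were sequenced.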

PCR Clean: The PCR Clean module removes duplicate reads from raw sequencing data. Duplicate reads produced by polymerase chain reaction (PCR) amplification can distort estimates of gene expression levels, so they should be eliminated before the data are processed further. The module accepts raw sequencing reads in fastQ or fastA format. After PCR duplicates are removed, the output is given in the same format as the input (fastQ or fastA). For more info on PCR amplification please see: https://www.ncbi.nlm.nih.gov/probe/docs/techpcr/
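As a rough sketch of duplicate removal, the snippet below keeps the first fastQ record for each distinct read sequence and writes the result back in fastQ format. It assumes that "duplicate" means an identical sequence; the exact criterion used by the PCR Clean module is not stated in this article, and the function name `remove_duplicate_reads` is hypothetical.

```python
def remove_duplicate_reads(fastq_text):
    """Keep only the first fastQ record for each distinct read sequence.

    Assumes well-formed 4-line records: header, sequence, '+', qualities.
    """
    lines = fastq_text.strip().splitlines()
    seen = set()
    kept = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if seq not in seen:
            seen.add(seq)
            kept.extend([header, seq, plus, qual])
    # Output stays in the same format as the input.
    return "\n".join(kept) + "\n"


raw = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n"
cleaned = remove_duplicate_reads(raw)
print(cleaned)
```

Here read r2 is dropped because its sequence is identical to r1, while r3 is kept.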

Bowtie2: Bowtie2 is a fast alignment algorithm based on the “seed” (or k-mer) approach. “Seed” substrings are extracted from the read and from its reverse complement and aligned to the reference in an ungapped fashion. Their positions on the reference are then recorded, and the seeds are extended into full alignments using SIMD-accelerated dynamic programming.

For more info please see: Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. doi:10.1038/nmeth.1923.
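The seed step described above can be illustrated with a toy example. This sketch is not how Bowtie 2 is implemented: it searches the reference with plain string matching instead of a genome index, allows no mismatches, and leaves out the extension step entirely. The function names, seed length, and seed interval are assumptions chosen for the illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_seeds(read, seed_len, interval):
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    return [
        (i, read[i:i + seed_len])
        for i in range(0, len(read) - seed_len + 1, interval)
    ]


def seed_hits(read, reference, seed_len=4, interval=2):
    """Find exact, ungapped seed matches for the read and its reverse complement.

    Returns (strand, read_offset, reference_position) tuples.
    """
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset, seed in extract_seeds(seq, seed_len, interval):
            start = reference.find(seed)
            while start != -1:
                hits.append((strand, offset, start))
                start = reference.find(seed, start + 1)
    return hits


reference = "GATTACAGGCT"
hits = seed_hits("TACAGG", reference)
print(hits)
```

In this example both seeds point to the same starting position on the reference (reference position minus read offset), which is the kind of agreement an aligner uses to decide where to attempt a full alignment.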

RSem: RSEM is an algorithm that quantifies transcript abundances based on the alignment file (SAM file). RSEM produces two files: gene expression and isoform expression. Expression values are reported in the unit chosen with the Rsem Expression Type parameter: “FPKM” (Fragments Per Kilobase of transcript per Million mapped reads), “expected_count”, or “TPM”. For more info please see: Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. doi:10.1186/1471-2105-12-323.
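To show how the expression units relate to each other, here is a small sketch that converts counts into FPKM and TPM using the standard definitions. It is a simplification: RSEM works with estimated (expected) counts and effective transcript lengths, while this example assumes plain counts and plain lengths. The gene counts and lengths are made up for the illustration.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million: length-normalised rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]


# Hypothetical counts and transcript lengths (in bases) for three genes.
counts = [100, 200, 700]
lengths = [1000, 2000, 1000]

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)
```

The first two genes receive the same FPKM and TPM even though their counts differ, because the second gene is twice as long. TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.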

DeSeq2: A common difficulty in the analysis of NGS data is the strong variance of LFC (logarithmic fold change) estimates for genes with low read count. DESeq2 overcomes this issue by shrinking LFC estimates toward zero in a manner such that shrinkage is stronger when the available information for a gene is low, which may be because counts are low, dispersion is high or there are few degrees of freedom.
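The effect of shrinkage can be illustrated with the textbook formula for a normal estimate combined with a zero-centred normal prior. This is only a sketch of the principle: DESeq2 estimates its prior from the data and fits it within its own statistical model, so the function `shrink_lfc` and the fixed `prior_sd` below are assumptions for illustration, not DESeq2's estimator.

```python
def shrink_lfc(lfc, standard_error, prior_sd=1.0):
    """Posterior mean of a log fold change under a zero-centred normal prior.

    The noisier the estimate (larger standard error), the smaller the weight
    it receives and the closer the result is pulled toward zero.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + standard_error ** 2)
    return lfc * weight


# Same raw log fold change, different amounts of information about the gene.
well_measured = shrink_lfc(2.0, standard_error=0.1)
poorly_measured = shrink_lfc(2.0, standard_error=1.0)
print(well_measured, poorly_measured)
```

A gene with a small standard error keeps almost all of its estimated fold change, while a gene with little information (low counts or high dispersion) is shrunk much more strongly.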


Output Files: Obtained when pipeline processing is complete

After the pipeline has completed its processing, you will obtain a list of output files that can be downloaded for statistical analysis and biological interpretation. The output also includes data visualizations that help reveal meaningful patterns and significant results.

Interactive dashboard - RNA-Seq Analysis on T-Bioinfo Server

  • FastQC.zip: FastQC provides a simple way to carry out quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report that summarizes the results, giving a quick impression of whether there are any problems with the data.

  • PDX_1_expression_genes_DeSeq2_All.txt: DESeq2 is a tool for differential gene expression analysis of RNA-seq data. The “PDX_1_expression_genes_DeSeq2_All.txt” file contains the expression value of each gene in all the samples, together with the results of the statistical tests used to list differentially expressed genes along with their statistical significance. The most important values are:

      • P-Value: The probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference.
      • P-Adj Value: The p-value after adjustment for multiple testing. Because a separate significance test is performed for every gene, each p-value is adjusted so that the set of logically related tests as a whole conforms to the chosen overall significance criterion.
      • Log2FoldChange: Indicates how much the expression of the gene or transcript appears to have changed between the comparison and control groups. This value is reported on a logarithmic scale to base 2.
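To make the P-Adj and Log2FoldChange columns more tangible, here is a minimal sketch of the Benjamini-Hochberg adjustment, which is the default adjustment method in DESeq2, and of a plain log2 fold change between two group means. The p-values and mean expression values are invented for the example, and DESeq2 itself computes fold changes from its fitted model rather than from raw means.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(comparison_mean, control_mean):
    """Log2 ratio of mean expression in the comparison group versus control."""
    return math.log2(comparison_mean / control_mean)


# Hypothetical raw p-values for four genes.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
print(log2_fold_change(40.0, 10.0))
```

A log2 fold change of 2 means a four-fold increase relative to the control group, and a value of -1 means the expression was halved.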

  • PDX_1_expression_genes_FPKM_DeSeq2_All.txt: This file contains the differentially expressed gene ids (along with their tested statistical values) with respect to FPKM (fragments per kilobase of exon per million mapped fragments) counts.

The following are some of the plots produced by running the Bulk RNA-Seq pipeline. Let's now look at what each plot signifies:

Volcano Plot: Shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.

IFC Plot: Visualizes the differences between measurements taken in two samples, with log fold change on the Y axis and the log of normalized counts on the X axis.

MA Plot: An application of a Bland–Altman plot for the visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
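The coordinates behind these plots are simple transformations. The sketch below computes a volcano plot point from a log2 fold change and a p-value, and an MA plot point from two expression values, using the usual definitions. The function names and the input numbers are assumptions for illustration; the server's plots may apply additional normalization.

```python
import math


def volcano_point(log2_fold_change, p_value):
    """X is the log2 fold change, Y is -log10 of the p-value."""
    return (log2_fold_change, -math.log10(p_value))


def ma_point(x, y):
    """Return (M, A) for expression values x and y of one gene in two samples.

    M is the log ratio (plotted on the Y axis) and A is the mean of the
    log values (plotted on the X axis).
    """
    m = math.log2(x) - math.log2(y)
    a = 0.5 * (math.log2(x) + math.log2(y))
    return (m, a)


print(volcano_point(2.0, 0.001))
print(ma_point(8.0, 2.0))
```

In a volcano plot, genes that are both strongly changed and highly significant end up in the upper left and upper right corners. In an MA plot, genes with no change lie along the horizontal line M = 0.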

To learn more about each section and get practical hands-on experience, get started with the “Transcriptomics” coursework on the OmicsLogic Learn Portal.

Link to the Course: https://learn.omicslogic.com/courses/course/course-5-transcriptomics

You can also enroll in the mentor-guided "Transcriptomic Data Analysis" Training Program: https://learn.omicslogic.com/programs/transcriptomics-for-biomedical-research

OmicsLogic offers research services in the field of bioinformatics to support biologists and bioinformaticians with tasks such as data preprocessing, exploratory data analysis, and downstream analysis. If you require assistance with your data and are interested in our research services, please don't hesitate to contact us at [email protected].


More articles by Sonalika Ray
