Overview of Bulk RNA-Seq on T-Bioinfo Server
RNA-Seq technology provides insights into how cells and tissues function by measuring gene expression levels. Since all normal cells within an organism possess the same genome, differences in cell identity and function are determined by gene expression. Bulk RNA-Seq (whole transcriptome sequencing) experiments produce a view of gene expression across an entire sample. They do not differentiate among cell types within the sample; rather, they give a view of gene expression within a whole organ or tissue type. This method has been instrumental in the development of many single-cell RNA sequencing methods.
Bulk RNA-Seq data analysis consists of the following key steps:
Now let's look at an example Bulk RNA-Seq pipeline on the T-Bioinfo Server: https://server.t-bio.info/pipelines/3931418?time=1653977840 and learn about the types of input files to upload, the parameters chosen to run the pipeline, the processing steps, and finally what the output files look like.
Input files required for processing the pipeline
To run the Bulk RNA-Seq pipeline, the following optional parameters can be set and the following types of input files can be uploaded.
Preliminary Parameters
Various options are available for setting the parameters, as listed:
Input files can be uploaded in various formats, as mentioned below:
Next, we need to group the samples according to the experimental group they belong to. In the demo pipeline used as the example here, we put the file under one group and proceed.
Contrasts are used to represent categorical independent variables (factors) in modeling. In particular, a contrast recodes a factor into a set of "contrast variables". We will select all and then submit all changes to run the pipeline.
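To make the idea of recoding a factor more concrete, here is a minimal sketch of treatment (dummy) coding, one common contrast scheme. The function name, the group labels and the choice of reference level are assumptions made for illustration; the source does not say which coding scheme the server applies internally.

```python
def treatment_contrasts(groups, reference):
    """Recode a list of group labels into 0/1 contrast (dummy) columns.

    The reference level is encoded as all zeros; every other level
    gets its own indicator column.
    """
    levels = sorted(set(groups) - {reference})
    return {
        f"{level}_vs_{reference}": [1 if g == level else 0 for g in groups]
        for level in levels
    }


# Hypothetical sample-to-group assignment, for illustration only.
sample_groups = ["control", "control", "treated", "treated"]
contrasts = treatment_contrasts(sample_groups, reference="control")
print(contrasts)  # {'treated_vs_control': [0, 0, 1, 1]}
```

With three or more groups, the same function returns one contrast variable per non-reference group.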
Steps for Processing Pipeline
To run the pipeline, we follow this workflow:
Start > PCR Clean > Trimmomatic > Bowtie-2t > RsemExpTable > DeSeq2
Let's now understand the functionality of each step in the pipeline.
Start: The RNA-Seq analysis pipeline starts with a job called “Start” that compiles the user-selected data input options into a series of tags and generates the correct pipeline options, reducing the number of possible algorithms to the ones that can handle the input data.
Based on the highlighted buttons, you can now create your pipeline using the graphical interface. Right-clicking a selected button deselects it.
Some buttons open a parameters dialog box. After selecting all the desired options, select the “end” button to give the pipeline a name, upload data and run the pipeline.
Trimmomatic: The Trimmomatic algorithm removes technical adapter sequences from raw sequencing reads. This pre-processing step is usually performed to ensure better alignment quality of the reads to the reference genome.
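The idea of adapter removal can be sketched in a few lines. This is a simplified illustration that uses exact string matching only, not Trimmomatic's actual algorithm; the function name, the `min_overlap` parameter and the example adapter sequence are assumptions chosen for the example.

```python
def clip_adapter(read, adapter, min_overlap=3):
    """Remove an adapter from the 3' end of a read using exact matching."""
    # Case 1: the full adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter (partial adapter).
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


# Example adapter sequence, assumed for illustration.
ADAPTER = "AGATCGGAAGAGC"
print(clip_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER))
print(clip_adapter("ACGTACGTAGATC", ADAPTER))
```

Real trimming tools also tolerate mismatches in the adapter, which this sketch does not attempt.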
PCR Clean: The PCR Clean module removes duplicate reads from raw sequencing data. Duplicate reads produced by polymerase chain reaction (PCR) amplification can distort estimates of gene expression levels, so they should be eliminated before the data is processed further. The module accepts raw sequencing reads in FASTQ or FASTA format. After the PCR duplicates are removed, the output is written in the same format as the input (FASTQ or FASTA). For more info on PCR amplification please see: https://www.ncbi.nlm.nih.gov/probe/docs/techpcr/
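As a rough sketch of what duplicate removal means, the snippet below keeps only the first read for each distinct sequence. It assumes that a "duplicate" is a read with an identical sequence; the source does not describe the criterion the PCR Clean module actually uses, and the record layout and names here are hypothetical.

```python
def remove_duplicate_reads(records):
    """Keep the first occurrence of each read sequence.

    `records` is an iterable of (read_id, sequence, quality) tuples,
    i.e. the fields of a FASTQ record (an assumed in-memory layout).
    """
    seen = set()
    unique = []
    for read_id, sequence, quality in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((read_id, sequence, quality))
    return unique


# Toy reads for illustration only.
reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),
    ("read2", "ACGTACGT", "IIIIHHHH"),  # same sequence as read1
    ("read3", "TTGACCAA", "IIIIIIII"),
]
deduplicated = remove_duplicate_reads(reads)
print([read_id for read_id, _, _ in deduplicated])
```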
Bowtie2: Bowtie2 is a fast alignment algorithm based on the “seed” (or k-mer) approach. “Seed” substrings are extracted from the read and its reverse complement and aligned to the reference in an ungapped fashion. Their positions on the reference are then recorded, and the seeds are extended into full alignments using SIMD-accelerated dynamic programming.
For more info please see: Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359. doi:10.1038/nmeth.1923.
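The seed step described above can be sketched as follows. This is only an illustration of seed extraction and ungapped exact matching: it uses a naive string search, whereas Bowtie2 itself relies on an index of the reference, and the extension step is left out. The function names, the seed length and the seed interval are assumptions for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_seeds(read, seed_len, interval):
    """Return (offset, strand, seed) for the read and its reverse complement."""
    seeds = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(0, len(seq) - seed_len + 1, interval):
            seeds.append((offset, strand, seq[offset:offset + seed_len]))
    return seeds


def find_exact_hits(seed, reference):
    """All positions where the seed occurs in the reference (ungapped, exact)."""
    hits = []
    start = reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)
    return hits


# Toy reference and read, for illustration only.
reference = "TTACGTACGGATCCA"
read = "ACGTACGG"
for offset, strand, seed in extract_seeds(read, seed_len=4, interval=2):
    print(offset, strand, seed, find_exact_hits(seed, reference))
```

The recorded hit positions are the candidates that a full aligner would then extend into gapped alignments.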
RSem: RSEM is an algorithm that quantifies transcript abundances based on the alignment file (SAM file). RSEM produces two files: gene expression and isoform expression. Expression values are reported in FPKM (Fragments Per Kilobase of transcript per Million mapped reads). The Rsem Expression Type parameter lets us choose the format of the expression values: “FPKM”, “expected_count” or “TPM”. For more info please see: Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. doi:10.1186/1471-2105-12-323.
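To see how these units relate to each other, here is a small sketch that converts fragment counts into FPKM and TPM using their standard definitions. It only illustrates the unit conversion: RSEM estimates the expected counts with its own statistical model and works with effective transcript lengths, which this example does not reproduce. The function names and the toy numbers are assumptions.

```python
def fpkm(counts, lengths):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    total_fragments = sum(counts)
    return [c * 1e9 / (length * total_fragments)
            for c, length in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts Per Million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Toy counts and transcript lengths (in bases), for illustration only.
counts = [100, 200, 700]
lengths = [1000, 2000, 7000]
print(fpkm(counts, lengths))
print(tpm(counts, lengths))
```

In this toy case every transcript has the same count-to-length ratio, so all three get the same FPKM and the same TPM even though their raw counts differ. TPM values always sum to one million within a sample, which is not true of FPKM in general.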
DeSeq2: A common difficulty in the analysis of NGS data is the strong variance of LFC (logarithmic fold change) estimates for genes with low read counts. DESeq2 overcomes this issue by shrinking LFC estimates toward zero in such a way that shrinkage is stronger when the available information for a gene is low, which may be because counts are low, dispersion is high, or there are few degrees of freedom.
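The intuition behind shrinkage can be shown with a toy calculation. The sketch below is not DESeq2's actual procedure; it assumes a zero-centered normal prior on the LFC and a normal estimate with a known standard error, in which case the shrunken value is the raw LFC multiplied by a weight between 0 and 1. The function name and the numbers are hypothetical.

```python
def shrink_lfc(lfc, standard_error, prior_sd):
    """Shrink a log fold change toward zero under a zero-centered normal prior.

    The weight approaches 1 when the estimate is precise (small standard
    error) and approaches 0 when the estimate is noisy.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + standard_error ** 2)
    return lfc * weight


# Same raw LFC, different amounts of information (illustrative values).
print(shrink_lfc(2.0, standard_error=0.1, prior_sd=1.0))  # well-measured gene
print(shrink_lfc(2.0, standard_error=2.0, prior_sd=1.0))  # low-count, noisy gene
```

The well-measured gene keeps almost all of its fold change, while the noisy gene is pulled strongly toward zero.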
Output Files: Obtained when pipeline processing is complete
After the pipeline has completed its processing, you will obtain a list of output files that can be downloaded to carry out statistical analysis and interpret biological insights. The output files also include data visualizations that help reveal meaningful patterns and significant results.
The following are some of the interesting plots produced by the Bulk RNA-Seq pipeline. Let's now try to understand what each plot signifies:
Volcano Plot: Shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes.
LFC Plot: Visualizes the differences between measurements taken in two samples by placing the log fold change on the Y axis and the log of normalized counts on the X axis.
MA Plot: An application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
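The axes of these plots are simple transformations of the data. As a minimal sketch, the functions below compute the coordinates of a single gene on an MA plot and on a volcano plot; base-2 logs for expression and a negative base-10 log for the P value are common conventions assumed here, and the function names and input values are hypothetical.

```python
import math


def ma_coordinates(control, treated):
    """Return (M, A): the log ratio and the mean of the log values."""
    m = math.log2(treated) - math.log2(control)
    a = 0.5 * (math.log2(treated) + math.log2(control))
    return m, a


def volcano_coordinates(control, treated, p_value):
    """Return (log2 fold change, -log10 P value) for one gene."""
    return math.log2(treated / control), -math.log10(p_value)


# Illustrative normalized expression values for one gene.
print(ma_coordinates(control=8.0, treated=32.0))
print(volcano_coordinates(control=8.0, treated=32.0, p_value=0.001))
```

Plotting M against A for every gene gives the MA plot, and plotting the second pair of values for every gene gives the volcano plot.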
To learn more about each section and get practical hands-on experience, get started with the “Transcriptomics” coursework on the OmicsLogic Learn Portal.
Link to the Course: https://learn.omicslogic.com/courses/course/course-5-transcriptomics
You can also enroll in the mentor-guided "Transcriptomic Data Analysis" Training Program: https://learn.omicslogic.com/programs/transcriptomics-for-biomedical-research
OmicsLogic offers research services in the field of bioinformatics to support biologists and bioinformaticians with tasks such as data preprocessing, exploratory data analysis, and downstream analysis. If you require assistance with your data and are interested in our research services, please don't hesitate to contact us at [email protected].