What Do You Know About High-Throughput Sequencing?

What Is First-Generation Sequencing?

First-generation sequencing is also known as Sanger sequencing. It relies on the principle that a dideoxynucleotide, once incorporated, terminates DNA strand extension. For example, suppose the sequence is ATCGCTA. We set up separate reactions, each containing the four normal nucleotides (A, T, C, G) plus one type of dideoxynucleotide. In the reaction with dideoxy-A, extension can stop wherever an A is incorporated, so we obtain two fragments, A and ATCGCTA, which tells us that A occupies the first and seventh positions of the sequence. Similarly, the reactions with the other dideoxynucleotides (T, C, and G) reveal the positions of the remaining bases, and combining them gives the complete sequence. In practice, all of this is read out by an instrument.
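The ATCGCTA example can be sketched in a few lines of Python. This is a toy model only: it treats the example string as the strand being synthesized, and the function names `terminated_fragments` and `read_sequence` are made up for illustration.

```python
def terminated_fragments(template: str, dd_base: str) -> list[str]:
    """Return every prefix of `template` that ends at `dd_base`.

    Models one reaction: a dideoxynucleotide of type `dd_base` can stop
    extension at any position where that base is incorporated.
    """
    return [template[: i + 1] for i, base in enumerate(template) if base == dd_base]


def read_sequence(template: str) -> str:
    """Rebuild the sequence from fragment lengths across the four reactions."""
    position_to_base = {}
    for dd_base in "ACGT":
        for fragment in terminated_fragments(template, dd_base):
            # The fragment length gives the position; the reaction gives the base.
            position_to_base[len(fragment)] = dd_base
    return "".join(position_to_base[pos] for pos in sorted(position_to_base))


print(terminated_fragments("ATCGCTA", "A"))  # ['A', 'ATCGCTA']
print(read_sequence("ATCGCTA"))              # ATCGCTA
```

Sorting the fragments by length is the software analogue of separating them by size on the instrument.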

First-generation sequencing is fast, but it can read only one sequence at a time, and the maximum read length is about 1000-1500 bp. Simply put, it measures a single sequence of around 1000 bp per run, so it is widely used for single-sequence work such as detecting mutation sites in a single sequence.

What Is Second-Generation Sequencing?

Second-generation sequencing is also known as high-throughput sequencing. It overcomes the limitation that first-generation sequencing can read only one sequence at a time. As research deepened, scientists began to analyze all the sequence information in a species or sample, and reading one sequence at a time could no longer meet that need. This is where second-generation sequencing came from. It is called high-throughput sequencing because it can sequence many fragments at once. The DNA is broken physically or chemically into numerous small fragments (250-300 bp), and these fragments are enriched by building a library. The library is then loaded onto a sequencer, which has regions for DNA fragments to attach to. Each fragment occupies its own attachment region, so the sequencer can read all the attached fragments in parallel. Finally, bioinformatics analysis splices the short fragments into longer sequences.

Characteristics of second-generation sequencing: a large number of sequences can be sequenced in one run, but fragments are limited to 250-300 bp. Because long sequences are reconstructed by splicing fragments through their overlapping regions, some regions are sequenced several times. Because PCR is used to enrich fragments during library construction, low-abundance sequences may not be amplified sufficiently, which loses some information, and PCR may also introduce mismatched bases. These limitations motivated third-generation sequencing.

What Is Third-Generation Sequencing?

Third-generation sequencing can be seen as an upgrade to second-generation sequencing. Simply put, it still sequences many molecules at once, but read lengths reach about 10 kb, and it does not require PCR enrichment. Sequencing the molecules directly avoids the information loss and base mismatches that PCR can cause. However, third-generation sequencing still has drawbacks: it relies on the activity of DNA polymerase, its cost is high, and its error rate is much higher than that of second-generation sequencing. Fortunately, third-generation errors occur randomly and can be corrected by increasing coverage (although this raises the cost of sequencing).

「Second- and third-generation sequencing are collectively referred to as high-throughput sequencing, also known as "next-generation sequencing" or "deep sequencing".」

What Is de novo Sequencing?

de novo sequencing, also known as ab initio sequencing, sequences a species without any existing sequence information, and uses bioinformatics analysis to splice and assemble the sequences to obtain a genomic map of the species.

What Is Whole Genome Resequencing (WGS)?

Whole genome resequencing sequences the genomes of different individuals of a species whose genome sequence is already known, and uses the results to analyze differences between individuals or populations. By constructing insert libraries of different lengths and combining short-read and paired-end sequencing, high-throughput sequencing can detect common, low-frequency, and even rare mutation sites associated with diseases or with plant and animal traits, as well as structural variations, across the whole genome. This gives it great value for both scientific research and industry.

What Is Pan-Genome?

In 2005, Tettelin et al. proposed the concept of pan-genome for the first time in bacterial research. It refers to a non-redundant collection of genes/genome sequences from an entire species, including the core genome, which is present in almost all individuals of the species, and the accessory/variable/dispensable genome, which is present in only some individuals. A generalized pangenome is a collection that captures all the genetic information of a species.
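The core/accessory split can be illustrated with plain set operations. The sketch below assumes each individual is represented by a set of gene names (the strain and gene names are invented), and it uses a strict definition of "core" (present in every individual), whereas the definition above allows "almost all".

```python
def split_pan_genome(gene_sets: dict[str, set[str]]) -> tuple[set[str], set[str], set[str]]:
    """Return (pan, core, accessory) gene sets for a non-empty collection of individuals."""
    all_sets = list(gene_sets.values())
    pan = set().union(*all_sets)             # every gene seen in any individual
    core = set.intersection(*all_sets)       # genes shared by all individuals
    accessory = pan - core                   # genes present in only some individuals
    return pan, core, accessory


strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB"},
}
pan, core, accessory = split_pan_genome(strains)
print(sorted(pan))        # ['geneA', 'geneB', 'geneC', 'geneD']
print(sorted(core))       # ['geneA', 'geneB']
print(sorted(accessory))  # ['geneC', 'geneD']
```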

What Is BSA Sequencing?

BSA (Bulked Segregant Analysis), also called mixed-pool or bulk segregation analysis, evolved from near-isogenic line (NIL) analysis. Near-isogenic lines are a group of lines with the same or very similar genetic background that differ only in individual chromosome segments. The BSA method overcomes the fact that NILs are unavailable or difficult to create for many crops. Its principle is to select 10-20 individuals from each phenotypic extreme of the target trait in a segregating population and mix their DNA to construct two DNA "pools". The two pools should differ in the trait of interest, while all loci other than the one carrying the gene of interest are randomized. In other words, the differences between the two DNA pools are equivalent to the differences between the genomes of two near-isogenic lines: they differ only in the target region, while the rest of the genetic background is effectively the same. The two pools are then screened with markers, and markers that are polymorphic between the pools may be linked to the gene or QTL of interest. When testing for polymorphisms between the two pools, the DNA of both parents is usually included as a control to support correct interpretation of the results.
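One way to picture the marker screen is to compare allele frequencies between the two pools. The sketch below is an illustration under assumptions that are not in the text above: each marker is summarized by read counts supporting one allele in each pool, and the 0.5 cutoff is arbitrary rather than a recommended value.

```python
def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads in a pool that carry the alternative allele."""
    return alt_reads / total_reads


def candidate_markers(
    pools: dict[str, tuple[tuple[int, int], tuple[int, int]]],
    threshold: float = 0.5,
) -> list[str]:
    """Return markers whose allele frequency differs strongly between pools.

    `pools` maps marker name -> ((alt, total) in pool 1, (alt, total) in pool 2).
    `threshold` is an arbitrary illustrative cutoff.
    """
    hits = []
    for name, (pool_1, pool_2) in pools.items():
        delta = allele_frequency(*pool_1) - allele_frequency(*pool_2)
        if abs(delta) >= threshold:
            hits.append(name)
    return hits


toy_counts = {
    "marker_1": ((10, 20), (11, 20)),  # similar in both pools: likely unlinked
    "marker_2": ((19, 20), (1, 20)),   # very different: possibly linked to the trait
    "marker_3": ((9, 20), (12, 20)),
}
print(candidate_markers(toy_counts))  # ['marker_2']
```

Markers at randomized loci sit near the same frequency in both pools, so only markers near the target region stand out.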

What Is Whole Exome Sequencing (WES)?

Whole exome sequencing is a genomic analysis method that uses sequence capture technology to capture and enrich DNA from the exonic regions of the genome for high-throughput sequencing. Exome sequencing costs less than whole genome resequencing and is well suited to studying SNPs and indels in known genes, but it cannot detect genomic structural variants such as chromosomal breakage and rearrangement.

What Is RNA Sequencing (RNA-seq)?

RNA sequencing studies the transcriptome, that is, the sum of all the mRNAs transcribed by a given cell in a given functional state.

RNA-seq allows researchers to study transcript structure (gene boundary identification, alternative splicing, etc.), transcript variation (e.g., gene fusions, coding-region SNPs), non-coding regions (non-coding RNAs, microRNA precursors, etc.), gene expression levels, and novel transcripts.

What Is Epigenetics?

Epigenetics is a branch of genetics that studies heritable changes in gene expression or cellular phenotype that are induced by certain mechanisms without any alteration of the DNA sequence.

What Is ChIP-seq?

ChIP-seq uses chromatin immunoprecipitation (ChIP) to specifically enrich the DNA fragments bound by a target protein, which are then purified, built into a library, and sequenced. The millions of sequence tags obtained are mapped precisely onto the genome, yielding genome-wide information on the DNA segments that interact with histones, transcription factors, and other proteins.

What Is ATAC-Seq?

ATAC-Seq stands for "Assay for Transposase-Accessible Chromatin with high-throughput Sequencing". The ATAC-Seq method relies on the construction of next-generation sequencing (NGS) libraries using the highly active transposase Tn5.

NGS adaptors are loaded onto the transposase, which fragments the chromatin and simultaneously inserts these adaptors into open chromatin regions. The resulting library can be sequenced by NGS and analyzed with bioinformatics to identify genomic regions with open, accessible chromatin.

The main advantages of ATAC-Seq over other techniques (e.g., FAIRE-Seq or DNase-Seq, which study similar chromatin characteristics) are the smaller number of cells required for the assay and the relative simplicity of its two-step operation.

What Is Methylation Sequencing?

DNA methylation is an important component of epigenetics and plays an important role in maintaining normal cellular function, genetic imprinting, embryonic development, and human tumorigenesis. Whole Genome Bisulfite Sequencing (WGBS) treats genomic DNA with bisulfite, which converts unmethylated cytosine (C) into uracil (U) while leaving methylated cytosine unchanged. By sequencing the treated DNA across the whole genome and comparing it with the reference genome, methylation levels can be analyzed genome-wide at single-base resolution and with high precision. It is widely used in basic mechanistic research such as cell differentiation and tissue development, as well as in animal and plant breeding and in human health and disease research.
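A toy illustration of the bisulfite idea follows. It assumes methylated cytosines are protected from conversion and that, after amplification, a converted base is read as T, so the methylation level at a reference cytosine can be estimated as C reads / (C reads + T reads). The function names and example values are invented for this sketch.

```python
def bisulfite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Convert unmethylated C to U; C at a methylated (0-based) position stays C."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def methylation_level(c_reads: int, t_reads: int) -> float:
    """Fraction of reads still showing C at a reference cytosine."""
    return c_reads / (c_reads + t_reads)


print(bisulfite_convert("ACGTCCGA", {1}))  # ACGTUUGA
print(methylation_level(8, 2))             # 0.8
```

In the first call only the cytosine at position 1 is methylated, so it survives while the two cytosines at positions 4 and 5 are converted.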

What Is Small RNA Sequencing?

Small RNAs (e.g., miRNA, siRNA, and other small ncRNAs) are a large class of regulatory molecules found in almost all living organisms. They play important roles in physiological processes such as the regulation of gene expression, individual development, metabolism, and the occurrence of disease. Large-scale sequencing of small RNAs yields genome-wide miRNA profiles for a species and supports applications including the discovery of new small RNA molecules, prediction and identification of their target genes, differential expression analysis among samples, and small RNA clustering and expression profiling.

What Is lncRNA Sequencing?

lncRNA is a class of non-coding RNA greater than 200 nt in length that cannot encode proteins ≥30 amino acids in length. lncRNA can be localized in the cytoplasm, as well as in chromatin and the nucleus, and regulate the action of related genes through a variety of modes, including epigenetic regulation, transcriptional regulation, post-transcriptional regulation, and regulation of protein activity. lncRNA has a shorter average transcript length and fewer isoforms than mRNA, and although lncRNA is expressed at a lower level, it has tissue and cell type specificity.

What Is Microbial Diversity Sequencing?

Microbial diversity sequencing amplifies the hypervariable regions of microbial 16S rDNA, 18S rDNA, and ITS and sequences them at high throughput. It reveals the species composition and relative abundance of bacteria, archaea, and fungi in an environmental sample, along with the community structure, evolutionary relationships, and correlations between the microbes and their environment.

What Is Metagenomic Sequencing?

Metagenomic sequencing (MSS) uses high-throughput sequencing to study the genomes of the microbial communities in a specific environment. It analyzes microbial diversity, population structure, gene function, metabolic networks, and evolutionary relationships, and further explores the functional activities of microbial communities, their interactions, and their relationships with the environment. Because metagenomic sequencing does not depend on isolating microbes and growing them in pure culture, it widens access to microbial resources and provides an effective tool for studying environmental microbial communities.

What Is Single-Cell Sequencing?

Single-cell sequencing, in short, sequences and analyzes the genome, transcriptome, and epigenome at the level of a single cell. Traditional sequencing is performed on many cells together, so what it actually measures is the average signal across a population of cells, and information about cell heterogeneity (differences between cells) is lost. Single-cell sequencing can detect the heterogeneity that bulk sequencing of mixed samples cannot, which solves this problem.

The best-known traditional second-generation method is RNA-seq, which extracts the mixed RNA (bulk RNA) of a tissue, an organ, or a group of cells for sequencing. What it yields is the average transcriptome of that group of cells, and information specific to individual cells, such as specifically expressed genes or distinct RNA splice isoforms, is often masked. As the study of biological structure and function has deepened, it has become increasingly clear that transcriptome expression levels differ between cells, even within seemingly identical cell populations. Take a tumor as an example: cells at the center of the mass, cells at its edge, cells surrounding it, and even cells that have metastasized to distant sites can be expected to differ in their transcriptomes and other genetic information. Traditional approaches usually study the mass as a whole, or divide it into a few simple partitions and average the gene expression of the cells in each part. This loses the heterogeneity of individual cells and prevents researchers from gaining a deeper understanding of transcriptome expression and immune function across the various cells in the tumor microenvironment.

What Is Spatial Transcriptome Sequencing?

In multicellular organisms, gene expression in individual cells follows a strict temporal and spatial order; that is, gene expression is both temporally and spatially specific. Temporal specificity can be resolved by sampling at different time points and using single-cell transcriptome sequencing to analyze cell types and gene expression patterns over time. Spatial information is harder to obtain. Neither conventional transcriptome sequencing nor single-cell transcriptome sequencing can readily recover the original location of cells, and conventional in situ hybridization is difficult to scale to high-throughput detection.

The 10x Genomics Visium Spatial Gene Expression solution measures total mRNA in intact tissue sections, combines the spatial information on total mRNA with tissue morphology, and maps where all gene expression occurs, producing a complex and complete gene expression map of the tissue under study, such as diseased tissue. Spatial locations are preserved while different cell populations are identified, providing important information about cell function, phenotype, and positional relationships in the tissue microenvironment.

After sequencing, when you get the sequencing data, you may encounter these terms:

What Are Reads?

The short sequences generated by a high-throughput sequencing platform are called reads. PE125, for example, denotes paired-end sequencing with a read length of 125 bp.

What Are Sequencing Depth and Coverage?

Sequencing depth: the ratio of the total number of bases obtained by sequencing (in bp) to the size of the genome. It is one of the indicators used to evaluate the amount of sequencing. For example, if the genome is 2 Mb and the sequencing depth is 10X, the total data obtained is 20 Mb. Depth can also be interpreted as the average number of times each base in the genome has been sequenced.

Sequencing coverage: the proportion of the whole genome that is covered by the sequences obtained, which can also be understood as how completely the target genome is covered. Because of high-GC regions, repetitive sequences, and other complex structures in the genome, the sequences obtained by sequencing and assembly often fail to cover some regions; each such unobtained region is called a gap. For example, if a bacterial genome is sequenced with 98% coverage, 2% of the sequence has not been obtained.
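Both quantities reduce to simple arithmetic. The sketch below reuses the 2 Mb / 20 Mb numbers from the depth example; the per-base depth list is a made-up toy genome of ten positions, and the function names are illustrative.

```python
def sequencing_depth(total_bases: int, genome_size: int) -> float:
    """Average number of times each base of the genome was sequenced."""
    return total_bases / genome_size


def coverage_fraction(per_base_depth: list[int]) -> float:
    """Fraction of genome positions covered by at least one read."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / len(per_base_depth)


print(sequencing_depth(20_000_000, 2_000_000))  # 10.0

toy_depths = [12, 9, 0, 11, 10, 0, 8, 13, 10, 7]  # two positions fall in gaps
print(coverage_fraction(toy_depths))              # 0.8
```

Note that a high average depth does not guarantee full coverage: the toy genome above is sequenced 8 times per base on average, yet 20% of its positions are never read.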

What Are Single-End and Paired-End Sequencing?

Roche 454, Solexa, and ABI SOLiD all support both single-end and paired-end sequencing. Using Solexa as an example, single-end sequencing and the two paired approaches (paired-end and mate-pair) are described below.

Single-end (SE): DNA samples are first fragmented into 200-500 bp fragments. Primer sequences are attached to one end of each fragment, adaptors are then added to the ends, and the fragments are immobilized on a flowcell to generate DNA clusters, which are sequenced on the machine from one end only.

Paired-end (PE): sequencing primer binding sites are added to the adaptors at both ends of each fragment when the DNA library is constructed. After the first round of sequencing is complete, the template strand used in the first round is removed, and the paired-end module guides regeneration and amplification of the complementary strand at the original position. Once there is enough template, the second round of sequencing reads the complementary strand from the other end.

Mate-pair (MP): library preparation is designed to generate short DNA fragments that contain the sequences at the two ends of a large genomic segment (2-10 kb). More specifically, genomic DNA is first randomly sheared to a chosen size within the 2-10 kb range. After end repair, biotin labeling, and circularization, the circularized DNA is broken into 400-600 bp fragments, and the biotin-labeled fragments are captured with streptavidin-coated magnetic beads. The captured fragments are then end-modified and ligated to specific adaptors to build the mate-pair library, which is sequenced on the machine.

When you are resequencing a genome, you may encounter these terms:

What Are SNP and SNV?

「SNP:」A single nucleotide polymorphism is a polymorphism caused by a single-nucleotide variation (a substitution, insertion, or deletion) at the same position in the genomic DNA sequence of different individuals. It is an important basis for studying genetic variation in human families and in animal and plant lines. On average, about one single nucleotide polymorphism occurs per 1,000 nucleotides in the human genome; some may be related to disease, but most are not.

「SNV:」Single nucleotide variant. A single-nucleotide variation that is specific to cancer tissue when compared with normal tissue is a somatic mutation and is called an SNV.

What Is an INDEL?

An insertion or deletion of a small fragment (<50 bp) in the genome. Like SNPs/SNVs, indels are small-scale variants.
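The size-based distinction can be sketched as a small classifier. It assumes a variant is given as a reference allele and an alternative allele written as strings (in the style of a VCF record), and it applies the <50 bp cutoff from the definition above to the length difference; the function name and category labels are chosen for illustration.

```python
def classify_variant(ref: str, alt: str, indel_max_len: int = 50) -> str:
    """Classify a variant from its reference and alternative alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV" if ref != alt else "no change"
    length_change = abs(len(ref) - len(alt))
    if length_change == 0:
        return "multi-base substitution"
    if length_change < indel_max_len:
        return "insertion" if len(alt) > len(ref) else "deletion"
    # Larger insertions and deletions are treated as structural variants.
    return "structural variant"


print(classify_variant("A", "G"))             # SNV
print(classify_variant("A", "ATT"))           # insertion
print(classify_variant("ACGT", "A"))          # deletion
print(classify_variant("A", "A" + "T" * 60))  # structural variant
```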

What Are CNV and SV?

「CNV:」Copy number variation is a form of genomic variation in which large segments of DNA are present in an abnormal number of copies.

「SV:」Structural variation of the genome. It mainly includes insertions and deletions of large chromosomal segments (which change copy number and so give rise to CNV), inversions (a region within a chromosome flipped in orientation), and translocations (recombination between two chromosomes), among others.

What Is an SD Region?

「SD Region:」A segmental duplication, which consists of a number of DNA segments arranged in tandem that are similar in sequence. Human chromosomes Y and 22 contain large SD sequences.

You may encounter these questions when analyzing transcriptome data:

What Are Transcripts? Why Can a Gene Have Multiple Transcripts?

A transcript is a mature mRNA (one or more per gene) that a gene is transcribed into in order to code for protein. When we look up a gene in a database, however, we often find multiple transcripts listed for it. Why can a gene have multiple transcripts?

This is due to alternative splicing. After transcription, a precursor mRNA is formed first. The mature mRNA is produced by cutting out the introns, joining the exons, and adding a cap at the 5' end and a tail at the 3' end. During splicing, however, an exon may be skipped or part of an intron may be retained, so a single gene can give rise to multiple mRNAs, that is, multiple transcripts.

What Are RPKM and FPKM?

Both RPKM and FPKM are used to express gene expression levels.

RPKM: Reads Per Kilobase per Million mapped reads. It is the number of reads from a gene per kilobase of gene length per million reads, and it indicates how strongly the gene is expressed.

FPKM: Fragments Per Kilobase per Million mapped fragments. It is very similar to RPKM; the only difference is that it counts fragments instead of reads.

RPKM was created for early SE sequencing, and FPKM adapts RPKM to PE sequencing. Once the difference between reads and fragments is clear, the two concepts are easy to tell apart. A read is each read in the FASTQ data that comes off the sequencer, and a fragment is each nucleic acid fragment that was sequenced. In SE sequencing, each fragment yields only one read, so the number of reads equals the number of fragments. In PE sequencing, each fragment is read from both ends and yields 2 reads, but because of later quality or alignment filtering, it is possible that only one of the 2 reads from a fragment enters the final expression analysis. In short, the 2 reads of a pair count as only one fragment, so the final number of reads is between 1 and 2 times the number of fragments.
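As a rough sketch, RPKM can be written as gene reads × 10^9 / (gene length in bp × total mapped reads); FPKM uses the same formula with fragment counts in place of read counts. The example numbers and the helper `count_reads_and_fragments` below are invented for illustration.

```python
def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene length per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


def count_reads_and_fragments(pairs: list[tuple[bool, bool]]) -> tuple[int, int]:
    """Count surviving reads and fragments after filtering paired-end data.

    Each tuple says whether read 1 and read 2 of one fragment passed filtering.
    A fragment counts once if at least one of its reads survives.
    """
    reads = sum(int(r1) + int(r2) for r1, r2 in pairs)
    fragments = sum(1 for r1, r2 in pairs if r1 or r2)
    return reads, fragments


# 500 reads on a 2 kb gene in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0

toy_pairs = [(True, True), (True, False), (False, False)]
print(count_reads_and_fragments(toy_pairs))  # (3, 2)
```

In the toy pairs, 3 reads survive from 2 fragments, which falls between 1 and 2 reads per fragment as described above.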

You may encounter these problems when you want to perform genome/transcript assembly:

What Is a Contig?

Assembly software splices reads together based on the overlapping regions between them, and each sequence obtained by this splicing is called a contig.
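The overlap idea can be shown with two reads. This is a deliberately minimal sketch that assumes error-free reads on the same strand and requires an exact suffix/prefix match of at least `min_overlap` bases; real assemblers are far more elaborate. The function names and read sequences are invented.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads through their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


print(merge_reads("ATCGCTA", "GCTAGGT"))  # ATCGCTAGGT
print(merge_reads("AAAA", "CCCC"))        # None
```

The two reads share the four bases GCTA, so they merge into a single 10-base contig.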

What Is Contig N50?

Splicing the reads produces contigs of different lengths. Adding all the contig lengths together gives the total contig length. Sort the contigs from longest to shortest (Contig 1, Contig 2, Contig 3, ..., Contig 25), then add their lengths in this order. When the running sum reaches half of the total contig length, the length of the last contig added is the Contig N50.

For example, if Contig 1 + Contig 2 + Contig 3 + Contig 4 = total contig length × 1/2, then the length of Contig 4 is the Contig N50. Contig N50 can be used as a criterion for judging the quality of a genome assembly.
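The N50 procedure translates directly into code. The sketch below follows the steps as described (sort from longest to shortest, accumulate, stop at half the total); the contig lengths are made-up numbers.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig (or scaffold) lengths."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            # The contig that pushes the running sum to half the total.
            return length
    raise ValueError("lengths must not be empty")


# Total length is 300, so half is 150; 80 + 70 reaches it at the second contig.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

The same function works unchanged for scaffold lengths, since the calculation only depends on the list of lengths.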

What Is a Scaffold?

After de novo genome sequencing, once contigs have been obtained by splicing reads, it is often necessary to construct 454 paired-end libraries or Illumina mate-pair libraries to obtain the sequences at both ends of fragments of known size (e.g., 3 kb, 6 kb, 10 kb, 20 kb). These end sequences make it possible to determine the order of some contigs, and contigs whose order is known form a scaffold.

What Is Scaffold N50?

Scaffold N50 is defined in the same way as Contig N50. Splicing and assembling contigs produces scaffolds of different lengths, and adding the lengths of all scaffolds gives the total scaffold length. The scaffolds are then sorted from longest to shortest and their lengths added in this order. When the running sum reaches half of the total scaffold length, the length of the last scaffold added is the Scaffold N50, which is also a criterion for judging the quality of a genome assembly.

What Is Genome Annotation?

Genome annotation is the high-throughput annotation of the biological functions of all genes in a genome using bioinformatics methods and tools, and it is a hot topic in current functional genomics research. It covers gene identification and gene function annotation. The core of gene identification is to determine the exact location of every gene in the whole genome sequence.

Genome annotation analysis includes the following main aspects:

(1) Repeat sequence prediction. By comparison against databases of known repetitive sequences, the repeats contained in the sequence are located, their types identified, and the repeats masked by converting them to N or X. The distribution of the various types of repetitive sequences is also tallied.

(2) Coding gene prediction. The structure of coding genes is predicted by aligning transcriptome or EST data to the assembled genome sequence to locate the coding genes. Alternatively, the exon structure of coding genes can be predicted with specialized exon prediction software.

(3) Prediction of small RNA genes. Small RNA genes are identified and categorized by comparison against databases of known small RNAs, or predicted with bioinformatics software.

(4) Prediction of regulatory sequences and pseudogenes.

Gene function is annotated using databases such as NT/NR, Swiss-Prot/TrEMBL, InterPro, KEGG, COG, and Gene Ontology: homologous and similar genes are identified by sequence comparison and functionally annotated on that basis.

「What are the databases used for gene annotation?」

「(1) NR/NT database」

NR and NT are among the most commonly used databases at NCBI. NR: the Non-Redundant Protein Sequence Database, which includes all the non-redundant protein sequences in GenBank+EMBL+DDBJ+PDB. It is cross-indexed on the basis of nucleic acid sequences, linking nucleic acids to proteins. For known or possible coding sequences, the NR records give the corresponding amino acid sequences (inferred from the reading frame). NT: the nucleotide (nucleic acid) sequence database that accompanies NR.

Both NR and NT can be searched online with BLAST through NCBI.

「(2) Swiss-Prot:」A reviewed and manually annotated protein database in which all sequences have been verified by scientists against the literature. Swiss-Prot provides detailed protein sequence and functional information, such as functional descriptions, domain structure, post-translational modifications, modification sites, variants, and secondary structure. It also links to other databases, including sequence databases, three-dimensional structure databases, 2-D gel electrophoresis databases, and protein family databases.

Swiss-Prot has now been merged into the UniProt database, where it, TrEMBL, and PIR-PSD constitute the three main UniProt source databases.

「(3) COG:」Clusters of Orthologous Groups of proteins. In this database, the proteins that make up each COG are assumed to be derived from a single ancestral protein.

COG comes in two types, one for prokaryotes and one for eukaryotes. The prokaryotic version is generally called the COG database; the eukaryotic version is generally called the KOG database.

「(4) KEGG:」Kyoto Encyclopedia of Genes and Genomes, an integrated database dealing with the links between genomes, biological pathways, diseases, drugs and chemicals.

The centerpiece is the KEGG Pathway database, which is organized into 3 tiers:

Tier 1: biological pathways are grouped into 7 broad categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development;

Tier 2: the 7 categories in Tier 1 are further subdivided;

Tier 3: corresponds directly to individual KEGG pathways, and each pathway is annotated with the genes involved in the process.

「(5) GO:」Gene Ontology (GO) is a gene ontology database. The most basic concept in GO is the "term", which describes the characteristics of genes and gene products. In effect, the GO database labels each gene so that researchers can quickly find target genes through those labels.

In GO analysis, all results are organized and classified under the following 3 top-level categories:

Cellular Component (CC): describes subcellular structures, locations, and macromolecular complexes, such as the nucleolus, telomeres, and the origin recognition complex;

Biological Process (BP): refers to ordered combinations of molecular functions that achieve broader biological functions, such as mitosis or purine metabolism;

Molecular Function (MF): describes the function of genes and gene products, e.g., carbohydrate binding or ATP hydrolase activity.



Reference source: [For Beginners] High-throughput sequencing common terms summary, minutes to help you solve the problem - (zhihu.com)
