Reading the Source Code of Life
Generated using MidJourney


Welcome to Part I of a multi-part #DIYgenomics project. Here, I build upon my previous article and present a #machinelearning-based variant calling pipeline to annotate and predict genetic variants in my own genome. My goal is to read and analyze my genetic "source code", which is constantly "interpreted" and "executed" by the molecular machinery in my cells. This code execution drives the biological processes that define who I am, so of course I am quite excited to see what is under the hood. But the decision to do this required some introspection.

The Pandora's Box: Facing Uncomfortable Truths

I had to weigh the pros and cons of sequencing my entire genome, because having such a deep knowledge can potentially come at a great cost. It can open a Pandora's Box of uncomfortable truths that can profoundly impact one's life, and there is no going back.

It is not difficult to imagine a dystopian world where our genetic source code determines our socio-economic standing. Having this knowledge raises ethical questions about privacy and discrimination. Genetic information can be (and arguably already is) misused by employers, insurance companies, or other parties to discriminate against individuals based on their genetic predispositions. One only needs to look at the horrific eugenics movements in the prior century, or the movie GATTACA (which is great by the way!) to see how this knowledge could be perverted and abused.

In GATTACA, there is a genetic registry database that society uses to classify individuals as "valid" or "invalid" based on their genetic code. "Valids" can go on to become doctors, lawyers, astronauts, and so on, while "invalids" are relegated to dead-end jobs like being janitors. Image from https://www.youtube.com/watch?v=0FQdzUxFc8A

In addition, if I learned that I carry a gene for a severe or terminal condition, the decision to have children could become a difficult and deeply personal ethical dilemma. The knowledge of one's genetic makeup can transform family planning and introduce new complexities to the process. It may also impact my family members, who may face similar genetic risks or be carriers for certain diseases. That could cause great guilt or fear, as I must grapple with the decision of whether or not it is right to disclose this information.

We may well be headed towards "GATTACA" or a "Brave New World", but even then, I don't think it makes sense to live in fear. What people often forget is that our genetic code is not completely deterministic. Even in GATTACA, the protagonist defies his pre-ordained fate: despite a genetic code that society deems "invalid", he exceeds all expectations through his unrelenting drive and willpower to live and achieve his dreams. If there are any genetic defects in my genome, I refuse to let them hold me back. While I can't control everything that happens to me, I can still exert some influence through my environment and life choices. In addition, understanding my genetic traits can lead to great self-discovery and personal growth, as I gain unique insights into the characteristics, strengths, and weaknesses that make me who I am. And if I discover I am at risk of a disease, that foresight gives me an opportunity to treat it proactively or take actions to mitigate the risk. Knowledge is power. After weighing the pros and cons, I decided to proceed despite the risks.

The advent of next-generation sequencing (NGS) technologies has revolutionized the field of genomics and enabled researchers to obtain vast amounts of sequencing data at an unprecedented scale. However, the analysis and interpretation of NGS data require robust bioinformatics pipelines to ensure the accuracy and reliability of the results. In essence, quality control is critical; otherwise, it is "garbage in, garbage out".

One of the key applications of NGS is the identification of genomic variants, which has implications in various fields, such as personalized medicine, population genetics, and evolutionary biology.


Also, if you want to follow along with your own data, I have embedded the commands I ran on my own computer at each step, so if you have a Mac, you can reproduce the whole pipeline yourself.

Step 1: Getting my genome sequenced

The first obvious step is to get my whole genome sequenced. In the consumer whole genome sequencing market, there are quite a few options available, and it doesn't matter much which provider you choose. I found that Dante Labs had the best turn-around time, and they sometimes offer decent discounts. At the time I bought the whole genome sequencing kit, they had a promotional sale of around 350 Euros for sequencing at 30X read depth, which was a decent deal.

Dante Labs offers a whole genome sequencing service. You can use them, or don't; just find a provider at a good price with decent reviews.

My sample collection method used a blood collection kit, but I believe they currently use a saliva collection kit. I am a bit biased, but I think the blood collection kits are better: you are less likely to get unwanted bacterial contamination in the sequenced sample, which makes your sample less likely to fail quality control. There is at least some scientific evidence that this is true. It is annoying that they don't offer the blood collection kits anymore, but oh well.

The blood collection kit from Dante Labs worked very well. It is unfortunate that they stopped using it as of 06.05.2023. Source: https://www.dantelabs.com/pages/tasso-device

I waited about 6 weeks, and then finally got an email notification that I could download my sequence data. It was a bit disappointing that the genome was aligned to an old, outdated build of the reference genome, so it was time to make my own pipeline from scratch.

Step 2: Raw read quality assessment

The first step was to make sure that the read quality was decent. I used a tool called FastQC to do some quality assessment of the reads. The image below shows the distribution of quality scores at each base position across the short reads. A score of 20 means that there is a 1% chance that the machine called an incorrect base at that position. I installed FastQC on my MacBook Pro using Homebrew (an awesome Linux-style package manager that I HIGHLY recommend if you have a Mac) and ran it with this command:

# Install Homebrew, then FastQC
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install fastqc
fastqc -t 9 /path/to/my/reads1.fastq.gz # -t 9: run with 9 threads
A quality assessment of my raw sequencing data
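
To make the Phred scale concrete: a quality score Q corresponds to an error probability of P = 10^(-Q/10). Here is a quick sketch (plain awk, nothing specific to this pipeline) that prints the error rate for a few common scores:

# Convert Phred quality scores to base-calling error probabilities
for Q in 10 20 30 40; do
  awk -v q="$Q" 'BEGIN { printf "Q%d -> %.3f%% chance of a wrong base call\n", q, 100 * 10^(-q/10) }'
done

Running it confirms the rule of thumb above: Q20 is a 1% error rate, and Q30 is 0.1%.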

Except for a couple of warnings, the quality of the sequencing data I got from Dante Labs was pretty good. However, one prominent red flag came up, shown below:

FastQC flagged the "Per base sequence content" module with a warning.

This warning says that there is a bias in the sequence content at the beginning and end of the reads. According to a helpful forum post, it seems to be a consequence of the chemistry used to prepare the DNA samples for sequencing, and in theory it shouldn't have a huge impact. Still, it would be nice if Dante Labs used a better method to prepare my DNA samples for sequencing.

If you want to see what the full quality control file looks like, check below:

So while I was overall satisfied with the data, it was still a good idea to trim the reads a bit: some bases at the read ends had quality values that fell below the recommended threshold.

I proceeded with the trimming, using settings recommended for Illumina paired-end reads:

brew install bbtools

# bbduk.sh flags:
#   -Xmx28g    set the maximum amount of memory to use to 28 gigabytes
#   ktrim=r    trim adapter sequence from the right (3') end of reads
#   k=23       k-mer length to use for finding adapters
#   mink=11    allow shorter k-mers (down to 11) at the read ends
#   hdist=1    maximum Hamming distance when matching k-mers to adapters
#   tpe        trim both reads of a pair to the same length
#   tbo        also trim adapters detected by read-pair overlap
#   qtrim=rl   quality-trim read ends from both left and right
#   trimq=20   trim bases with quality below 20
#   minlen=36  discard reads shorter than 36 bases after trimming
#   threads=10 number of threads to use for processing
bbduk.sh -Xmx28g \
in1='/Volumes/T7/Genomics/DIY_Alignment/FastQ_reads/Raw_Reads/TSAA1017_SA_L001_R1_001.fastq.gz' \
in2='/Volumes/T7/Genomics/DIY_Alignment/FastQ_reads/Raw_Reads/TSAA1017_SA_L001_R2_001.fastq.gz' \
out1='/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R1.fastq.gz' \
out2='/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R2.fastq.gz' \
ref='/Volumes/T7/Genomics/antonio_analysis_v2/adapters/all_adapters.fa' \
ktrim=r k=23 mink=11 hdist=1 tpe tbo \
qtrim=rl trimq=20 minlen=36 threads=10
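
The before-and-after comparison below comes from re-running FastQC on the trimmed files, with the same invocation as before:

# Re-run FastQC on the trimmed reads to verify the improvement
fastqc -t 9 \
/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R1.fastq.gz \
/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R2.fastq.gz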

I got the following results, which looked pretty good!

Post-trimming FastQC results: per-base quality scores and per-base sequence content.

The average read quality at almost every base position was very good now, but the per base sequence content at the ends was still suboptimal. In theory, though, it shouldn't cause a lot of problems, so it should be safe to ignore. Again, you can see the results in more detail here:

Step 3: Aligning the reads to a reference sequence: Shredding an encyclopedia and putting it back together

So now that things looked pretty good, I was ready to perform the alignment. Creating a high-quality alignment is the first step in the variant calling pipeline. This step is crucial for the effective identification of genomic variants and is typically composed of multiple stages, including quality control, alignment, duplicate removal, base quality recalibration, and filtering. When analyzing NGS data, one of the most widely used file formats for storing aligned sequence data is the Binary Alignment/Map (BAM) format. BAM files are binary, compressed representations of the Sequence Alignment/Map (SAM) format, which efficiently store the alignment information of reads against a reference genome. There are many different aligners that can align reads to a reference genome, and while their methodologies differ, they roughly follow a similar process.
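
If you've never looked inside one of these files, here is a quick way to peek at a BAM with samtools (the file name aligned.bam is just a stand-in; we create the real one below):

samtools view -H aligned.bam | head     # header: @SQ lines list the reference contigs
samtools view aligned.bam | head -n 3   # first records: read name, FLAG, position, MAPQ, CIGAR, sequence...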

This is a simplified explanation, but imagine that you have a fabulous rainbow sequence, as shown below. During the sample preparation, it is cut into small pieces, and each small piece is sequenced. The software needs to somehow align these smaller pieces back together in order to recreate the original sequence.

Source: https://old.abmgood.com/marketing/knowledge_base/next_generation_sequencing_data_analysis.php

To give you a better idea of how crazy this process is, imagine shredding an encyclopedia and putting it back together one word at a time. That is basically what we are doing. It is important to keep in mind that this method does have some important drawbacks. For example, in regions of my genome where there are large repetitive sequences, the aligner will struggle to properly place the reads, so it may compress those regions and I will miss any variants that might be in there. This can be somewhat mitigated by combining my short reads with longer reads from newer technologies like PacBio or Nanopore, which I might try doing in the future.

Performing the alignment

I decided to go with the gold standard, which is to align my raw reads to the GRCh38 reference genome using an aligner called BWA. It was written by Heng Li, a very famous bioinformatician who knows more about this field than I ever will. Also, if you're interested in why I chose this particular build of the genome for the alignment (GRCh38, the "no alt" analysis set), you can read more about it here.

brew install bwa samtools pigz # install the tools

# download the GRCh38 "no alt" analysis set and decompress it
curl -L ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz \
-o /Volumes/T7/Genomics/antonio_analysis_v2/reference/CRhg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz \
&& pigz -d /Volumes/T7/Genomics/antonio_analysis_v2/reference/CRhg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

# assign the variables
REFERENCE_GENOME="/Volumes/T7/Genomics/antonio_analysis_v2/reference/CRhg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
FASTQ_FILE1="/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R1.fastq.gz"
FASTQ_FILE2="/Volumes/T7/Genomics/antonio_analysis_v2/1-Read-Trimming/output/trimmed_R2.fastq.gz"
OUTPUT="/Volumes/T7/Genomics/antonio_analysis_v2/2-Alignment/output_aligned.bam"

# index the reference genome: bwa mem requires BWA's own index files
# (this takes a while for a human genome), and samtools faidx builds
# the .fai index used by downstream tools
bwa index "$REFERENCE_GENOME"
samtools faidx "$REFERENCE_GENOME"

# perform the alignment:
#   -t 9  number of threads to use
#   -M    mark shorter split hits as secondary (for compatibility with downstream tools)
#   -R    read group tags identifying my sample, platform, and library
bwa mem -t 9 -M \
-R "@RG\tID:group1\tPL:illumina\tSM:antonio\tLB:truseq\tPU:unit1" \
"$REFERENCE_GENOME" \
"$FASTQ_FILE1" \
"$FASTQ_FILE2" | samtools view -bS - > "$OUTPUT" # pipe the unsorted alignments into a compressed BAM file

I have a 14-inch 10-core MacBook Pro M1 Max with 32 GB of memory. This machine is a beast, but even on this computer, the alignment took 7.2 hours. It finally finished, though.
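
Before moving on, it's worth sanity-checking the new BAM file. A minimal sketch using standard samtools subcommands:

samtools quickcheck "$OUTPUT" && echo "BAM looks intact" # verify the file isn't truncated
samtools flagstat "$OUTPUT" # summary counts: total reads, % mapped, % properly paired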

Now we are ready to sort the alignment, do quality control to remove duplicates and poorly mapped reads, and perform base quality score recalibration to correct for systematic errors introduced by the Illumina sequencing machine. I'll explain all of these things in another post.
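
If you want a head start, the sorting step is a one-liner with samtools (a minimal sketch; the output path here is just illustrative, and the duplicate removal and recalibration steps will follow in the next post):

samtools sort -@ 9 -o /Volumes/T7/Genomics/antonio_analysis_v2/2-Alignment/output_sorted.bam "$OUTPUT" # coordinate-sort the alignments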

In the meantime, if you need some help with genomics or sequence analysis, feel free to get in touch. In a couple of weeks, there will be an option to do some small-scale wet lab experiments, such as RNA-seq, in case you are looking to get some proof-of-concept data that can push you over the edge and get you into a synthetic biology accelerator program like #indiebio.

Thanks for reading!

Antonio
