A Beginner's Guide to BioPython
Wait…
Is it a programming language, a library or a package? Yup, confusion arises when we read the term BioPython. What's BioPython exactly?
- It's a library we can use to analyze bioinformatics data!
Let’s grab a few code snippets in BioPython.
#Importing Biopython
import numpy as np
import pandas as pd
import Bio
print("Biopython v" + Bio.__version__)
Where do we use BioPython? Yes, you got it right! Mostly it’s used with sequences.
#to play with sequences, create a Seq object
from Bio.Seq import Seq
my_seq = Seq("ATGACACTGGT")
print(my_seq)
#Basic operations that can be done on a plain sequence:
print(my_seq + " - Sequence")
print(my_seq.complement() + " - Complement")
print(my_seq.reverse_complement() + " - Reverse Complement")
#Output:
ATGACACTGGT - Sequence
TACTGTGACCA - Complement
ACCAGTGTCAT - Reverse Complement
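To see what complement() and reverse_complement() are doing, here is a minimal plain-Python sketch of the same idea. The helper names and the translation table below are made up for illustration and only handle the four unambiguous DNA letters; this is not how Biopython implements it, and Biopython's own methods cover more cases.

```python
# Illustrative helpers (not Biopython's implementation): handle only A, C, G, T.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence, then read it backwards."""
    return complement(dna)[::-1]


seq = "ATGACACTGGT"
print(complement(seq))          # TACTGTGACCA
print(reverse_complement(seq))  # ACCAGTGTCAT
```

The reverse complement is what you would read on the opposite strand of the DNA, which is why it shows up so often when working with sequences.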
Technical Buzz:
1. SeqRecord object :
It holds the sequence along with extra information such as an identifier, the name of the sequence and its description.
2. Bio.SeqIO module :
This module is used for reading and writing sequence file formats.
3. Files :
It's where most biological data is stored. Parsing such files into a format a program can work with is an interesting task.
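The three ideas above fit together in a few lines. The sketch below builds a SeqRecord, writes it out in FASTA format with Bio.SeqIO, and parses it back. The id, name and description values are invented for the example, and an in-memory StringIO stands in for a real file so nothing has to exist on disk.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A SeqRecord wraps a Seq with an identifier, a name and a description.
# The id, name and description values here are invented for the example.
record = SeqRecord(
    Seq("ATGACACTGGT"),
    id="example_1",
    name="example",
    description="made-up sequence for illustration",
)

# Write the record in FASTA format to an in-memory "file".
handle = StringIO()
SeqIO.write([record], handle, "fasta")
fasta_text = handle.getvalue()
print(fasta_text)

# Parse the FASTA text back into SeqRecord objects.
parsed = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in parsed:
    print(rec.id, len(rec.seq), rec.seq)
```

With a real file, you would pass a file name or an open file handle to SeqIO.parse instead of the StringIO object.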
A few basic cupcakes with Biopython:
a. len(sequences)
#let's consider the following sequence :
chr1 = 'Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc'
print(len(chr1))
#output:
54
b. access elements of the sequence in the same way as for strings
print("First Letter: " + chr1[0])
#output:
First Letter: C
c. count() method:
print("AGCTAACT".count("AA"))
print(Seq("AAAA").count("AA"))
print("AATATATA".count("T"))
#output :
1
2
3
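Note that count() counts non-overlapping matches, which is why "AAAA" contains "AA" only twice. If overlapping matches matter, a small helper can slide along the sequence one position at a time. count_overlapping below is a hypothetical helper written for this guide; Biopython's Seq also has a count_overlap() method for the same purpose.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Move one position forward so overlapping matches are found too.
        start = sequence.find(sub, start + 1)
    return count


print("AAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (overlapping)
```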
d. GC content: a higher GC content generally makes the DNA molecule more stable.
#manually calculating the GC content
print("GgCcSs%:\t" + str(100 * float((chr1.count("G") + chr1.count("g") + chr1.count("C") + chr1.count("c") + chr1.count("S") + chr1.count("s") ) / len(chr1) ) ))
#output:
GgCcSs%: 42.592592592592595
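#using the in-built function (GC lives in Bio.SeqUtils; newer Biopython releases provide gc_fraction instead, which returns a fraction rather than a percentage):
from Bio.SeqUtils import GC
print("GC% Package:\t" + str(GC(chr1)))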
#output:
GC% Package: 42.592592592592595
Why do we have ‘S’ in the manual calculation of GC content?
The sequence mixes uppercase G/C and lowercase g/c characters, so both cases have to be counted. In addition, S and s are ambiguity codes that stand for "G or C", and the package counts them towards GC content by default, so the manual calculation counts them too.
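The manual calculation is easier to read as a small function. This is a plain-Python sketch, not the package's code: gc_percent is a name chosen for this guide, and it assumes that G, C and S should be counted regardless of case, as described above.

```python
def gc_percent(sequence):
    """Percentage of G, C and S (G-or-C) letters, ignoring case."""
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    upper = sequence.upper()
    gc_count = sum(upper.count(letter) for letter in "GCS")
    return 100 * gc_count / len(sequence)


chr1 = 'Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc'
print(round(gc_percent(chr1), 2))  # 42.59
```

Converting to uppercase once keeps the counting down to three letters instead of six.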
e. Slicing Sequences:
print(chr1[4:12])
#output:
aatgcacg
f. SHORT sequences:
We can work with a short subset of chr1, since we don’t want to print millions of characters. It is just slicing, as in the previous step.
chr1SHORT = chr1[0:20]
print("Short chr1: " + chr1SHORT)
#a codon is three bases long, so we step through the sequence in threes
print("Codon Pos 1: " + chr1SHORT[0::3])
print("Codon Pos 2: " + chr1SHORT[1::3])
print("Codon Pos 3: " + chr1SHORT[2::3])
#output:
Short chr1: Cgacaatgcacgacagagga
Codon Pos 1: Cctaagg
Codon Pos 2: gagccaa
Codon Pos 3: aacgag
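A codon is three bases long, so another way to look at codon positions is to cut the sequence into triplets first. The sketch below uses a hypothetical helper, split_codons, and assumes the reading frame starts at the first letter; any incomplete triplet at the end is dropped.

```python
def split_codons(sequence):
    """Cut a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


short = "Cgacaatgcacgacagagga"  # the first 20 letters of chr1
codons = split_codons(short)
print(codons)

# The first base of every complete codon, assuming the frame starts at index 0.
first_bases = "".join(codon[0] for codon in codons)
print(first_bases)
```

Because the 20-letter sequence is not a multiple of three, the last two letters do not form a full codon and are left out here.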
g. Reversing the sequence using -1 slicing :
print("Reversed: " + chr1SHORT[::-1])
#output:
Reversed: aggagacagcacgtaacagC
h. Concatenating Sequences :
It’s as simple as combining strings in Python.
seq1 = 'ATGACTTTATAT'
seq2 ='ATGCGCTGCTTT'
concatenated_seq = seq1 + seq2
print(concatenated_seq)
#output:
ATGACTTTATATATGCGCTGCTTT
How do we add many sequences together?
Whenever a task has to be repeated for many items, we can rely on a “for loop”.
list_of_seqs = ["ATTA", "AGCT", "CTAT"]
concatenated = Seq("")
for s in list_of_seqs:
    concatenated += s
print(concatenated)
#output:
ATTAAGCTCTAT
The same concatenation can be done using the built-in function ‘sum’:
list_of_seqs =["ATTA", "AGCT", "CTAT"]
print(sum(list_of_seqs, Seq("")))
#output
ATTAAGCTCTAT
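The Seq("") start value matters: sum() begins from 0 by default, and Python deliberately refuses to sum plain strings. For ordinary strings the usual tool is str.join, sketched below with the same list.

```python
list_of_seqs = ["ATTA", "AGCT", "CTAT"]

# str.join is the usual way to glue plain strings together.
joined = "".join(list_of_seqs)
print(joined)  # ATTAAGCTCTAT

# sum() with a string start value raises TypeError by design.
try:
    sum(list_of_seqs, "")
except TypeError as error:
    print("sum() refused:", error)
```

If you need a Seq object afterwards, the joined string can be wrapped with Seq(joined).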
i. Changing Cases:
dna_seq = Seq("gctATATA")
print("Original: " + dna_seq)
print("Upper: " + dna_seq.upper())
print("Lower: " + dna_seq.lower())
#output:
Original: gctATATA
Upper: GCTATATA
Lower: gctatata
This is just a small glimpse of what you can achieve with BioPython.
BioPython’s flexibility and wide range of features make it a powerful tool for anyone involved in computational biology or bioinformatics. To explore more, visit the official BioPython documentation and try out the tutorials and examples given. As with any new tool, the best way to learn is by doing. Happy coding!