A Beginner's Guide to BioPython

Wait…

Is it a programming language, a library, or a package? The name BioPython can be confusing at first. So what exactly is BioPython?

It’s a Python library that we can use to analyze bioinformatics data!

Let’s look at a few code snippets that use BioPython.

#Importing BioPython and printing the installed version

import Bio
print("Biopython v" + Bio.__version__)        

Where do we use BioPython? You guessed it: it is mostly used with sequences.

#To work with sequences, create a Seq object

from Bio.Seq import Seq
my_seq = Seq("ATGACACTGGT")
print(my_seq)        

#Basic operations that can be done on a plain sequence:

print(my_seq + " - Sequence")
print(my_seq.complement() + " - Complement")
print(my_seq.reverse_complement() + " - Reverse Complement")
#Output:
ATGACACTGGT - Sequence
TACTGTGACCA - Complement
ACCAGTGTCAT - Reverse Complement        
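To see what these two operations do under the hood, here is a minimal plain-Python sketch. The names COMPLEMENT_TABLE, complement, and reverse_complement are made up for illustration and are not BioPython's own implementation. It assumes an unambiguous DNA string containing only A, C, G, and T.

```python
# Hand-rolled complement for an unambiguous DNA string (illustrative only).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then read it backwards."""
    return complement(dna)[::-1]


print(complement("ATGACACTGGT"))          # TACTGTGACCA
print(reverse_complement("ATGACACTGGT"))  # ACCAGTGTCAT
```

In practice the Seq methods are the better choice, since they also handle ambiguous letters.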

Technical Buzz:

1. Sequence record [SeqRecord]:

It holds the sequence together with extra information such as an identifier, the name of the sequence, and its description.

2. Bio.SeqIO module:

This module reads and writes sequence file formats.

3. Files:

This is where most biological data is stored. Parsing such files into a usable format is an interesting task.
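The three terms above fit together roughly as in the sketch below: a SeqRecord wraps a Seq, and Bio.SeqIO writes it out and parses it back. The id, name, and description values are invented for illustration, and an in-memory StringIO buffer stands in for a real FASTA file.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Wrap a sequence with an identifier, a name, and a description.
# These metadata values are made up for this example.
record = SeqRecord(
    Seq("ATGACACTGGT"),
    id="example_1",
    name="example",
    description="toy sequence",
)

# Write the record as FASTA into an in-memory text buffer instead of a file.
buffer = StringIO()
SeqIO.write(record, buffer, "fasta")
fasta_text = buffer.getvalue()
print(fasta_text)

# Parse the FASTA text back into SeqRecord objects.
parsed = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in parsed:
    print(rec.id, len(rec.seq))
```

With a real file, the same calls should work if you pass a file name such as "my_sequences.fasta" (a hypothetical path) in place of the buffer.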

A few basic cupcakes with BioPython:

a. len(sequences)

#let's consider the following sequence :
chr1 = 'Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc'
len(chr1)
#output:
54        

b. access elements of the sequence in the same way as for strings

print("First Letter: " + chr1[0])

#output:
First Letter: C        

c. count() method:

print("AGCTAACT".count("AA"))
print(Seq("AAAA").count("AA"))
print("AATATATA".count("T"))
#output :
1
2
3        
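Note that count() only counts non-overlapping matches, which is why "AA" is found twice in "AAAA" and not three times. If overlapping matches matter, one possible approach is the small plain-Python helper below. The name count_overlapping is made up for this sketch.

```python
def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Move one character forward so overlapping matches are found too.
        start = text.find(pattern, start + 1)
    return count


print("AAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (overlapping)
```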

d. GC content: the proportion of G and C bases, which influences how stable the molecule is.

#manually calculating the GC content

print("GgCcSs%:\t" + str(100 * float((chr1.count("G") + chr1.count("g") + chr1.count("C") + chr1.count("c") + chr1.count("S") + chr1.count("s") ) / len(chr1) ) ))
#output:
GgCcSs%:	42.592592592592595

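#using the built-in function (gc_fraction returns a fraction between 0 and 1;
#older Biopython releases offered GC(), which returned a percentage directly):
from Bio.SeqUtils import gc_fraction
print("GC% Package:\t" + str(100 * gc_fraction(chr1)))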

#output:
GC% Package:	42.592592592592595        

Why do we have ‘S’ in the manual calculation of GC content?

The sequence can contain capital G/C characters and lowercase g/c characters. In addition, there are S and s characters, which represent an ambiguous base that is either G or C. The package counts these toward the GC content by default, so the manual calculation counts them too.
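The long manual expression can be made shorter by upper-casing the sequence first. The helper below is a minimal sketch of that idea. The name gc_percent is made up for illustration, and the function only mirrors the counting rule described here (G, C, and S over the full length), not every option the package offers.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G, C, and S letters, ignoring case."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_letters = sum(upper.count(base) for base in "GCS")
    return 100 * gc_letters / len(upper)


chr1 = "Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc"
print(round(gc_percent(chr1), 2))  # 42.59
```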

e. Slicing Sequences:

print(chr1[4:12])
#output:
aatgcacg        

f. Short sequences:

We can work with a short subset of chr1, since we don’t want to print millions of characters. Taking the subset works just like slicing a sequence. Because a codon is three letters long, slicing with a step of 3 picks out each codon position.

chr1SHORT = chr1[0:20]
print("Short chr1: " + chr1SHORT)
print("Codon Pos 1: " + chr1SHORT[0::3])
print("Codon Pos 2: " + chr1SHORT[1::3])
print("Codon Pos 3: " + chr1SHORT[2::3])
#output:
Short chr1: Cgacaatgcacgacagagga
Codon Pos 1: Cctaagg
Codon Pos 2: gagccaa
Codon Pos 3: aacgag        
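A related task is cutting a sequence into whole codons. Since a codon is three letters long, one way to do it is to slice in steps of three, as in this plain-Python sketch. The name split_codons is made up for illustration, and dropping the incomplete trailing letters is a design choice of this example.

```python
def split_codons(sequence: str) -> list[str]:
    """Split a sequence into consecutive three-letter codons.

    Trailing letters that do not fill a whole codon are dropped.
    """
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


short = "Cgacaatgcacgacagagga"  # the first 20 letters of chr1
codons = split_codons(short)
print(codons)  # ['Cga', 'caa', 'tgc', 'acg', 'aca', 'gag']
```

The last two letters ("ga") do not form a full codon, so they are left out.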

g. Reversing the sequence using -1 slicing :

print("Reversed: " + chr1SHORT[::-1])
#output:
Reversed: aggagacagcacgtaacagC        

h. Concatenating sequences:

It’s as simple as combining strings in Python.

seq1 = 'ATGACTTTATAT'
seq2 ='ATGCGCTGCTTT'
concatenated_seq = seq1 + seq2
print(concatenated_seq)
#output:
ATGACTTTATATATGCGCTGCTTT        

How do we add many sequences together?

Whenever a task has to be repeated for many items, we can rely on a for loop.

list_of_seqs = ["ATTA", "AGCT", "CTAT"]
concatenated = Seq("")
for s in list_of_seqs:
    concatenated += s
print(concatenated)
#output:
ATTAAGCTCTAT        

The same concatenation can be done using the built-in function sum:

list_of_seqs =["ATTA", "AGCT", "CTAT"] 
print(sum(list_of_seqs, Seq("")))
#output
ATTAAGCTCTAT        

i. Changing case:

dna_seq = Seq("gctATATA")
print("Original: " + dna_seq)
print("Upper: " + dna_seq.upper())
print("Lower: " + dna_seq.lower())
#output:
Original: gctATATA
Upper: GCTATATA
Lower: gctatata        

This is just a small glimpse of what you can achieve with BioPython.

BioPython’s flexibility and wide range of features make it a powerful tool for anyone involved in computational biology or bioinformatics. To explore more, visit the official BioPython documentation and try out the tutorials and examples given. As with any new tool, the best way to learn is by doing. Happy coding!

Reference:

Biopython Tutorial and Cookbook

The Biopython Project
