
A Beginner's Guide to BioPython

Sandhiya Ravi, Ph.D

Postdoctoral Researcher

Published: July 4, 2023

Wait…

Is it a programming language, a library, or a package? The term BioPython can be confusing at first. So what exactly is BioPython?

- It's a Python library we can use to analyze bioinformatics data!

Let's look at a few code snippets in BioPython.

#Importing BioPython

import numpy as np 
import pandas as pd 
import Bio
print("Biopython v" + Bio.__version__)

Where do we use BioPython? You guessed it: it's mostly used with sequences.

#To work with sequences, create a Seq object

from Bio.Seq import Seq
my_seq = Seq("ATGACACTGGT")
print(my_seq)

#Basic operations that can be done on a plain sequence:

print(my_seq + " - Sequence")
print(my_seq.complement() + " - Complement")
print(my_seq.reverse_complement() + " - Reverse Complement")
#Output:
ATGACACTGGT - Sequence
TACTGTGACCA - Complement
ACCAGTGTCAT - Reverse Complement
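
To see what these two methods compute, here is a plain-Python sketch of the same base-pairing logic. It is only an illustration and not Biopython's implementation: the helper names `complement` and `reverse_complement` are made up for this example, and only unambiguous DNA letters are handled.

```python
# Plain-Python sketch (not Biopython's code): pair each base with its partner.
# Only unambiguous DNA letters are handled; other characters pass through unchanged.
PAIRS = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna):
    """Swap A<->T and C<->G, keeping the original order and case."""
    return dna.translate(PAIRS)

def reverse_complement(dna):
    """Complement the sequence, then read it backwards."""
    return complement(dna)[::-1]

print(complement("ATGACACTGGT"))          # TACTGTGACCA
print(reverse_complement("ATGACACTGGT"))  # ACCAGTGTCAT
```

The reverse complement is what you would read on the opposite DNA strand, which is why the complement is reversed as a final step.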

Technical Buzz:

1. Sequence Record [SeqRecord]:

It holds the sequence along with extra information such as an identifier, the name of the sequence, and its description.

2. Bio.SeqIO module:

This module reads and writes sequence file formats.

3. Files:

This is where most biological data is stored. Parsing such files into a usable format is an interesting task.
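
To make "parsing a file into a format" more concrete, below is a small plain-Python sketch that reads FASTA-formatted text into simple records. It illustrates the idea and is not how Bio.SeqIO is implemented: `SimpleRecord` and `parse_fasta` are hypothetical names, the sample sequences are made up, and the record only mimics the identifier, description, and sequence fields that a SeqRecord carries.

```python
from dataclasses import dataclass
import io

@dataclass
class SimpleRecord:
    """Hypothetical stand-in for the kind of data a SeqRecord holds."""
    id: str
    description: str
    seq: str

def _make_record(header, chunks):
    # The identifier is the first word of the header line.
    parts = header.split(maxsplit=1)
    record_id = parts[0] if parts else ""
    return SimpleRecord(id=record_id, description=header, seq="".join(chunks))

def parse_fasta(handle):
    """Yield one SimpleRecord per '>' header found in a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield _make_record(header, chunks)

# Made-up example data standing in for a FASTA file on disk.
fasta_text = ">seq1 first example\nATGACA\nCTGGT\n>seq2 second example\nGCTATATA\n"
records = list(parse_fasta(io.StringIO(fasta_text)))
for rec in records:
    print(rec.id, len(rec.seq), rec.seq)
```

With Biopython installed, `Bio.SeqIO.parse(handle, "fasta")` plays the role of the hypothetical `parse_fasta` and yields SeqRecord objects instead.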

A few basic cupcakes with Biopython:

a. len(sequences)

#Let's consider the following sequence:
chr1 = 'Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc'
print(len(chr1))
#output:
54

b. access elements of the sequence in the same way as for strings

print("First Letter: " + chr1[0])

#output:
First Letter: C

c. count() method:

print("AGCTAACT".count("AA"))
print(Seq("AAAA").count("AA"))
print("AATATATA".count("T"))
#output :
1
2
3
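
The second result is 2 rather than 3 because count() only counts non-overlapping matches. The plain-Python sketch below makes the difference visible; `count_overlapping` is a hypothetical helper written for this example.

```python
def count_overlapping(text, pattern):
    """Count every position where pattern starts, allowing overlaps."""
    if not pattern:
        return 0
    total, start = 0, text.find(pattern)
    while start != -1:
        total += 1
        start = text.find(pattern, start + 1)  # move one step, not len(pattern)
    return total

print("AAAA".count("AA"))               # 2 (non-overlapping: AA|AA)
print(count_overlapping("AAAA", "AA"))  # 3 (starts at index 0, 1 and 2)
```

Biopython's Seq objects also provide a count_overlap() method for this purpose.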

d. GC Content: this influences how stable the molecule will be.


#manually calculating the GC content

print("GgCcSs%:\t" + str(100 * float((chr1.count("G") + chr1.count("g") + chr1.count("C") + chr1.count("c") + chr1.count("S") + chr1.count("s") ) / len(chr1) ) ))
#output:
GgCcSs%:	42.592592592592595

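#using the built-in function (GC lives in Bio.SeqUtils; newer Biopython releases replace it with gc_fraction, which returns a fraction between 0 and 1):
from Bio.SeqUtils import GC
print("GC% Package:\t" + str(GC(chr1)))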

#output:
GC% Package:	42.592592592592595

Why do we have 'S' in the manual calculation of GC content?

The sequence can contain uppercase G/C and lowercase g/c characters. In addition, S and s are ambiguity codes that stand for "G or C", and the package counts them toward GC content by default, so the manual calculation includes them too.
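
The long manual expression above can be wrapped in a small function. This is a plain-Python sketch of the same counting rule (G, C and the ambiguity code S, in either case); `gc_percent` is a hypothetical helper, not a Biopython function.

```python
def gc_percent(sequence):
    """Percentage of G, C and S (ambiguous G-or-C) letters, ignoring case."""
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_letters = sum(1 for base in sequence.upper() if base in "GCS")
    return 100 * gc_letters / len(sequence)

chr1 = 'Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcattttc'
print(round(gc_percent(chr1), 2))  # 42.59
```

Converting to uppercase once avoids the six separate count() calls of the manual version.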

e. Slicing Sequences:

print(chr1[4:12])
#output:
aatgcacg

f. SHORT sequences:

We can use a short subset of chr1, since we don't want to print millions of characters. It is created with the same slicing syntax used for any sequence.

chr1SHORT = chr1[0:20]
print("Short chr1: " + chr1SHORT)
#codons are triplets, so each codon position is read with a step of 3
print("Codon Pos 1: " + chr1SHORT[0::3])
print("Codon Pos 2: " + chr1SHORT[1::3])
print("Codon Pos 3: " + chr1SHORT[2::3])
#output:
Short chr1: Cgacaatgcacgacagagga
Codon Pos 1: Cctaagg
Codon Pos 2: gagccaa
Codon Pos 3: aacgag
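
Because codons are three letters long, another way to look at codon positions is to cut the sequence into triplets first. The sketch below does this in plain Python; `split_codons` is a hypothetical helper, and it assumes the reading frame starts at the first letter.

```python
def split_codons(sequence):
    """Cut a sequence into consecutive triplets, dropping an incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

chr1SHORT = "Cgacaatgcacgacagagga"
codons = split_codons(chr1SHORT)
print(codons)
# The first letter of every complete codon
print("".join(codon[0] for codon in codons))
```

Dropping the incomplete tail keeps the letters collected for each codon position the same length.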

g. Reversing the sequence using -1 slicing :

print("Reversed: " + chr1SHORT[::-1])
#output:
Reversed: aggagacagcacgtaacagC

h. Concatenating Sequences:

It’s as simple as combining strings in Python.

seq1 = 'ATGACTTTATAT'
seq2 ='ATGCGCTGCTTT'
concatenated_seq = seq1 + seq2
print(concatenated_seq)
#output:
ATGACTTTATATATGCGCTGCTTT

How do we add many sequences together?

When the same task has to be repeated for many items, we can rely on a for loop.

list_of_seqs = ["ATTA", "AGCT", "CTAT"]
concatenated = Seq("")
for s in list_of_seqs:
    concatenated += s
print(concatenated)
#output:
ATTAAGCTCTAT

The same concatenation can be done using the built-in function sum:

list_of_seqs = ["ATTA", "AGCT", "CTAT"]
print(sum(list_of_seqs, Seq("")))
#output:
ATTAAGCTCTAT
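
Note that sum works here only because the start value is a Seq object. For plain Python strings, sum raises a TypeError, and repeated += can copy the growing string at every step. A common alternative, sketched here, is str.join; the variable names are just for illustration.

```python
list_of_seqs = ["ATTA", "AGCT", "CTAT"]

# join builds the result in one pass instead of creating a new string per step
concatenated = "".join(list_of_seqs)
print(concatenated)  # ATTAAGCTCTAT

# sum() refuses a plain string as its start value
try:
    sum(list_of_seqs, "")
except TypeError as error:
    print("TypeError:", error)
```

The joined string can be wrapped in Seq(...) afterwards if a Seq object is needed.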

i. Changing Cases:

dna_seq = Seq("gctATATA")
print("Original: " + dna_seq)
print("Upper: " + dna_seq.upper())
print("Lower: " + dna_seq.lower())
#output:
Original: gctATATA
Upper: GCTATATA
Lower: gctatata

This is just a small glimpse of what you can achieve with BioPython.

BioPython’s flexibility and wide range of features make it a powerful tool for anyone involved in computational biology or bioinformatics. To explore more, visit the official BioPython documentation and try out the tutorials and examples given. As with any new tool, the best way to learn is by doing. Happy coding!

Reference:

Biopython Tutorial and Cookbook

The Biopython Project
