
AI in Computational Biology (Part 1)

Devesh Rajadhyax

AI startup founder, Author of 'Decoding GPT'

Published: May 19, 2018

(This article is a reproduction of the lectures I have given in Engineering Colleges in Mumbai, for students and faculty)

In a previous article, I suggested the DTOP (Data-Technology-Objective-Possibilities) framework (see https://www.dhirubhai.net/pulse/4-hints-get-started-ai-your-company-devesh-rajadhyax) to analyse AI use cases in a company. Today I am going to use the same framework to explain some applications of AI to Computational Biology (CB).

In this article I will focus more on the data aspect, and there is a reason. Biological data is probably the most important kind of data for us, yet very few engineers have a good understanding of it. I will be very happy if this article encourages some engineers to study biological data in detail. Understanding of and experience with this data will become a much sought-after skill in the near future.

Let me first put CB use cases in DTOP.

Data: Biological data is of many types. I will mention three major types here and explain one of them in this part:

- Genetic data: data associated with our genome. There are about 3 billion letters in our genome, and we are just one of millions of species. The double-helix structure of DNA was discovered by the legendary scientists Crick and Watson in 1953. The first draft of the human genome was published in 2001. Since then, genome data has given us many insights into medicine and life in general.

- Protein data: proteins do most of the work in our body and make up much of its structure. The structure of each protein is unique and is a source of data. Understanding protein data is crucial for discovering medicines and treating patients.

- Medicine data: Medicines are chemicals, so they have a molecular structure. They interact with proteins and cause changes in what is called gene expression. All this data is valuable for inventing new drugs.

Technology: The AI technologies available to Computational Biologists are mostly Machine Learning (both Supervised and Unsupervised) and Deep Learning. The major tasks that these techniques perform in CB are pattern detection, similarity and classification, among others. I will not write much about the techniques themselves as many excellent sources are available for studying them.

Objectives: Biology, or medicine, is a subject close to our heart, and a number of objectives can be stated. However, most objectives fall into three major buckets:

- Diagnosis: Protein and genetic data can be used to detect what is wrong with a person, or even what will go wrong in the future.

- Treatment: All three types of data are needed for coming up with new medicines. As of now, data is not used much to decide the treatment plan for an individual patient, but that is the goal of personalised medicine.

- Research: In universities and labs across the globe, research is being conducted that does not have much to do with humans, but will someday lead to better medicine. Research to understand the functioning of genes common to all life is one such example. Genetic and protein data will be useful here, but not necessarily those of human beings.

Possibilities: Applying AI to the rich biological dataset can create many possibilities. It can point you to some chemicals as possible medicines, identify some genes as responsible for a function, find the root cause of a disease or flag a person as susceptible to a certain disorder. But before we understand the possibilities, we need a certain understanding of the underlying data.

Genetic data

Cell

Our body is made up of cells and nothing else. The cell is the stage for everything that happens to us and all living beings.

The cell is of course an unimaginably large source of data. But at present we are interested in only one type – the data that is encased by the cell nucleus. It is called DNA.

The DNA

DNA is a very, very large molecule that looks like a twisted rope ladder. Its rungs are made up of four types of chemicals called A, G, C and T, after their chemical names (adenine, guanine, cytosine and thymine). The DNA is packaged into a set of separate ladders called chromosomes. Human beings have 23 pairs of them, 46 in all. The sequence of A, G, C and T is our data. Let’s go ahead and see what this data means.
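To make "the sequence is our data" concrete, here is a minimal sketch in Python that treats a DNA fragment as a plain string and counts its letters. The function names and the short fragment are illustrative assumptions, not part of any standard library for biology.

```python
from collections import Counter


def base_counts(sequence: str) -> Counter:
    """Count how often each of A, C, G and T occurs in a DNA string."""
    seq = sequence.upper()
    unknown = set(seq) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected letters in sequence: {sorted(unknown)}")
    return Counter(seq)


def gc_fraction(sequence: str) -> float:
    """Fraction of letters that are G or C, a simple summary of a sequence."""
    counts = base_counts(sequence)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0


# A short illustrative fragment (assumption: any string over A, C, G, T works).
fragment = "GGCAGATTCC"
print(base_counts(fragment))
print(gc_fraction(fragment))
```

Simple summaries like these letter counts are often the first features an engineer computes before feeding sequence data to a learning algorithm.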

The gene

The sequence on our DNA appears to run on without a beginning or an end, but in reality it is divided into stretches of letters called genes.

GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGTATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACCATGAGAGCCCTCACACTCCTCGCCCTATTGGCCCTGGCCGCACTTTGCATCGCTGGCCAGGCAGGTGAGTGCCCCCACCTCCCCTCAGGCCGCATTGCAGTGGGGGCTGAGAGGAGGAAGCACCATGGCCCACCTCTTCTCACCCCTTTGGCTGGCAGTCCCTTTGCAGTCTAACCACCTTGTTGCAGGCTCAATCCATTTGCCCCAGCTCTGCCCTTGCAGAGGGAGAGGAGGGAAGAGCAAGCTGCCCGAGACGCAGGGGAAGGAGGATGAGGGCCCTGGGGATGAGCTGGGGTGAACCAGGCTCCCTTTCCTTTGCAGGTGCGAAGCCCAGCGGTGCAGAGTCCAGCAAAGGTGCAGGTATGAGGATGGACCTGATGGGTTCCTGGACCCTCCCCTCTCACCCTGGTCCCTCAGTCTCATTCCCCCACTCCTGCCACCTCCTGTCTGGCCATCAGGAAGGCCAGCCTGCTCCCCACCTGATCCTCCCAAACCCAGAGCCACCTGATGCCTGCCCCTCTGCTCCACAGCCTTTGTGTCCAAGCAGGAGGGCAGCGAGGTAGTGAAGAGACCCAGGCGCTACCTGTATCAATGGCTGGGGTGAGAGAAAAGGCAGAGCTGGGCCAAGGCCCTGCCTCTCCGGGATGGTCTGTGGGGGAGCTGCAGCAGGGAGTGGCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGGAGGGATGGGCATTTTGCACGGGGGCTGATGCCACCACGTCGGGTGTCTCAGAGCCCCAGTCCCCTACCCGGATCCCCTGGAGCCCAGGAGGGAGGTGTGTGAGCTCAATCCGGACTGTGACGAGTTGGCTGACCACATCGGCTTTCAGGAGGCCTATCGGCGCTTCTACGGCCCGGTCTAGGGTGTCGCTCTGCTGGCCTGGCCGGCAACCCCAGTTCTGCTCCTCTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATCATCCCAGCTGCTCCCAAATAAACTCCAGAAG

(The sequence for HSBGPG, the human gene for bone gla protein (BGP))

There is no standard length for a gene. One gene is a recipe for making one protein. We will now see how.

Amino Acids

All living beings are built from simple compounds called amino acids. As of the last count, 21 of them are used to make our proteins (the 20 standard ones plus selenocysteine). This means the proteins of all the life you see around you are made from just 21 chemicals. That's some amazing modularity that makes data scientists happy.

Each amino acid is represented by three letters on DNA. Examples:

AGC – Serine

GCA – Alanine

See more for yourself in any standard codon table.

(In a little quirk of biology, T is replaced by U, but I will tell you that one later when we learn about RNA)
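The three-letter lookup described above can be sketched as a small dictionary. This is a deliberately incomplete table, assuming DNA letters (T rather than U); it holds the two examples from the text plus two other well-known codons, and the function names are hypothetical.

```python
# A tiny, intentionally incomplete lookup from DNA triplets to amino acids.
# AGC and GCA come from the article; GGC and ATG are standard codons added
# here only to make the example longer.
CODON_TO_AMINO_ACID = {
    "AGC": "Serine",
    "GCA": "Alanine",
    "GGC": "Glycine",
    "ATG": "Methionine",
}


def split_into_codons(dna: str) -> list[str]:
    """Cut a DNA string into consecutive three-letter groups."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete triplet
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def name_amino_acids(dna: str) -> list[str]:
    """Look up each triplet; unknown triplets are reported as '?'."""
    return [CODON_TO_AMINO_ACID.get(codon, "?") for codon in split_into_codons(dna)]


print(name_amino_acids("ATGAGCGCAGGC"))
# ['Methionine', 'Serine', 'Alanine', 'Glycine']
```

A full table has 64 entries, so several different triplets map to the same amino acid.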

An amino acid is an organic compound, which means it is made from carbon. This is why they say that life on earth is carbon based.

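This is how amino acids look: a central carbon atom bonded to an amino group (NH2), a carboxyl group (COOH), a hydrogen atom and a group R, where R represents a group of atoms, different for each amino acid.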

Amino Acids are the building blocks of proteins. That's the next link in this explanation.

Proteins

Proteins are long sequences of amino acids, turned and twisted in a particular way.

The figure on the right is a representation that shows various substructures in a protein, such as coils, wires etc. We should keep in mind that all these structures are made by the twisting and turning of amino acid sequences, and the colours and shapes are purely representational.

So now you know that a gene is an instruction set for making one protein. It is almost as if it says: ‘put some Alanine, add a pinch of Glycine, ..., twist, twist, ..., turn’. But how does the manufacturing happen? That brings us to RNA and some messaging.

The manufacturing takes place in microscopic machines inside the cell called ribosomes. When a gene is to be made into a protein, a copy of the sequence of that gene is made on a material called RNA. For our purposes there are just two differences between DNA and RNA: a) the chemical T is replaced by the chemical U and b) RNA has just one strand, one side of the ladder.

This copy of the gene, called messenger RNA or mRNA, travels to the ribosome. The ribosome is like a machine in a plastic factory. From one side, a generous supply of amino acids is fed to it. It then connects them as per the sequence in the mRNA, twists and turns as per the instructions, and sends out a manufactured protein molecule from the other side.

When a gene causes its protein to be manufactured in a particular cell, the gene is said to be ‘expressed’. This gives rise to gene expression data, which basically records which genes are expressed in a particular cell. Gene expression data is increasingly being used in detecting many disorders and discovering new drugs.
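The copying step can be sketched in a few lines of Python. This is a simplification under stated assumptions: the input is taken to be the gene's letters as written in the article, the copy is made by swapping T for U, and details such as which strand is read or any editing of the copy are ignored. The function name `transcribe` is illustrative.

```python
def transcribe(gene: str) -> str:
    """Return a single-stranded RNA copy of a DNA string (T becomes U)."""
    gene = gene.upper()
    unknown = set(gene) - set("ACGT")
    if unknown:
        raise ValueError(f"unexpected letters in gene: {sorted(unknown)}")
    return gene.replace("T", "U")


print(transcribe("ATGAGCGCATAA"))  # AUGAGCGCAUAA
```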
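What the ribosome does with the mRNA can be sketched as reading three letters at a time and appending the matching amino acid. The sketch below builds the standard genetic code table from its usual compact ordering and uses one-letter amino acid codes. It starts at the first AUG and stops at a stop codon, and it does not model the twisting and turning (folding) at all. The function name `translate` is an assumption for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G ordering.
# '*' marks a stop codon; letters are one-letter amino acid codes.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read an mRNA string codon by codon, from the first AUG to a stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is manufactured
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGAGCGCAUAA"))  # MSA (Methionine, Serine, Alanine)
```

Real genes are longer and are often edited before translation, so running this directly on a raw genomic sequence will usually not give the true protein.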
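Gene expression data can be pictured as a table of cells against genes. The sketch below uses made-up gene names, made-up numbers and an arbitrary threshold purely for illustration; real expression data is measured experimentally and needs careful normalisation before any such comparison.

```python
# Hypothetical expression levels (arbitrary units) for two made-up cells.
expression = {
    "cell_A": {"GENE1": 12.0, "GENE2": 0.0, "GENE3": 3.5},
    "cell_B": {"GENE1": 0.2, "GENE2": 8.1, "GENE3": 4.0},
}

THRESHOLD = 1.0  # assumption: levels at or above this count as 'expressed'


def expressed_genes(levels: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Names of the genes whose level reaches the threshold, sorted."""
    return sorted(gene for gene, level in levels.items() if level >= threshold)


def differing_genes(cell_1: dict[str, float], cell_2: dict[str, float]) -> list[str]:
    """Genes expressed in exactly one of the two cells."""
    only_one = set(expressed_genes(cell_1)) ^ set(expressed_genes(cell_2))
    return sorted(only_one)


print(expressed_genes(expression["cell_A"]))  # ['GENE1', 'GENE3']
print(differing_genes(expression["cell_A"], expression["cell_B"]))  # ['GENE1', 'GENE2']
```

Comparing which genes differ between, say, healthy and diseased cells is the kind of pattern that the learning techniques mentioned earlier are asked to find at a much larger scale.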

Well, I think that is enough content for one article. Allow me to break this into a series and discuss some use cases based on genetic data in the next part. I have two in mind – identifying gene functions and repositioning drugs.

