Simple ML workflow to identify antiviral molecules

Rajeev Gangal

Published: March 23, 2020

All Blockbusters have a Prequel....

COVID-19 is ravaging the world with high incidence and prevalence rates and a high but variable mortality. Companies, individuals and non-profits have all taken to LinkedIn to announce various efforts to help customers and researchers. ML/AI-oriented companies have announced initiatives too. For now, let's just assume they are all well intentioned and not mere "hokum".

While my heart and mind are now devoted to a different domain, namely digital telco, in a former life much of my ML work pertained to drug design and screening. Drug target and MOA/pharmacology prediction was my comfort zone. Thus, on a weekend with the family caught in the wiles of Netflix, Amazon Prime and their kin, I dusted off my home laptop, updated my trusted KNIME (www.knime.com) platform and built some very simple ML models for predicting, nay, screening antiviral molecules. Nooooooooooooo, don't judge me for not using Python, the command line or the "CLOUD". It is but for the young to delve into those mysteries...

And Action....

Training data was obtained from CTD (https://ctdbase.org/downloads/#cd). Pre-processing was done using CDK nodes: it retained synthetic organic drugs and natural-product-derived ones, and kept only the largest fragments, among other clean-ups. Most cheminformatics tasks rely on encoding molecules as vectors, based on several schemes. All these vectorization/fingerprinting variants are available in CDK, Indigo and RDKit. I spent Sunday evening using these fingerprints and their combinations alongside ML algorithms like the Bayesian fingerprint learner, Random Forest and gradient-boosted (GB) trees. But, as is the bane of most ML modellers, the accuracy in terms of false positives/negatives was abysmal. I then had a choice between switching to the numerical features (theoretical and experimental) that the above libraries also calculate, or giving fingerprints another try. (I decided in favour of the latter.)
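To make the idea of a fingerprint concrete, here is a deliberately toy sketch in Python (not part of the KNIME workflow). It hashes short character fragments of a SMILES string into a fixed-length bit vector. Real fingerprints from CDK, Indigo or RDKit work on the molecular graph rather than on raw characters; the function name `toy_fingerprint`, the 64-bit length and the fragment size are all assumptions made purely for illustration.

```python
import hashlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of up to max_len characters into one of n_bits positions."""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # sha256 serves only as a stable hash (Python's built-in hash() is salted per run).
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


ethanol = toy_fingerprint("CCO")          # SMILES for ethanol
diethyl_ether = toy_fingerprint("CCOCC")  # SMILES for diethyl ether
shared = sum(a & b for a, b in zip(ethanol, diethyl_ether))
print(f"bits set: {sum(ethanol)} vs {sum(diethyl_ether)}, shared: {shared}")
```

Because every fragment of "CCO" also occurs in "CCOCC", every bit set for ethanol is also set for diethyl ether. That kind of shared-substructure signal is what a learner is meant to pick up from a fingerprint.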

There is no Species called "One Shot ML modeller"

My initial ML workflow used a group-by loop to iterate through all ATC classes and build 'One vs All' models using binary classifiers, so that I would have as many ML models as classes (the cardinality of the response). This is important, since one molecule can have multiple activities. Yes, as usual I had to deal with class imbalance, sampling trouble and what not!
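The 'One vs All' set-up can be sketched in a few lines of Python. This is only an illustration of how the labels are built, not the KNIME workflow itself; the molecule names and their class assignments are hypothetical toy data, and `one_vs_all_labels` is a name made up for this sketch.

```python
# Hypothetical toy data: molecule name -> set of top-level ATC classes.
molecule_classes = {
    "mol_a": {"J"},
    "mol_b": {"C"},
    "mol_c": {"J", "C"},  # one molecule, two activities
    "mol_d": {"N"},
    "mol_e": {"C"},
}


def one_vs_all_labels(annotations):
    """Build one binary label vector per class (1 = molecule belongs to it)."""
    names = sorted(annotations)
    all_classes = sorted(set().union(*annotations.values()))
    return {
        cls: [1 if cls in annotations[name] else 0 for name in names]
        for cls in all_classes
    }


labels = one_vs_all_labels(molecule_classes)
for atc_class, y in labels.items():
    # The positive fraction shows how imbalanced each binary problem is.
    print(atc_class, y, f"positives={sum(y)}/{len(y)}")
```

Each label vector would feed its own binary classifier. Note how "mol_c" is a positive for two classes, and how few positives each class has even in this tiny example.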

Each ATC code has parts that allow us to classify the molecule, but for our purpose the ATC code (or the corresponding description) is our response variable.
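For readers unfamiliar with ATC, a full code has seven characters spread over five hierarchical levels. The sketch below splits a code into those levels; the function name `split_atc` is made up for this example, and which level served as the response variable in the workflow is not specified here. The example code J05AB01 is aciclovir, and J05 is the ATC group "Antivirals for systemic use".

```python
def split_atc(code):
    """Return the five hierarchical levels of a 7-character ATC code."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        "anatomical_group": code[:1],          # level 1, e.g. J
        "therapeutic_subgroup": code[:3],      # level 2, e.g. J05
        "pharmacological_subgroup": code[:4],  # level 3, e.g. J05A
        "chemical_subgroup": code[:5],         # level 4, e.g. J05AB
        "chemical_substance": code[:7],        # level 5, e.g. J05AB01
    }


levels = split_atc("J05AB01")
is_systemic_antiviral = levels["therapeutic_subgroup"] == "J05"
print(levels, is_systemic_antiviral)
```

Truncating the code at a chosen level is one simple way to turn it into a class label of the desired granularity.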

Didn't work, didn't work... didn't work. ML modelling is no easy business, and it's sometimes hard to understand whether your feature space is inadequate, your training set has issues, or something else entirely is wrong. If someone claims to be very successful at quickly building accurate ML models, ask them if they also believe in astrology :-) (How many readers have I annoyed, one wonders.)

Like ML models my Spirit was finally Boosted!

What finally worked was the trick of expanding the binary fingerprint vector (PubChem) into individual columns and then using these columns as input to the ML model. A Random Forest with 10 trees, used as a multi-class classifier, gave the best results. What was sacrificed in this scenario was the prediction of multiple classes: the model predicts only one class per molecule among all the ATC classes (https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System). That is adequate for now, since our purpose is only to predict antiviral classes, irrespective of whether a molecule can play some other therapeutic role.
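Expanding a bit vector simply means giving every bit position its own column. A minimal Python sketch, assuming the fingerprint arrives as a string of 0s and 1s (the 5-bit strings, the molecule names and the helper `expand_bits` are hypothetical; real PubChem fingerprints have 881 bits):

```python
def expand_bits(bitstring, prefix="bit"):
    """Turn a fingerprint string such as '10110' into one 0/1 column per bit."""
    if set(bitstring) - {"0", "1"}:
        raise ValueError("fingerprint must contain only 0 and 1")
    return {f"{prefix}{i}": int(ch) for i, ch in enumerate(bitstring)}


# Hypothetical toy fingerprints.
fingerprints = {"mol_a": "10110", "mol_b": "00101"}
feature_table = {name: expand_bits(fp) for name, fp in fingerprints.items()}

for name, row in feature_table.items():
    print(name, row)
```

Each row is now an ordinary table of numeric columns, which is the form most tree-based learners expect. Outside KNIME, one could pass such a table to something like scikit-learn's `RandomForestClassifier(n_estimators=10)`.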

For evaluation, the same CTD database was fed to the model, and predictions that corresponded to known antivirals were filtered out. This left novel predictions (vis-à-vis the training set), such as adenosine, whose analogs have known antiviral activity, or some other cardiovascular agents. It is a rudimentary workflow, no great shakes in terms of feature engineering or parameter optimization, but a beginning.
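The filtering step amounts to a set difference: keep molecules predicted as antiviral, then drop the ones already annotated as antivirals. A small Python sketch, with placeholder molecule names and class labels (none of this is real CTD data, and `novel_antiviral_candidates` is a name invented here):

```python
def novel_antiviral_candidates(predictions, known_antivirals, target_class="antiviral"):
    """Keep molecules predicted as target_class that are not already known antivirals."""
    known = {name.lower() for name in known_antivirals}
    return [
        name
        for name, predicted in predictions
        if predicted == target_class and name.lower() not in known
    ]


# Hypothetical (molecule, predicted class) pairs and known antivirals.
predictions = [
    ("mol_a", "antiviral"),
    ("mol_b", "cardiovascular"),
    ("mol_c", "antiviral"),
    ("MOL_D", "antiviral"),
]
known_antivirals = ["mol_a", "mol_d"]

candidates = novel_antiviral_candidates(predictions, known_antivirals)
print(candidates)  # ['mol_c']
```

Names are compared case-insensitively here. With real data, matching on a stable identifier would be safer than matching on names.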

I am making this workflow available (Download link) in the hope that you can screen larger data sets from other sources like ZINC and even refine the workflow.

Disclaimer: This is purely a personal effort to respond to COVID-19, with no connection to my present employer, any other company or individual. I make no claim that the predictive model is accurate for clinical hypotheses or that it will find molecules active against COVID-19.

Indira Ghosh

Professor at Jawaharlal Nehru University

4 years ago

Excellent Rajeev ,all path to attack this is needed, at least boost our moral!?

Sameer Chaudhary

CSO & Co-Founder at RASA Life Science Informatics

4 years ago

Got 3 compound for covid19 applied for Indian patent today will submit paper and proposal by next Monday after getting patent file no my contribution for India let's hope for the best

Tanmay Banerjee

Informatics Consultant at Biocon Bristol Myer Squibb Research Center

4 years ago

Did you predict before Trump?

KK Bhagchandani

4 years ago

Great stuff! Keep going, this is the right direction

Kamesh Janakiraman

Connecting People to Scientific Insights & Actionable Outcomes - Digital Transformation, Informatics Sales Leader

4 years ago

Awesome buddy! Excellent contribution
