Simple ML workflow to identify antiviral molecules

All Blockbusters have a Prequel....

COVID-19 is ravaging the world with high incidence and prevalence rates and a high but variable mortality. Companies, individuals and non-profits have all taken to LinkedIn to announce various efforts to help customers and researchers. ML/AI-oriented companies have announced initiatives of their own. For now, let's just assume they are all well intentioned and not mere "Hokum".

While my heart and mind are now devoted to a different domain, namely digital telco, in a former life much of my ML work pertained to drug design and screening. Drug target and MOA/pharmacology prediction was my comfort zone. Thus, on a weekend with the family caught in the wiles of Netflix/Amazon Prime and their kin, I dusted off my home laptop, updated my trusted KNIME (www.knime.com) platform and built some very simple ML models for predicting, nay screening, antiviral molecules. Nooooooooooooo, don't judge me for not using Python, the command line or the "CLOUD". It is but for the young to delve into those mysteries...

And Action....

Training data was obtained from CTD (https://ctdbase.org/downloads/#cd). Pre-processing was done using CDK nodes: it retained synthetic organic drugs and natural-product-derived ones, and kept only the largest fragment of each molecule, among other clean-ups. Most cheminformatics tasks rely on encoding molecules as vectors, based on several schemes. All these vectorizing/fingerprinting variants are available in CDK, Indigo and RDKit. I spent Sunday evening using these fingerprints and their combinations alongside ML algorithms like a Bayesian fingerprint learner, Random Forest and gradient-boosted (GB) trees. But as is the bane of most ML modellers, the accuracy in terms of false positives/negatives was abysmal. It was then that I had a choice between switching to the numerical features (theoretical and experimental) that the above libraries also calculate, or giving fingerprints another try. (I decided in favour of the latter.)
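For readers who do live in Python, here is a rough sketch of the "keep only the largest fragment" clean-up. It is not the CDK node used in the workflow: it assumes molecules arrive as SMILES strings, that fragments are separated by ".", and it uses a crude regex atom count (the names `heavy_atom_count` and `largest_fragment` are mine). A proper toolkit such as CDK or RDKit would parse the structure instead.

```python
import re

# Crude atom tokenizer (an assumption, not a full SMILES parser):
# bracket atoms first, then two-letter halogens, then single-letter
# organic-subset atoms in aliphatic and aromatic form.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Approximate the number of atoms written out in a SMILES fragment."""
    return len(_ATOM.findall(fragment))


def largest_fragment(smiles: str) -> str:
    """Return the fragment with the most atoms; ties keep the first one."""
    fragments = smiles.split(".")
    return max(fragments, key=heavy_atom_count)


if __name__ == "__main__":
    # A sodium salt written as two disconnected fragments.
    print(largest_fragment("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```

The point is only that counter-ions and other small fragments are dropped before fingerprints are computed, so that the vector describes the drug-like part of the record.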

There is no Species called "One Shot ML modeller"

My initial ML workflow used a group-by loop to iterate through all ATC classes and build 'One vs All' models using binary classifiers, so that I would have as many ML models as classes (the response cardinality). This is important, since one molecule can have multiple activities. Yes, as usual I had to deal with class imbalance, sampling trouble and what not!
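The 'One vs All' set-up is easier to see with a toy example. The sketch below is a stand-in for the KNIME loop, not a copy of it: the molecule names and class labels are made up, and `one_vs_all_labels` and `positive_fraction` are hypothetical helper names. It builds one binary target per class from multi-label annotations and reports how imbalanced each target is.

```python
from typing import Dict, Set


def one_vs_all_labels(annotations: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Build one binary target per class: 1 if the molecule has it, else 0."""
    classes = sorted({cls for labels in annotations.values() for cls in labels})
    return {
        cls: {mol: int(cls in labels) for mol, labels in annotations.items()}
        for cls in classes
    }


def positive_fraction(target: Dict[str, int]) -> float:
    """Share of positives in one binary target, a quick imbalance check."""
    return sum(target.values()) / len(target) if target else 0.0


if __name__ == "__main__":
    # Toy annotations (illustrative only): a molecule may carry several classes.
    toy = {
        "mol_a": {"antiviral"},
        "mol_b": {"antiviral", "cardiac"},
        "mol_c": {"cardiac"},
        "mol_d": {"analgesic"},
    }
    for cls, target in one_vs_all_labels(toy).items():
        print(cls, target, round(positive_fraction(target), 2))
```

Each binary target would then get its own classifier, which is why a molecule with several activities can be flagged by several models at once.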

Each ATC code is made up of parts that allow us to classify the molecule at several levels, but for our purpose the ATC code (or its corresponding description) is our response variable.
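To make the "parts" concrete, here is a small sketch that splits a full seven-character ATC code into its five levels. The field names and the `is_systemic_antiviral` helper are my own labels, and which level the workflow actually used as the response is not something this snippet settles. It relies on J05 being the ATC group "antivirals for systemic use".

```python
import re
from typing import NamedTuple


class ATCCode(NamedTuple):
    anatomical_group: str          # level 1, e.g. "J"
    therapeutic_subgroup: str      # level 2, e.g. "J05"
    pharmacological_subgroup: str  # level 3, e.g. "J05A"
    chemical_subgroup: str         # level 4, e.g. "J05AB"
    chemical_substance: str        # level 5, e.g. "J05AB01" (aciclovir)


# Letter, two digits, two letters, two digits.
_ATC_PATTERN = re.compile(r"^[A-Z]\d{2}[A-Z]{2}\d{2}$")


def parse_atc(code: str) -> ATCCode:
    """Split a full 7-character ATC code into its five nested levels."""
    code = code.strip().upper()
    if not _ATC_PATTERN.match(code):
        raise ValueError(f"not a full 7-character ATC code: {code!r}")
    return ATCCode(code[:1], code[:3], code[:4], code[:5], code)


def is_systemic_antiviral(code: str) -> bool:
    """True when the code falls under J05 (antivirals for systemic use)."""
    return parse_atc(code).therapeutic_subgroup == "J05"


if __name__ == "__main__":
    print(parse_atc("J05AB01"))
    print(is_systemic_antiviral("J05AB01"))
```

Truncating the code at a chosen level is one simple way to control how many classes the model has to separate.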


Didn't work, didn't work... didn't work. ML modelling is no easy business, and it's sometimes hard to understand whether your feature space is inadequate, the training set has issues or it's something else entirely. If someone claims to be very successful at quickly building accurate ML models... ask them if they also believe in astrology :-) (How many readers have I annoyed, one wonders.)

Like ML models my Spirit was finally Boosted!

What finally worked was the trick of expanding the binary fingerprint vector (PubChem) into individual columns and then using these columns as input to the ML model. A random forest with 10 trees as a multi-class classifier gave the best results. What was sacrificed in this scenario was the prediction of multiple classes: the model predicts only one class per molecule among all the ATC classes (https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System). It's adequate for now, since our purpose is only to predict the antiviral class, irrespective of whether the molecule can play some other therapeutic role.
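The expansion trick can be sketched in a few lines. This assumes each fingerprint is available as a string of "0" and "1" characters (PubChem fingerprints are 881 bits long; the toy ones here are 8 bits), and the function and column names are hypothetical, not taken from the KNIME workflow.

```python
from typing import Dict, List, Tuple


def expand_fingerprint(bits: str) -> List[int]:
    """Turn a bit string such as '0110' into one 0/1 value per position."""
    if set(bits) - {"0", "1"}:
        raise ValueError("fingerprint must contain only '0' and '1'")
    return [int(b) for b in bits]


def build_feature_table(
    fingerprints: Dict[str, str],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (molecule names, column names, rows) with one column per bit."""
    lengths = {len(bits) for bits in fingerprints.values()}
    if len(lengths) != 1:
        raise ValueError("all fingerprints must have the same length")
    n_bits = lengths.pop()
    names = sorted(fingerprints)
    columns = [f"bit{i}" for i in range(n_bits)]
    rows = [expand_fingerprint(fingerprints[name]) for name in names]
    return names, columns, rows


if __name__ == "__main__":
    # Toy 8-bit fingerprints, for illustration only.
    toy = {"mol_a": "10010001", "mol_b": "01010101"}
    names, columns, rows = build_feature_table(toy)
    print(columns)
    for name, row in zip(names, rows):
        print(name, row)
```

The resulting rows are an ordinary numeric table, so they can be handed to any multi-class learner, for example a random forest with 10 trees, alongside one class label per molecule.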

For evaluation, the same CTD database was input, and predictions that corresponded to known antivirals were filtered out. This gave us novel predictions (vis-à-vis the training set) such as adenosine, whose analogs have known antiviral activity, and some other cardiovascular agents. A rudimentary workflow, no great shakes in terms of feature engineering or parameter optimization, but a beginning.
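The filtering step amounts to a set difference. Here is a minimal sketch, assuming the model output is a mapping from molecule name to a single predicted class and that known antivirals are available as a list of names; the function name, the "antiviral" label and the toy data are illustrative, not actual workflow output.

```python
from typing import Dict, Iterable, List


def novel_antiviral_predictions(
    predictions: Dict[str, str],
    known_antivirals: Iterable[str],
    target_class: str = "antiviral",
) -> List[str]:
    """Keep molecules predicted as antiviral that are not already known ones."""
    known = {name.lower() for name in known_antivirals}
    return sorted(
        mol
        for mol, predicted in predictions.items()
        if predicted == target_class and mol.lower() not in known
    )


if __name__ == "__main__":
    # Toy predictions, for illustration only.
    toy_predictions = {
        "aciclovir": "antiviral",
        "adenosine": "antiviral",
        "aspirin": "analgesic",
    }
    print(novel_antiviral_predictions(toy_predictions, ["Aciclovir"]))
```

What is left after the filter is the candidate list worth a closer look, since those molecules are new relative to the training labels.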



I am making this workflow available (Download link) in the hope that you can screen larger datasets from other sources like ZINC and even refine the workflow.


Disclaimer: This is purely a personal effort to respond to COVID-19, with no connection to my present employer, any other company or individual. I make no claim that the predictive model is accurate for clinical hypotheses or that it will find molecules active against COVID-19.



Indira Ghosh

Professor at Jawaharlal Nehru University

4 years ago

Excellent, Rajeev. All paths to attack this are needed, at least to boost our morale!

Sameer Chaudhary

CSO & Co-Founder at RASA Life Science Informatics

4 years ago

Got 3 compounds for COVID-19; applied for an Indian patent today. Will submit the paper and proposal by next Monday, after getting the patent file number. My contribution for India; let's hope for the best.

Tanmay Banerjee

Informatics Consultant at Biocon Bristol Myer Squibb Research Center

4 years ago

Did you predict before Trump?

KK Bhagchandani

CxO Mentor | Executive Presence Coach | Asian Business Consultant | CBO | Coach | Impactful leadership | IMC-CPM | Global Business Leader | Tech moderator| TEDx Speaker | Public Speaking Coach

4 years ago

Great stuff! Keep going, this is the right direction

Kamesh Janakiraman

Connecting People to Scientific Insights & Actionable Outcomes - Digital Transformation, Informatics Sales Leader

4 years ago

Awesome buddy! Excellent contribution


More articles by Rajeev Gangal
