Generating Diverse Molecules

Matthew Clark

Vice President of Data Science

Published: February 17, 2022

In our AI projects for chemistry and retrosynthesis we often need sets of diverse molecules: as comparators of diversity, for training molecular transformers, for in-silico screening, and as general examples.

So we created a Python program that generates an arbitrary number of random, valid organic molecules within a given range of heavy-atom counts and from a given list of organic elements.

This molecule stream could be used with AI models such as the recently reported EquiBind from MIT, which can quickly evaluate molecules to estimate target binding.

Enumeration of structures has been used to create a large database (Reymond 2012) and to evaluate diversity (Lipkus 2008), and random generation has been reported using the commercial tool "Molgen". However, storing and managing large databases can be inconvenient. It can be more convenient to generate a stream "on demand" whenever a large number of diverse molecules is needed. The Reymond paper has been cited over 500 times, which speaks to the utility of the idea. Using modern Python and RDKit tools (RDKit), a similar end product can be accomplished with a fairly small amount of code. Future steps may include adding a measure of synthesizability to score the compounds, for example identifying whether a compound could be synthesized with a Suzuki coupling.

In the first step, a target number of atoms defining the molecule size is randomly drawn from a desired range. The approach is to start with a single carbon and then, in successive steps, randomly choose an existing atom to which an additional carbon atom is connected, building a carbon framework with the desired number of atoms.
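The growth step can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the project: the function name `grow_carbon_framework`, the bond-list representation, and the limit of four neighbours per carbon are assumptions made for the example, and ring closure is not covered.

```python
import random

MAX_CARBON_BONDS = 4  # assumption: a carbon accepts at most four neighbours


def grow_carbon_framework(min_atoms, max_atoms, rng=None):
    """Return (n_atoms, bonds) for a randomly grown acyclic carbon skeleton.

    bonds is a list of (parent, child) atom-index pairs, all single bonds.
    """
    rng = rng or random.Random()
    target = rng.randint(min_atoms, max_atoms)  # pick the molecule size
    degree = [0]  # atom 0 is the starting carbon, with no bonds yet
    bonds = []
    while len(degree) < target:
        # Only atoms with a free valence can accept another carbon.
        open_atoms = [i for i, d in enumerate(degree) if d < MAX_CARBON_BONDS]
        parent = rng.choice(open_atoms)
        child = len(degree)  # index of the newly added carbon
        degree.append(1)
        degree[parent] += 1
        bonds.append((parent, child))
    return target, bonds


if __name__ == "__main__":
    n_atoms, bonds = grow_carbon_framework(4, 10, random.Random(42))
    print(n_atoms, bonds)
```

Because every new carbon is attached to an atom that already exists, the result is always a connected tree with one bond fewer than atoms, and a terminal atom is always available as an attachment point.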

In the next steps, bonds are randomly chosen to be promoted to double or triple bonds if the valence permits, and then atoms are randomly chosen to be changed to one of a list of heteroatoms, again if the valence permits. Each of these steps is guided by Gaussian-distributed probabilities chosen to make the molecules somewhat similar to known molecules.
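A rough sketch of these two steps, under stated assumptions, might look like the following. The valence table, the function names, and the default `mean`, `sd`, and `fraction` values are hypothetical choices for illustration. The project's actual probabilities are not given in this article.

```python
import random

# Typical valences; illustrative values for this sketch only.
VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1}


def used_valence(atom, bonds):
    """Sum of bond orders on `atom`; bonds maps (i, j) -> bond order."""
    return sum(order for (i, j), order in bonds.items() if atom in (i, j))


def promote_bonds(elements, bonds, mean=0.2, sd=0.1, rng=None):
    """Raise randomly chosen bond orders by one where both ends have spare valence."""
    rng = rng or random.Random()
    # Gaussian-distributed fraction of bonds to try, clipped to [0, 1].
    fraction = min(1.0, max(0.0, rng.gauss(mean, sd)))
    attempts = round(fraction * len(bonds))
    keys = list(bonds)
    for _ in range(attempts):
        i, j = rng.choice(keys)
        spare = all(used_valence(a, bonds) < VALENCE[elements[a]] for a in (i, j))
        if spare and bonds[(i, j)] < 3:
            bonds[(i, j)] += 1


def substitute_heteroatoms(elements, bonds, weights, fraction=0.2, rng=None):
    """Swap randomly chosen carbons for heteroatoms when the bonds still fit."""
    rng = rng or random.Random()
    symbols = list(weights)
    for atom in range(len(elements)):
        if elements[atom] != "C" or rng.random() >= fraction:
            continue
        new = rng.choices(symbols, weights=[weights[s] for s in symbols])[0]
        # Accept the swap only if the existing bonds fit the new valence.
        if used_valence(atom, bonds) <= VALENCE[new]:
            elements[atom] = new


if __name__ == "__main__":
    rng = random.Random(7)
    elements = ["C"] * 5
    bonds = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (2, 4): 1}
    promote_bonds(elements, bonds, rng=rng)
    substitute_heteroatoms(elements, bonds, {"N": 1.0, "O": 1.0}, rng=rng)
    print(elements, bonds)
```

Checking the valence before each change means a rejected move is simply skipped, so the structure stays valence-valid at every step.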

To obtain the probabilities for known molecules, we looked at the ~30 million molecules indexed in Reaxys and measured the proportion of heteroatoms to carbons across the entire set, in order to calibrate the generation to be broadly similar to known organic molecules. This gave proportions that one can use in random generation. Some of the ratios computed from that analysis are:

N/C - 0.645
O/C - 0.744
Cl/C - 0.154
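One possible way to use such ratios in random generation is to treat them as relative weights when picking a heteroatom. The sketch below assumes exactly that, which may differ from how the project applies them. The names `to_probabilities` and `pick_heteroatom` are hypothetical.

```python
import random

# Ratios quoted in the list above.
HETERO_TO_CARBON = {"N": 0.645, "O": 0.744, "Cl": 0.154}


def to_probabilities(ratios):
    """Normalise relative weights so that they sum to one."""
    total = sum(ratios.values())
    return {symbol: value / total for symbol, value in ratios.items()}


def pick_heteroatom(ratios, rng=None):
    """Draw one element symbol with probability proportional to its ratio."""
    rng = rng or random.Random()
    symbols = list(ratios)
    return rng.choices(symbols, weights=[ratios[s] for s in symbols])[0]


if __name__ == "__main__":
    print(to_probabilities(HETERO_TO_CARBON))
    print(pick_heteroatom(HETERO_TO_CARBON, random.Random(0)))
```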

Using the heteroatom proportions from Reaxys, 4,727,855 of a sample of 5 million generated molecules (about 94.6%) pass the Lipinski rules of drug-likeness. The molecules are generated at about 1,100 molecules per second in a single thread, making the method feasible for automated predictions. In some cases valence-valid but non-feasible bonds and structures are generated, but these are rare.

A hash map of all generated molecules, keyed on their InChI strings (not InChIKeys), was kept to detect uniqueness and to count how often a particular structure was generated. In addition, the conversion to InChI acted as a filter for invalid structures: any generated structure that cannot be converted to an InChI string by RDKit is discarded. Highly strained molecules were not filtered out, but could be removed in a post-processing step.
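The bookkeeping described above can be sketched with a `Counter`. To keep the example self-contained it takes the canonicalisation function as a parameter. The name `count_unique` and the `to_key` parameter are hypothetical, and the toy key function in the demo is only a stand-in for an InChI conversion.

```python
from collections import Counter


def count_unique(structures, to_key):
    """Count how often each distinct structure appears in a stream.

    to_key converts a structure to a canonical string. A structure is
    discarded when to_key raises or returns an empty value.
    """
    counts = Counter()
    discarded = 0
    for structure in structures:
        try:
            key = to_key(structure)
        except Exception:
            key = None
        if not key:
            discarded += 1  # the conversion doubles as a validity filter
            continue
        counts[key] += 1
    return counts, discarded


if __name__ == "__main__":
    # Toy canonicaliser: sorted letters, rejecting anything non-alphabetic.
    def toy_key(s):
        return "".join(sorted(s)) if s.isalpha() else None

    counts, discarded = count_unique(["CCO", "OCC", "C1", "CN"], toy_key)
    print(counts, discarded)
```

With RDKit installed, `Chem.MolToInchi` would be a natural candidate for `to_key`. How it signals a failed conversion should be checked against the RDKit documentation before relying on this filter.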

All hydrocarbon isomers with six carbon atoms

The system was first tested on hydrocarbons. These are all unique generated examples of hexanes, generated at least 7 times by the algorithm. Some are generated much more often than others due to the multiple symmetric ways that the structure could be assembled.

Hydrocarbons with six atoms and multiple bonds

When multiple bonds are added, the number of possible structures increases considerably,

and when heteroatoms are assigned, the number of unique structures becomes very large. In principle, the algorithm should eventually generate all possible valid chemical structures containing the enumerated elements within the given range of atom counts.

To test the algorithm, we filtered the stream to look for known drugs. The simple molecule propofol appeared as the 667,536th generated molecule in the test run.

This project illustrates a simple way to make a continuous stream of diverse molecules for projects that seek to find novel structures. Possible uses include:

Training AI projects, particularly transformers
Drug screening
Docking

For example, if one had an accurate bioactivity prediction model for a given target, one could stream generated molecules through the model and collect a sample of novel structures that are predicted to have high affinity for that target. The process can then be validated by identifying any known active molecules that were generated, as in the propofol example.
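The streaming idea can be sketched as a small filter loop. Everything here is a placeholder for illustration: `collect_hits`, the `predict` callable, the threshold, and the string representation of molecules are assumptions, and no real bioactivity model is implied.

```python
def collect_hits(stream, predict, threshold, n_hits, known_actives=frozenset()):
    """Pull molecules from a stream until n_hits score at or above threshold.

    Returns (hits, rediscovered), where rediscovered lists the known
    actives seen along the way and can be used to validate the run.
    """
    hits = []
    rediscovered = []
    for molecule in stream:
        if molecule in known_actives:
            rediscovered.append(molecule)
        if predict(molecule) >= threshold:
            hits.append(molecule)
            if len(hits) >= n_hits:
                break  # stop consuming the stream once enough hits are found
    return hits, rediscovered


if __name__ == "__main__":
    # Toy stand-ins: molecules are strings and the "model" is string length.
    toy_stream = iter(["C", "CC", "CCC", "CCCC", "CCCCC"])
    hits, rediscovered = collect_hits(
        toy_stream, predict=len, threshold=3, n_hits=2, known_actives={"CC"}
    )
    print(hits, rediscovered)
```

Because the loop stops as soon as enough hits are collected, it works equally well with an endless generator as the stream.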

Source: https://github.com/drmatthewclark/divgen

The Elsevier Professional Services Group can perform projects like this, or more expansive projects such as using technologies like this to identify new leads or train AI models. We have teams in North America, Europe, and Asia ready to work with you to achieve your goals.

References

(Reymond 2012)

Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17 https://pubs.acs.org/doi/pdf/10.1021/ci300415d

(Lipkus 2008)

Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry https://doi.org/10.1021/jo8001276

(RDKit)

RDKit: Open-source cheminformatics. https://www.rdkit.org

Alexander Godfrey

Lead Consultant for Chemistry Automation at National Center for Advancing Translational Sciences (NCATS)

3 years ago

Fascinating developments! I’m curious when ‘patent singularity’ will be reached - the point when establishing composition of matter novelty is no longer a tenable or reasonable argument.


