Generating Diverse Molecules

In our AI projects for chemistry and retrosynthesis, we often need sets of diverse molecules: as comparators for measuring diversity, as training data for molecular transformers, for in-silico screening, and as general examples.

So we created a Python program that generates an arbitrary number of random, valid organic molecules within a given range of heavy-atom counts and from a given list of organic elements.

This molecule stream could be used with AI models such as the recently reported EquiBind from MIT, which can quickly evaluate molecules to estimate target binding.

Enumeration of structures has been used to create a large database (Reymond 2012) and to evaluate diversity (Lipkus 2008), and random generation has been reported using the commercial tool "Molgen". However, storing and managing large databases can be inconvenient. It can be more convenient to generate a stream "on demand" whenever a large number of diverse molecules is needed. The Reymond paper has been cited over 500 times, which suggests that the idea is widely useful.

Using modern Python and RDKit tools (RDKit), a similar end product can be achieved with a fairly small amount of code. Future steps may include adding a measure of synthesizability to score the compounds, for example by identifying whether a compound could be synthesized with a Suzuki coupling.

In the first step, a target number of atoms defining the molecule size is randomly drawn from the desired range. The approach is to start with a single carbon, then in successive steps randomly choose an existing atom and connect an additional carbon atom to it, building a carbon framework with the desired number of atoms.
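The framework-growing step can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the code from the divgen repository: the function name `grow_carbon_skeleton`, the bond-list representation, and the restriction to acyclic frameworks (no ring closure) are all hypothetical choices made for the example.

```python
import random

MAX_CARBON_VALENCE = 4  # a carbon atom can carry at most four single bonds


def grow_carbon_skeleton(min_atoms, max_atoms, rng=None):
    """Grow a random acyclic carbon framework (illustrative sketch only).

    Returns (atom_count, bonds), where bonds is a list of (i, j) index pairs
    that each stand for one single bond between carbon i and carbon j.
    """
    rng = rng or random.Random()
    # Step 1: draw the target size from the desired range (inclusive).
    target = rng.randint(min_atoms, max_atoms)

    degree = [0]  # degree[i] = number of bonds on atom i; start with one carbon
    bonds = []
    while len(degree) < target:
        # Only atoms with a free valence can accept another carbon.
        open_atoms = [i for i, d in enumerate(degree) if d < MAX_CARBON_VALENCE]
        parent = rng.choice(open_atoms)
        new_atom = len(degree)
        degree.append(1)
        degree[parent] += 1
        bonds.append((parent, new_atom))
    return len(degree), bonds


if __name__ == "__main__":
    n_atoms, skeleton = grow_carbon_skeleton(4, 8, random.Random(42))
    print(n_atoms, skeleton)
```

Because every new atom attaches to exactly one existing atom, the result is always a tree with one bond fewer than atoms. Generating rings would need an additional step that bonds two existing atoms, which this sketch leaves out.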

In the next steps, bonds are randomly chosen to be promoted to double or triple bonds, if the valence permits, and then atoms are randomly chosen to be changed to one of a list of heteroatoms, if the valence permits. Each of these steps is guided by Gaussian-distributed probabilities chosen to make the molecules somewhat similar to known molecules.
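The sketch below shows one way these two steps could look, again in plain Python and not taken from the divgen source. It makes several assumptions: the valence table, the function names, and the reading of "Gaussian-distributed" as drawing the number of substitutions from a normal distribution are ours. The ratios in the usage example are placeholders, not the measured Reaxys values.

```python
import random

# Assumed maximum valences for the elements used in this example.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1}


def used_valence(atom, bonds):
    """Sum of bond orders on one atom; bonds maps (i, j) -> bond order."""
    return sum(order for (i, j), order in bonds.items() if atom in (i, j))


def promote_bonds(elements, bonds, n_promotions, rng):
    """Try n_promotions random promotions (single -> double -> triple)."""
    if not bonds:
        return
    for _ in range(n_promotions):
        pair = rng.choice(sorted(bonds))
        i, j = pair
        # Promote only if both end atoms still have a free valence.
        if (bonds[pair] < 3
                and used_valence(i, bonds) < MAX_VALENCE[elements[i]]
                and used_valence(j, bonds) < MAX_VALENCE[elements[j]]):
            bonds[pair] += 1


def substitute_heteroatoms(elements, bonds, ratios, sigma, rng):
    """Swap random carbons for heteroatoms where the valence permits.

    For each element, the number of attempts is drawn from a Gaussian
    centred on ratio * atom_count (an assumption made for this sketch).
    """
    n = len(elements)
    for element, ratio in ratios.items():
        attempts = max(0, round(rng.gauss(ratio * n, sigma)))
        for _ in range(attempts):
            atom = rng.randrange(n)
            if (elements[atom] == "C"
                    and used_valence(atom, bonds) <= MAX_VALENCE[element]):
                elements[atom] = element


if __name__ == "__main__":
    rng = random.Random(7)
    elements = ["C"] * 6                          # a six-carbon chain
    bonds = {(i, i + 1): 1 for i in range(5)}
    promote_bonds(elements, bonds, 3, rng)
    substitute_heteroatoms(elements, bonds, {"N": 0.1, "O": 0.15}, 1.0, rng)
    print(elements, bonds)
```

Checking the valence before every change keeps the structure valence-valid at each step, so no repair pass is needed afterwards.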

To obtain these probabilities, we looked at the ~30 million molecules indexed in Reaxys and measured the proportion of heteroatoms to carbons across the entire set, so that the generator could be calibrated to resemble known organic molecules. This gave proportions that one can use in random generation. Some of the ratios computed from that analysis are:

  • N/C - 0.645
  • O/C - 0.744
  • Cl/C - 0.154

Using the heteroatom proportions from Reaxys, 4,727,855 of a sample of 5 million generated molecules (about 94.6%) pass the Lipinski rules of drug-likeness. The molecules are generated at about 1,100 molecules per second in a single thread, making the method feasible for automated predictions. In some cases, valence-valid but chemically infeasible bonds and structures are generated, but these are rare.
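For readers unfamiliar with the Lipinski rules, the check itself is simple once the four descriptors are known. The sketch below applies the standard rule-of-five thresholds to precomputed values. The `Descriptors` class and function names are hypothetical; in practice the descriptor values would come from a toolkit such as RDKit. How many violations were tolerated in the run described above is not stated, so the `max_violations` default here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of one molecule (hypothetical container)."""
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_lipinski(d, max_violations=0):
    """True if the molecule exceeds at most max_violations limits."""
    return lipinski_violations(d) <= max_violations


if __name__ == "__main__":
    # Made-up descriptor values, for illustration only.
    small = Descriptors(mol_weight=180.0, log_p=3.5, h_donors=1, h_acceptors=1)
    large = Descriptors(mol_weight=720.0, log_p=6.2, h_donors=2, h_acceptors=12)
    print(passes_lipinski(small), passes_lipinski(large))
```

Some authors allow one violation; passing `max_violations=1` gives that more lenient variant.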

A hash map of all generated molecules, keyed by their full InChI strings (not InChIKeys!), was kept to detect duplicates and to count how often a particular structure was generated. In addition, the conversion to InChI acted as a filter for invalid structures: any generated structure that RDKit cannot convert to an InChI string is discarded. Highly strained molecules were not filtered out, but could be removed in post-processing steps.
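The bookkeeping described above can be sketched as follows. This is an illustration rather than the project's actual code: `count_unique` and `rdkit_inchi` are hypothetical names, and the example assumes the structures arrive as SMILES strings, which the article does not state. The RDKit calls used (`Chem.MolFromSmiles`, `Chem.MolToInchi`) require an RDKit build with InChI support; the key function is pluggable so the counting logic can be tried without RDKit.

```python
from collections import Counter


def rdkit_inchi(smiles):
    """Return the InChI string for a SMILES, or None if conversion fails."""
    from rdkit import Chem  # lazy import: only needed when this key is used

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    inchi = Chem.MolToInchi(mol)
    return inchi or None  # treat an empty result as a failed conversion


def count_unique(structures, to_key=rdkit_inchi):
    """Count occurrences per canonical key and tally discarded structures.

    Structures whose key is None are treated as invalid and dropped,
    mirroring the use of InChI conversion as a validity filter.
    """
    counts = Counter()
    discarded = 0
    for structure in structures:
        key = to_key(structure)
        if key is None:
            discarded += 1
        else:
            counts[key] += 1
    return counts, discarded


if __name__ == "__main__":
    # Stand-in key function so the demo runs without RDKit installed.
    def demo_key(s):
        return None if s == "invalid" else s.upper()

    counts, discarded = count_unique(["cc", "CC", "invalid", "o"], demo_key)
    print(counts.most_common(), discarded)
```

The counter doubles as the frequency table: `counts.most_common()` lists the structures that were generated most often.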

All hydrocarbon isomers with six carbon atoms

The system was first tested on hydrocarbons. The structures shown are all of the unique hexanes that the algorithm generated at least 7 times. Some are generated much more often than others because there are multiple symmetric ways in which the structure can be assembled.


Hydrocarbons with six atoms and multiple bonds

When multiple bonds are added, the number of possible structures increases considerably.




When heteroatoms are assigned, the number of unique structures becomes very large. In principle, the algorithm should eventually generate all possible valid chemical structures containing the enumerated elements within the given range of atoms.



To test the algorithm, we filtered the stream to look for known drugs. The simple molecule propofol was the 667,536th molecule generated in the test run.

propofol

This project illustrates a simple way to make a continuous stream of diverse molecules for projects that seek to find novel structures. Possible applications include:

  • Training AI projects, particularly transformers
  • Drug screening
  • Docking

For example, if one had an accurate bioactivity prediction model for a given target, one could stream generated molecules through the model and collect a sample of novel structures that are predicted to have high affinity for that target. The process can then be validated by identifying any known active molecules generated, as in the propofol example.
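A minimal sketch of that workflow is shown below, using only the Python standard library. Everything here is hypothetical: `top_predicted`, `find_known`, and the `predict_affinity` callable stand in for a real generator and a real bioactivity model, and molecules are represented as plain strings for the sake of the example.

```python
import heapq
from itertools import islice


def top_predicted(stream, predict_affinity, n_molecules, keep):
    """Score the first n_molecules from the stream and keep the best ones.

    Returns a list of (score, molecule) pairs, highest score first.
    """
    scored = ((predict_affinity(m), m) for m in islice(stream, n_molecules))
    return heapq.nlargest(keep, scored)


def find_known(stream, known_actives, n_molecules):
    """Yield (position, molecule) for each known active seen in the stream.

    Position counts from 1, matching the "Nth generated molecule" wording.
    """
    for position, m in enumerate(islice(stream, n_molecules), start=1):
        if m in known_actives:
            yield position, m


if __name__ == "__main__":
    # Toy stand-ins: strings as "molecules", string length as the "model".
    molecules = ["CCO", "CCCCN", "C", "CCCl"]
    print(top_predicted(iter(molecules), len, n_molecules=4, keep=2))
    print(list(find_known(iter(molecules), {"CCCl"}, n_molecules=4)))
```

Because both functions consume the stream lazily, nothing needs to be stored except the current best candidates, which fits the "on demand" idea of generating molecules rather than managing a large database.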

source: https://github.com/drmatthewclark/divgen


The Elsevier Professional Services Group can perform projects like this, or more expansive projects such as using technologies like this to identify new leads or train AI models. We have teams in North America, Europe, and Asia ready to work with you to achieve your goals.

References

(Reymond 2012)

Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17 https://pubs.acs.org/doi/pdf/10.1021/ci300415d

(Lipkus 2008)

Structural Diversity of Organic Chemistry. A Scaffold Analysis of the CAS Registry https://doi.org/10.1021/jo8001276

(RDKit)

RDKit: Open-source cheminformatics. https://www.rdkit.org

Alexander Godfrey

Lead Consultant for Chemistry Automation at National Center for Advancing Translational Sciences (NCATS)

3 years ago

Fascinating developments! I’m curious when ‘patent singularity’ will be reached - the point when establishing composition of matter novelty is no longer a tenable or reasonable argument.
