Receptor.AI introduces 3DProtDTA, the next-generation drug-target interaction AI model
Accurate in silico prediction of drug-target interactions, in terms of affinity and related biological activity, is a keystone of modern drug discovery. Computational methods applied in the early stages of the drug development pipeline can speed it up and cut its cost significantly. However, this is only feasible if the predictions are accurate and robust enough; otherwise they may mislead the drug discovery team and send it in the wrong direction. That is why thorough validation of drug-target interaction models is critically important.
Receptor.AI is proud to present 3DProtDTA, its flagship AI model for structure-based drug discovery. After more than 2 years of development, the model is finally mature enough to be described in a scientific publication, which is currently under review. We have made it available on bioRxiv, so all interested customers and investors can evaluate our technology.
A wide range of machine learning approaches has recently been proposed for drug-target affinity (DTA) assessment. The most promising of them use deep learning techniques and graph neural networks to encode molecular structures. We are living in the golden age of structure-based drug discovery, which was initiated by the recent development of AlphaFold and similar protein structure prediction techniques. They made an unprecedented number of proteins without experimentally determined structures accessible for computational drug discovery. We exploit this wealth of structural data in our model by combining AlphaFold structures with a residue-level graph representation of proteins.
Although many proteins in our training dataset have known 3D structures deposited in the Protein Data Bank, we still use AlphaFold predictions for all proteins. This keeps our approach unified and avoids tedious pre-processing of experimentally determined structures, which are often incomplete, contain irrelevant crystallographic ligands, and so on. To avoid noise from the parts of proteins that have little or no relation to ligand binding, we parsed domain annotations from UniProt to determine the ligand binding sites. Only the domains that contain the binding site of interest are encoded into the graph representation and fed as input to the neural network.
The protein domain structures are converted into residue-level graphs, and their properties are encoded in the graph nodes and edges. We use 8 node features and 10 edge features, which describe the connectivity and physical nature of amino acids and their interactions within the protein structure. The ligands are also encoded as graphs, at the atomistic level, using their SMILES strings and several molecular fingerprints.
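As a rough illustration of the residue-level graph idea, the Python sketch below connects residues whose C-alpha atoms lie within a distance cutoff. The function name, the 8 Å cutoff and the toy coordinates are assumptions made for illustration only; the sketch does not reproduce the 8 node features and 10 edge features used in 3DProtDTA.

```python
import math


def residue_contact_graph(ca_coords, cutoff=8.0):
    """Build an undirected residue-level graph from C-alpha coordinates.

    ca_coords: list of (x, y, z) tuples, one per residue, in angstroms.
    cutoff: contact distance in angstroms (8.0 is an assumed value).
    Returns (edges, distances): index pairs (i, j) with i < j and the
    matching C-alpha distances, which could serve as one edge feature.
    """
    edges, distances = [], []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d <= cutoff:
                edges.append((i, j))
                distances.append(d)
    return edges, distances


# Toy four-residue chain along the x axis with 3.8 angstrom spacing.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
edges, distances = residue_contact_graph(coords)
print(edges)
```

In a real pipeline the coordinates would come from the AlphaFold structure of the selected domain, and each node and edge would carry additional physico-chemical descriptors.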
We use a unique approach to account for protein flexibility and dynamics implicitly: we compute the normal modes of the protein globule with an elastic network model, derive the correlation patterns of collective motions from them, and encode these patterns as node and edge features of the graph.
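To make the idea more concrete, here is a minimal sketch of how motion correlations can be obtained from an elastic network model. It uses the Gaussian network model (GNM), one of the simplest elastic network variants; the article does not say which variant 3DProtDTA uses, so the choice of GNM, the 7.3 Å cutoff and the function name are all assumptions.

```python
import numpy as np


def gnm_cross_correlations(ca_coords, cutoff=7.3):
    """Normalised residue cross-correlations from a Gaussian network model.

    ca_coords: (N, 3) array-like of C-alpha coordinates in angstroms.
    cutoff: contact distance in angstroms (assumed value).
    Assumes every residue has at least one contact.
    """
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Kirchhoff matrix: -1 for residues in contact, contact count on the diagonal.
    kirchhoff = -(dist <= cutoff).astype(float)
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))

    # Eigenvectors are the normal modes; zero eigenvalues carry no internal motion.
    evals, evecs = np.linalg.eigh(kirchhoff)
    keep = evals > 1e-8

    # Pseudo-inverse built from the non-zero modes is proportional to the covariance.
    cov = (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T

    # Normalise so that every residue has correlation 1 with itself.
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


# Toy five-residue chain along the x axis with 3.8 angstrom spacing.
toy_coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
corr = gnm_cross_correlations(toy_coords)
print(np.round(corr, 2))
```

Rows of the resulting matrix could be summarised into node features, and individual entries could be attached to the corresponding edges of the residue graph.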
The model itself is a graph neural network (GNN) that extracts features from the ligand and protein graphs and passes them to fully connected (FC) neural network layers.
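The overall data flow can be sketched in a few lines of NumPy. This is only a schematic with random weights: the single mean-aggregation layer, the mean-pooling readout, the layer sizes and every name in the code are assumptions, not the architecture described in the preprint.

```python
import numpy as np


def message_passing(node_feats, adjacency, weight):
    """One graph-convolution step: average each node with its neighbours, project, ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])     # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalise
    return np.maximum(a_hat @ node_feats @ weight, 0.0)


def predict_affinity(ligand, protein, params):
    """ligand and protein are (node_feats, adjacency) pairs; returns a scalar score."""
    lig_h = message_passing(ligand[0], ligand[1], params["w_lig"]).mean(axis=0)
    prot_h = message_passing(protein[0], protein[1], params["w_prot"]).mean(axis=0)
    joint = np.concatenate([lig_h, prot_h])             # fuse both graph embeddings
    hidden = np.maximum(joint @ params["w_fc1"] + params["b_fc1"], 0.0)
    return float(hidden @ params["w_fc2"] + params["b_fc2"])


def path_adjacency(n):
    """Adjacency matrix of a simple chain of n nodes (toy topology)."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
    return adj


rng = np.random.default_rng(0)
hidden_dim = 16  # assumed embedding size
params = {
    "w_lig": rng.normal(size=(4, hidden_dim)),    # 4 toy atom features
    "w_prot": rng.normal(size=(8, hidden_dim)),   # 8 residue node features
    "w_fc1": rng.normal(size=(2 * hidden_dim, hidden_dim)),
    "b_fc1": np.zeros(hidden_dim),
    "w_fc2": rng.normal(size=hidden_dim),
    "b_fc2": 0.0,
}
ligand = (rng.normal(size=(3, 4)), path_adjacency(3))
protein = (rng.normal(size=(5, 8)), path_adjacency(5))
score = predict_affinity(ligand, protein, params)
print(score)
```

Because the readout averages over nodes, the score does not depend on the order in which residues or atoms are listed, which is one reason graph encoders suit molecular data.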
We compared the results of our approach with seven different classical machine learning and deep learning methods that were considered state of the art at the time of writing.
The comparative tests were performed on two widely used benchmark datasets: Davis and KIBA. The Davis dataset contains pairs of kinase proteins and their respective inhibitors with experimentally determined dissociation constants. It comprises 442 proteins and 68 ligands. The KIBA dataset contains scores produced by the KIBA approach, in which inhibitor bioactivities from different sources, such as Ki, Kd and IC50, are combined into a single metric. It contains 229 unique proteins and 2111 unique ligands.
Our model outperforms all of its rivals on three common performance metrics, as the benchmark tables show.
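The article does not name the three metrics. Mean squared error (MSE) and the concordance index (CI) are commonly reported on Davis and KIBA, so the sketch below shows those two as an assumption about what such a comparison typically involves; the function names and toy values are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; tied predictions count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    return concordant / comparable


# Toy affinities (arbitrary units), not data from the benchmark.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 8.0, 7.0]
mse = mean_squared_error(measured, predicted)
ci = concordance_index(measured, predicted)
print(mse, ci)
```

Lower MSE is better, while a CI of 1.0 means every pair of compounds is ranked in the correct order and 0.5 corresponds to random ranking.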
In addition to this benchmark, we studied the ability of our technique to find existing highly active ligands of well-known proteins among a large number of inactive compounds. This experiment mimics real-world virtual screening, where active compounds have to be identified in a large chemical space.
We have chosen 8 well-known proteins that are routinely used in benchmarking and comparison of the drug discovery techniques: Carbonic anhydrase II (CA2), Androgen receptor (AR), Cathepsin D (Cath-D), Beta-secretase 1 (BACE1), Janus kinase 1 (JAK1), Cyclin-dependent kinase 2 (CDK2), Matrix metallopeptidase 12 (MMP12) and Casein kinase 2 (CK2).
We identified known ligands of these proteins with the most reliable activity estimates (9-16 depending on the protein) and mixed them with a large number of inactive decoys (9-32K depending on the protein). Our model was then applied to the whole set of compounds and ranked them by predicted activity. The number of known active ligands among the top 20 and top 100 ranked compounds is shown in Table 6 and visualised in Fig. 2.
Our model correctly prioritises most of the known ligands and places them at the very top of the ranked list for all studied proteins.
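The counting step of this experiment is simple enough to sketch in Python. The function name and the toy scores are hypothetical; real runs would use the predicted activities for thousands of decoys per protein.

```python
def actives_in_top_k(scores, is_active, k):
    """Count known actives among the k highest-scoring compounds."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if is_active[i])


# Toy library: six compounds, three of them known actives.
toy_scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2]
toy_labels = [True, False, False, True, True, False]
hits_top3 = actives_in_top_k(toy_scores, toy_labels, k=3)
print(hits_top3)
```

Calling the same function with k=20 and k=100 on the full ranked list gives the two numbers reported per protein.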
We have also performed a systematic study of the optimal choice of ligand features, the latent representation of data in the GNN and the performance of different GNN architectures. The results of this tedious but very useful work are presented in the preprint and may be of interest to experts.
Thus, our 3DProtDTA model is an excellent foundation and one of the core technologies in the Receptor.AI drug discovery pipeline. It works at the primary screening stage and facilitates proteome-wide selectivity prediction. It is also used for pre-processing and clustering of huge chemical spaces, creating focused chemical libraries and assessing the quality of compound libraries provided by customers. Finally, it is a component of the aggregated consensus score, which allows compounds to be ranked by a single metric in the platform.
The distinctive feature of this model is the graph-based representation of both the protein and the ligands, which retains a significant amount of information about their connectivity and spatial arrangement without introducing an excessive computational burden.
Using the AlphaFold database of predicted protein structures allows us to cover an unprecedented part of the protein universe, including almost the entire human proteome. Our commercial in-house version of 3DProtDTA is trained on over 7M protein-ligand pairs with known activity.
We also tuned a wide range of GNN-based model architectures and their combinations to achieve the best model performance. As a result, the model is fast enough to let us run virtual screening of enormous chemical spaces containing billions of compounds in less than one day.
Receptor.AI improves its technologies on a rolling basis, and the customers of our drug discovery platform always get the latest state-of-the-art set of tools. In the few weeks since the submission of this paper, we have deployed the next version of the platform, which features an even more advanced release of 3DProtDTA.
In the near future we also plan to prepare scientific publications about other pieces of our core technologies, so stay tuned and contact us if you want to know more.