AI for Cancer Therapeutics: Machine Learning & Biomolecular Modelling of Binding kinetics of CAR-T Cells to Hematologic Neoplastic Cells


The quote was taken from the Narrative Economics course on Coursera, with permission from Professor Robert J. Shiller.


The manufacturing of CAR-T cells begins with obtaining a blood sample from the patient. T lymphocytes are purified from the sample and then artificially activated by presenting them with an antigen of interest using magnetic beads, as illustrated in Fig 1 below.

Fig 1


Upon activation, the cells are genetically engineered by inserting a foreign gene into their genome. The integrated transgene encodes a receptor protein that is expressed on the cell surface so that it can bind to antigens on cancer cells. At the manufacturing facility, the transgenic cells are washed and filled into bags before being shipped to the hospital, where they are administered back to the patient from whom they originated. In the blood circulation, the transgenic cells remove neoplastic cells by binding to their surface antigens, which triggers a cascade of events that eventually leads to the destruction of the neoplastic cells (Fig 1-1).

Fig 1-1

Selection of a suitable target antigen for CAR-T therapy is imperative to increase the probability of successful binding of the receptor protein to the antigen of interest. Another important selection criterion is the strength of binding of the receptor protein to the antigen on neoplastic cells. This aspect of selection is discussed at length in this article. The following diagram shows some of the promising target antigens for CAR-T therapy (Fig 1-2).

Fig 1-2


Biomolecular Kinetics Modelling of CAR-T Cells Binding

Binding Mechanism of CAR-T Receptor and Cancer Antigen

Studies have found that the coupling of CAR-T cells (denoted C) and neoplastic cells (denoted L) to form the C.L complex is not permanent. The C.L complex may dissociate into free C and L cells, which can then form C.L again. This phenomenon is similar to a reversible reaction in chemistry (Fig 2-1).

Fig 2-1

If C and L remain in the circulation long enough, the system reaches equilibrium. Equilibrium is the state in which the concentrations of free C, free L and the C.L complex no longer change over time, because association and dissociation proceed at the same rate. The kinetics of such a reversible reaction can be described mathematically using the law of mass action, while the Le Chatelier principle describes how the equilibrium shifts when a concentration is changed. The question now is how we know whether C and L are in equilibrium. One way to determine this is to conduct laboratory experiments that measure the fraction of C.L complex formed as different concentrations of C are added. The Dissociation Constant, Kd, at equilibrium can then be determined from the experimental data using curve-fitting techniques. Besides the Dissociation Constant, the Association Constant, Ka (the reciprocal of Kd), can be used. In this article, Kd will be used as the basis for calculation. The derivation of the design equation is illustrated in the following diagram:

[C]: concentration of free CAR-T cells

[L]: concentration of free Leukemia/Lymphoma cells

[C.L]: concentration of coupled cells

Fig 2-2 Le Chatelier principles for binding kinetics
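The derivation shown in Fig 2-2 cannot be recovered from the text, so what follows is a standard textbook reconstruction rather than the figure itself. It assumes simple one-to-one reversible binding of C and L. The rate constants k_on and k_off and the symbol f for the bound fraction of L are notation introduced here for illustration.

```latex
\begin{align}
k_{on}\,[C][L] &= k_{off}\,[C.L]
  && \text{(association and dissociation rates balance at equilibrium)} \\
K_d &= \frac{k_{off}}{k_{on}} = \frac{[C][L]}{[C.L]} \\
f &= \frac{[C.L]}{[L] + [C.L]}
   = \frac{[C][L]/K_d}{[L] + [C][L]/K_d}
   = \frac{[C]}{K_d + [C]}
\end{align}
```

Under these assumptions, half of L is bound when [C] equals Kd, which is why Kd can be read off a binding curve as the concentration at half saturation.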


I present here two schools of thought on the justification of the above design equation.

Fig 2-3

The second school of thought is adopted here because it is theoretically sound and practically plausible. Furthermore, future development of CAR-T technologies favors the latter. As a consequence, whatever total concentration of C is added, it is close to the concentration of free C:

[C]total ≈ [C]
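One condition under which the total concentration of C stays close to the free concentration is when C is in large excess over L. The sketch below checks this numerically by solving the exact mass-balance quadratic for [C.L]. The function names and all numbers are hypothetical and in arbitrary units; they are not taken from the article's data.

```python
import math


def complex_concentration(c_total, l_total, kd):
    """Equilibrium [C.L] from the mass balances and Kd = [C][L]/[C.L].

    Substituting [C] = c_total - x and [L] = l_total - x gives
    x^2 - (c_total + l_total + kd) * x + c_total * l_total = 0;
    the smaller root is the physically meaningful one.
    """
    s = c_total + l_total + kd
    return (s - math.sqrt(s * s - 4.0 * c_total * l_total)) / 2.0


def free_fraction_of_c(c_total, l_total, kd):
    """Fraction of the added C that is still unbound at equilibrium."""
    return (c_total - complex_concentration(c_total, l_total, kd)) / c_total


# Hypothetical values: fixed amount of L, increasing amounts of C.
kd = 10.0
l_total = 1.0
for c_total in (1.0, 10.0, 100.0, 1000.0):
    print(c_total, round(free_fraction_of_c(c_total, l_total, kd), 4))
```

In this toy setting the unbound fraction of C approaches 1 as C is added in increasing excess, which is the regime in which the simplified design equation can be used with total C in place of free C.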


Biomolecular Method to Determine Kd

Surrogate Datasets: Imatinib and BCR-ABL Complex Binding

I have used two datasets obtained from the MIT course 7.QBWx Quantitative Biology Workshop to determine the Equilibrium Constant, as a surrogate for CAR-T binding. The first dataset contains measurements of the fraction of BCR-ABL bound by the small-molecule drug Imatinib.

Imatinib is used to treat hematologic neoplasms such as chronic myelogenous leukemia. It is a Tyrosine Kinase inhibitor (TKI) marketed by Novartis under the names Gleevec and Glivec. The second dataset contains measurements of BCR-ABL-Imatinib binding obtained using a fluorescence technique. Binding of Imatinib to BCR-ABL increases the intensity of fluorescence picked up by the spectrophotometer.

Fig 3-1

At a glance, the data points in the first dataset appear to start from the origin (0,0), while those in the second dataset appear to start at an intensity close to 500 fluorescence units (FU).

Aside: Pathogenesis of Chronic Myelogenous Leukemia

A piece of chromosome 9 and a piece of chromosome 22 break off and trade places. The BCR-ABL gene is formed on chromosome 22 where the piece of chromosome 9 attaches. The changed chromosome 22 is called the Philadelphia chromosome (National Cancer Institute).

Fig 3-2


Imatinib works by binding close to the ATP binding site of BCR-ABL. This blocks the enzyme activity of the protein semi-competitively.

Fig 3-3 Source: https://en.wikipedia.org/wiki/Imatinib#/media/File:Mechanism_imatinib.svg

Calculation of Kd using MATLAB

MATLAB was used to compute Kd values from the datasets. For the first dataset, Equation 6 from Figure 3-1 was used as the fitting equation. The data was read and converted into a matrix. The first column was extracted as variable x and the second column as variable y. The MATLAB curve-fitting function fittype was used to define the model equation. This model was then fitted to the dataset. The following shows the MATLAB code and its outcome:

Fig 3-4
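The MATLAB code in Fig 3-4 is not recoverable here, so the following is a minimal Python sketch of the same idea, not the article's code. It assumes the fitting equation is the single-site form y = x / (Kd + x) and finds the Kd that minimizes the sum of squared errors with a golden-section search. The function names, the search bounds and the synthetic data are all hypothetical; in practice a library routine such as SciPy's `curve_fit` would usually be preferred.

```python
def bound_fraction(x, kd):
    """Assumed single-site binding model: fraction bound at concentration x."""
    return x / (kd + x)


def sse(kd, xs, ys):
    """Sum of squared errors between the data and the model for a given Kd."""
    return sum((y - bound_fraction(x, kd)) ** 2 for x, y in zip(xs, ys))


def fit_kd(xs, ys, lo, hi, tol=1e-8):
    """Golden-section search for the Kd in [lo, hi] that minimizes the SSE."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while abs(b - a) > tol * (abs(a) + abs(b)):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sse(c, xs, ys) < sse(d, xs, ys):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Synthetic, noise-free data generated from a hypothetical Kd of 2.5.
xs = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
ys = [bound_fraction(x, 2.5) for x in xs]
print(fit_kd(xs, ys, lo=1e-3, hi=1e3))
```

A one-dimensional search is enough here because the model has a single free parameter; it relies on the error having one minimum inside the chosen bounds.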

Due to the apparent intercept on the y-axis in the second dataset, a constant denoted b was added to the equation. Another practical consideration, concerning the measuring instrument, was also included. In the laboratory, no two measurement instruments are absolutely identical (MIT 7.QBWx). Hence, another constant, denoted a, which represents an inherent property of the instrument, was applied to the equation's binding term. As a result, three coefficients had to be determined: a, b and k. The following diagram shows the outcome produced by MATLAB:

Fig 3-5
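As a rough illustration of a three-coefficient fit, the sketch below assumes the model has the form y = b + a * x / (k + x), which is one reading of the description above. For any fixed k the model is linear in a and b, so those two are solved by ordinary least squares and only k is searched, first on a coarse log-spaced grid and then by golden-section refinement. This is a swapped-in technique, not the MATLAB routine used in the article; all function names are hypothetical, and the data are synthetic points generated from the coefficient values reported in the article.

```python
import math


def linear_fit(us, ys):
    """Ordinary least squares for y = b + a*u; returns (a, b, sse)."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = suy / suu
    b = mean_y - a * mean_u
    sse = sum((y - (b + a * u)) ** 2 for u, y in zip(us, ys))
    return a, b, sse


def profile(k, xs, ys):
    """Best (a, b, sse) when the dissociation constant is fixed at k."""
    return linear_fit([x / (k + x) for x in xs], ys)


def fit_binding_curve(xs, ys, k_lo, k_hi, grid=200, iters=60):
    """Fit y = b + a*x/(k + x) by searching over k only."""
    step = (math.log(k_hi) - math.log(k_lo)) / (grid - 1)
    logs = [math.log(k_lo) + i * step for i in range(grid)]
    # Coarse pass: pick the grid point with the smallest profiled error.
    best = min(range(grid), key=lambda i: profile(math.exp(logs[i]), xs, ys)[2])
    lo, hi = logs[max(best - 1, 0)], logs[min(best + 1, grid - 1)]
    # Fine pass: golden-section search in log(k) between the neighbours.
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if profile(math.exp(c), xs, ys)[2] < profile(math.exp(d), xs, ys)[2]:
            hi = d
        else:
            lo = c
    k = math.exp((lo + hi) / 2)
    a, b, _ = profile(k, xs, ys)
    return a, b, k


# Synthetic, noise-free points generated from the article's reported values.
true_a, true_b, true_k = 2280.0, 451.7, 1.438e4
xs = [0.0, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5]
ys = [true_b + true_a * x / (true_k + x) for x in xs]

a, b, k = fit_binding_curve(xs, ys, k_lo=1e2, k_hi=1e6)
print(a, b, k)
print(b + a / 2)  # predicted signal at x = k (half saturation)
```

Profiling out the two linear coefficients keeps the search one-dimensional, which avoids the need for starting guesses for a and b.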


The value of Kd was found to be 1.438 x 10^4, and the curve intercepted the y-axis at b = 451.7. The instrument constant a was found to be 2280. The intercept on the y-axis represents background fluorescence: before any binding takes place, the dye already emits some fluorescence. Interestingly, the curve reads roughly 1600 FU at the concentration equal to Kd. This is what the fitted model predicts if a scales the binding term, i.e. y = b + a*x/(Kd + x): at x = Kd the binding term equals a/2, so y = b + a/2 = 451.7 + 1140 ≈ 1592 FU.

Machine Learning Approach

There are generally two machine learning approaches used to predict Drug-Target Interactions. One is binary classification, which determines whether an interaction exists for a given drug-target pair. The other is regression, which estimates a continuous value that indicates a drug's ability to bind to the target of interest. This ability to bind is also called Binding Affinity. Many of these methods rely on three-dimensional (3D) structural information about the targets, which is still scarce at the time of this writing. To circumvent this limitation, I have used a recently developed graph-based representation learning technique called Affinity2Vec (Thafar et al., 2022), published in Scientific Reports. The authors constructed a weighted heterogeneous graph that integrates data from several sources, including drug-drug similarity, target-target similarity, and Drug-Target binding affinities and equilibrium constants.

Data Processing for Machine Learning

Two datasets were provided by the authors on GitHub to benchmark Affinity2Vec: the KIBA set and the Davis set. I have used the Davis set to build machine learning regression models that predict the Dissociation Constant (Kd) and Binding Affinity for Drug-Target pairs.

Several variants of each target were created, as follows:

  • Affinity Scores – (1) Logarithm Base 10, (2) Normalized and (3) Exponential
  • Kd – (1) Original values and (2) Natural Logarithm

The following shows snippets of the Python code that I developed to process and assemble the data. The logic in the code largely follows that recommended by Thafar et al.

Drug IDs and Protein IDs were the first to be retrieved. There were 68 unique Drug IDs and 442 unique Protein IDs. Taking all Drug – Target combinations resulted in a total of 30,056 Drug – Target pairs (68 x 442).


Fig 4-1
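The code in Fig 4-1 is not available in the text, so here is a minimal sketch of how every drug can be paired with every protein. The ID strings are hypothetical placeholders; the real identifiers come from the Davis dataset files.

```python
from itertools import product

# Hypothetical placeholder IDs standing in for the real dataset identifiers.
drug_ids = [f"DRUG_{i:02d}" for i in range(68)]
protein_ids = [f"PROT_{j:03d}" for j in range(442)]

# Cartesian product: every drug paired with every protein.
pairs = list(product(drug_ids, protein_ids))
print(len(pairs))  # 68 * 442 = 30056
```

Because `product` iterates the second list fastest, the pairs come out grouped by drug, which matches a row-by-row reading of a drug-by-protein matrix.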

The Equilibrium Constant, Kd, data was read as a NumPy object:

Fig 4-2
Fig 4-3
Fig 4-4

Each unique Protein ID was indexed for later use.

Fig 4-5

This step is similar to the first one: it produces the data frame that stores the Drug – Target pairs together with their Equilibrium Constants, Kd, as labels.

Fig 4-6
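The indexing and labelling steps can be sketched in plain Python as follows. This is an illustration, not the code from the figures: the toy IDs and Kd values are made up, the drug-by-protein matrix layout is an assumption, and the `KD` column name is hypothetical (`DRUG_ID` and `PROTEIN_NAME` are the key names used later in the article).

```python
# Hypothetical toy inputs standing in for the dataset files.
drug_ids = ["D1", "D2"]
protein_ids = ["P1", "P2", "P3"]
kd_matrix = [  # assumed layout: rows follow drug_ids, columns follow protein_ids
    [10000.0, 43.0, 10000.0],
    [3.1, 10000.0, 790.0],
]

# Index each unique protein ID so its matrix column can be looked up later.
protein_index = {pid: j for j, pid in enumerate(protein_ids)}

# Flatten the matrix into one labelled row per Drug-Target pair.
rows = [
    {"DRUG_ID": d, "PROTEIN_NAME": p, "KD": kd_matrix[i][protein_index[p]]}
    for i, d in enumerate(drug_ids)
    for p in protein_ids
]

for row in rows:
    print(row)
```

Keeping the label next to its pair in a single row makes the later column additions and joins straightforward.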

Binding Affinity values were retrieved from a pickle file. Three variants of Affinity were computed: logarithm base 10, normalized and exponential.

Fig 4-7
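The exact definitions of the three Affinity variants are in the missing figure, so the sketch below shows one plausible interpretation only: a base-10 logarithm, a min-max normalization to the range 0 to 1, and a natural exponential. The function name, column names and input values are hypothetical.

```python
import math


def affinity_variants(values):
    """Return one dict per value with three assumed transformations.

    Requires strictly positive values (for log10) that are not all equal
    (for min-max normalization).
    """
    lo, hi = min(values), max(values)
    return [
        {
            "AFF": v,
            "AFF_LOG10": math.log10(v),
            "AFF_NORM": (v - lo) / (hi - lo),
            "AFF_EXP": math.exp(v),
        }
        for v in values
    ]


rows = affinity_variants([5.0, 6.2, 7.5, 10.0])  # hypothetical affinity scores
for row in rows:
    print(row)
```

The three variants preserve the ordering of the scores but change their spread, which is what a regression model ends up being sensitive to.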

These variants were added as new columns to the data frame.

Fig 4-8

Two variants of Equilibrium Constants, Kd were also computed and added as new columns to the data frame.

Fig 4-9
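A small sketch of the natural-logarithm variant of Kd, using made-up values, shows why the transformation can matter: values spanning several orders of magnitude are compressed into a narrow range, and predictions can be mapped back with the exponential. The variable names are hypothetical.

```python
import math

kd_values = [0.5, 43.0, 790.0, 10000.0]  # hypothetical Kd values

# Natural-log variant used as an alternative regression target.
ln_kd = [math.log(kd) for kd in kd_values]

# Predictions made on the log scale can be mapped back with exp().
recovered = [math.exp(v) for v in ln_kd]

span_raw = max(kd_values) / min(kd_values)  # ratio of largest to smallest
span_log = max(ln_kd) - min(ln_kd)          # same spread on the log scale

print(span_raw, round(span_log, 2))
```

The log spread is simply the logarithm of the raw ratio, so a wide multiplicative range becomes a modest additive one.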

Drug-Target Combinations obtained from the first step were added to the data frame.

Fig 4-10

Targets for Predictions

A total of five targets were created, split into two groups: (1) Binding Affinity and (2) Equilibrium Constant, Kd.

Fig 4-11

An ML model was trained for each of the above targets, and its performance was measured on the Test Set.

Machine Learning Training

Before training an ML model, the final dataset was split into three parts: 75% training set, 20% test set and 5% held out as unseen data. Driverless AI, a commercially available automated ML platform from H2O.ai, was used to train the models.
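A three-way split of this kind can be sketched as below. The article does not say how the split was performed, so the random shuffle, the fixed seed and the function name are assumptions made for illustration.

```python
import random


def split_dataset(rows, train=0.75, test=0.20, seed=42):
    """Shuffle the rows and cut them into train, test and unseen parts.

    Whatever is left after the train and test shares becomes the unseen part.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


# Hypothetical stand-in for the assembled dataset: 1000 row indices.
train_set, test_set, unseen_set = split_dataset(list(range(1000)))
print(len(train_set), len(test_set), len(unseen_set))  # 750 200 50
```

Holding the last part back entirely, rather than using it during model selection, is what lets it act as unseen data.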



Performance of ML Model

Driverless AI comes bundled with a number of ML algorithms such as Decision Tree, Generalized Linear Model, LightGBM and XGBoost. During training, different combinations of features from the data were automatically used to train intermediate models. The performance of these models was evaluated internally over many iterations until the best model was found. Once the ML model was complete, the Test Set was used to evaluate its performance using the following metrics:

Fig 4-13
Fig 4-14
Fig 4-15
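For readers unfamiliar with the metrics reported in this article (R2, MAPE, MSE and RMSE), the sketch below computes them from their textbook definitions. The platform's own definitions may differ in detail, for example in how R2 is calculated, so this is an illustration rather than a reproduction of its scorers. The function name and the numbers are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Textbook R2, MAPE (in percent), MSE and RMSE for paired values.

    MAPE divides by the actual values, so they must be non-zero.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


metrics = regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(metrics)
```

Because MAPE divides each error by the actual value, small actual values inflate it sharply: a single pair with an actual value of 1 and a prediction of 61 contributes a 6,000% error on its own.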

First, consider the Equilibrium Constant, Kd, predictions. Model performance on the raw Kd values was abysmal, with a MAPE of over 6000% and equally poor values for the other metrics. After transforming Kd with the natural logarithm, performance improved tremendously: MAPE fell to 31%, with MSE and RMSE both close to 2.

As for Binding Affinity, all three variants of this target showed comparable outcomes, with the exponentially transformed variant scoring best. The normalized variant scored poorly, at 33%, relative to the other two variants.

In conclusion, appropriate feature engineering and target transformation are crucial for training machine learning models that perform and generalize well.

Performance on Unseen Data

The regression model was used to predict Kd for a total of 752 UNSEEN data points. The following screenshot shows its performance in H2O Driverless AI:

  • R2: 0.58
  • MAPE: 25%
  • MSE: 1.9
  • RMSE: 1.4

Fig 4-16


The resulting performance on predicting Binding Affinity:

  • R2: 0.58
  • MAPE: 5%
  • MSE: 0.3
  • RMSE: 0.6

Fig 4-17 Model performance on Affinity predictions


The two models above show results comparable to their Test Set performance. In an actual drug discovery setting, the UNSEEN data could come from experiments that identify surface proteins as disease biomarkers. Curated proteins from past experiments are also ideal candidates to be screened by ML models as potential targets.

Protein and Target Selection

Prediction results of UNSEEN data for Kd and Affinity were combined using DRUG_ID and PROTEIN_NAME as join keys.

Fig 4-18
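A join on the two keys can be sketched in plain Python as follows; the article's own code worked on a data frame, so this is only an illustration. `DRUG_ID` and `PROTEIN_NAME` are the key names given in the article, while `KD_PRED`, `AFF_PRED`, the function name and the values are hypothetical.

```python
# Hypothetical prediction tables, one row per Drug-Target pair.
kd_preds = [
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P1", "KD_PRED": 4.2},
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P2", "KD_PRED": 9.1},
]
aff_preds = [
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P2", "AFF_PRED": 5.3},
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P1", "AFF_PRED": 7.8},
]


def join_predictions(left, right, keys=("DRUG_ID", "PROTEIN_NAME")):
    """Inner join of two lists of dicts on the given key columns."""
    lookup = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = lookup.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**row, **match})
    return joined


combined = join_predictions(kd_preds, aff_preds)
for row in combined:
    print(row)
```

Joining on both keys, rather than relying on row order, keeps each Kd prediction aligned with the Affinity prediction for the same pair even when the two tables are sorted differently.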

Predicted values of Kd and Affinity were standardized to the range 0 – 1.

Fig 4-19

In pursuit of the optimal Drug – Target pairs, I set a search criterion: pairs with the lowest possible Kd and the highest possible Affinity. To achieve this, a new column, KD_AFF_DIFF, was created by subtracting the standardized Kd from the standardized Affinity. The data was then sorted in descending order of this difference.

Fig 4-20
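The rescaling and ranking steps can be sketched as below, assuming the standardization is a min-max rescaling to the range 0 to 1. `KD_AFF_DIFF` is the column name from the article; the other column names, the function names and the values are hypothetical.

```python
def min_max(values):
    """Rescale values linearly to the range 0 to 1 (values must not all be equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_pairs(rows):
    """Add rescaled columns and KD_AFF_DIFF, then sort best-first."""
    kd_std = min_max([r["KD_PRED"] for r in rows])
    aff_std = min_max([r["AFF_PRED"] for r in rows])
    ranked = [
        {**r, "KD_STD": k, "AFF_STD": a, "KD_AFF_DIFF": a - k}
        for r, k, a in zip(rows, kd_std, aff_std)
    ]
    # Highest difference first: high Affinity combined with low Kd.
    return sorted(ranked, key=lambda r: r["KD_AFF_DIFF"], reverse=True)


# Hypothetical combined predictions.
rows = [
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P1", "KD_PRED": 4.0, "AFF_PRED": 8.0},
    {"DRUG_ID": "D1", "PROTEIN_NAME": "P2", "KD_PRED": 9.0, "AFF_PRED": 5.0},
    {"DRUG_ID": "D2", "PROTEIN_NAME": "P1", "KD_PRED": 6.5, "AFF_PRED": 6.5},
]
top = rank_pairs(rows)
for row in top:
    print(row["DRUG_ID"], row["PROTEIN_NAME"], row["KD_AFF_DIFF"])
```

Rescaling both columns first puts them on the same footing, so the difference ranges from -1 (worst on both criteria) to 1 (best on both).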

Analysis of Top 5 Predictions

The top five pairs obtained were further analyzed using information from PubChem.

Fig 4-21

Machine Learning Workflow for Discovery

Fig 5-1
