AI for Cancer Therapeutics: Machine Learning & Biomolecular Modelling of Binding Kinetics of CAR-T Cells to Hematologic Neoplastic Cells
Jong Hang Siong
I founded OTONOCO in Singapore to design and build SaaS and mobile apps that incorporate generative and agentic AI to solve complex problems in the industry.
The manufacturing of CAR-T cells begins with obtaining blood samples from a patient. T-lymphocytes are purified from the samples. These cells are then artificially activated by presenting them with an antigen of interest using magnetic beads, as illustrated in Fig 1 below.
Upon activation, they are genetically engineered by inserting a foreign gene into their genome. The integrated transgene produces a receptor protein that is expressed on the cell surface so that it can bind to antigens on cancer cells. At the manufacturing facility, the transgenic cells are washed and filled into bags before being shipped to the hospital, where they are administered back to the patient from whom they originated. In the blood circulation, the transgenic cells remove neoplastic cells by binding to their surface antigens, which triggers a cascade of events that eventually leads to the destruction of the neoplastic cells (Fig 1-1).
Selecting a suitable target antigen for CAR-T therapy is imperative to increase the probability that the receptor protein binds successfully to the antigen of interest. Another important selection criterion is the strength of binding of the receptor protein to the antigen on neoplastic cells. This aspect of selection is discussed at length in this article. The following diagram shows some of the promising target antigens for CAR-T therapy (Fig 1-2).
Biomolecular Kinetics Modelling of CAR-T Cells Binding
Binding Mechanism of CAR-T Receptor and Cancer Antigen
Studies have found that the coupling of CAR-T cells (denoted C) and neoplastic cells (denoted L) to form the C.L complex is not permanent. The C.L complex may dissociate into free C and L cells before forming C.L again. This phenomenon is similar to a reversible reaction in chemistry (Fig 2-1).
If C and L remain in the circulation long enough, the system reaches equilibrium. Equilibrium is the state at which the concentrations of C, L and C.L no longer change over time, because association and dissociation proceed at equal rates. Reversible reactions such as this can be described mathematically using Le Chatelier's principle and the law of mass action. The question now is how we know whether C and L are in equilibrium. One way to determine this is to conduct laboratory experiments that measure the fraction of C.L complex formed when different concentrations of C are added. The dissociation constant, Kd, at equilibrium can then be determined from the experimental data using curve-fitting techniques. Besides the dissociation constant, the association constant, Ka, can be used. In this article, Kd is used as the basis for calculation. The derivation of the design equation following Le Chatelier's principle is illustrated in the following diagram:
[C]: concentration of free CAR-T cells
[L]: concentration of free leukemia/lymphoma cells
[C.L]: concentration of coupled cells
I present here 2 schools of thought on the justification of the above design equation.
The second school of thought is adopted here because it is theoretically sound and practically plausible. Furthermore, future development of CAR-T technologies favors it. As a consequence, whatever total concentration of C we add, it is close to the concentration of free [C]:
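In symbols, this assumption reads [C]total ≈ [C]free. Combined with the definition Kd = [C][L]/[C.L], the fraction of L that is bound, [C.L]/([L] + [C.L]), simplifies to [C]/(Kd + [C]). The short Python sketch below evaluates this single-site expression. The function name and the example values are illustrative assumptions, not taken from the original figures.

```python
def fraction_bound(c_free: float, kd: float) -> float:
    """Fraction of target cells bound at equilibrium for a single-site model."""
    if c_free < 0 or kd <= 0:
        raise ValueError("c_free must be >= 0 and kd must be > 0")
    return c_free / (kd + c_free)


kd = 2.0  # hypothetical Kd, in the same units as the concentrations
for c in (0.0, 2.0, 20.0):
    print(f"[C] = {c:5.1f} -> fraction bound = {fraction_bound(c, kd):.3f}")
```

When [C] equals Kd, exactly half of the target is bound, which is why Kd can be read off a binding curve at its half-maximum.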
Biomolecular Method to Determine Kd
Surrogate Datasets: Imatinib and BCR-ABL Complex Binding
I have used 2 datasets obtained from the MIT course 7.QBWx Quantitative Biology Workshop to determine the equilibrium constant as a surrogate for CAR-T binding. The first dataset contains measurements of the fraction of BCR-ABL bound by the therapeutic drug imatinib.
Imatinib is used to treat hematologic neoplasms such as chronic myelogenous leukemia. It is a tyrosine kinase inhibitor (TKI) marketed by Novartis under the names Gleevec and Glivec. The second dataset contains measurements of the BCR-ABL-imatinib binding fraction obtained using a fluorescence technique. Binding of imatinib to BCR-ABL increases the fluorescence intensity picked up by the spectrophotometer.
At a glance, the data points in the first dataset appear to start from the origin (0, 0), while those in the second dataset appear to start at close to 500 fluorescence units (FU) of intensity.
Aside: Pathogenesis of Chronic Myelogenous Leukemia
A piece of chromosome 9 and a piece of chromosome 22 break off and trade places. The BCR-ABL gene is formed on chromosome 22 where the piece of chromosome 9 attaches. The changed chromosome 22 is called the Philadelphia chromosome (National Cancer Institute).
Imatinib works by binding close to the ATP binding site of BCR-ABL. This blocks the enzyme activity of the protein semi-competitively.
Calculation of Kd using MATLAB
MATLAB was used to compute Kd values from the datasets. For the first dataset, Equation 6 from Figure 3-1 was used as the fitting equation. The data were read and converted into a matrix. The first column was extracted as variable x and the second column as variable y. The MATLAB curve-fitting function fittype was used to define the model, which was then fitted to the dataset. The following shows the MATLAB code and outcome:
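For readers without MATLAB, the same kind of one-parameter fit can be sketched in Python with scipy.optimize.curve_fit. This is an analogue, not the original code. It assumes the fitting equation has the single-site form y = x / (Kd + x), and it uses synthetic stand-in data rather than the MIT dataset.

```python
import numpy as np
from scipy.optimize import curve_fit


def binding_model(x, kd):
    """Single-site binding: fraction bound as a function of concentration x."""
    return x / (kd + x)


# Synthetic stand-in data (NOT the MIT dataset): true Kd = 5.0 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 50.0, 25)
y = binding_model(x, 5.0) + rng.normal(0.0, 0.01, x.size)

# p0 is the initial guess for Kd; the bounds keep Kd positive during the search.
popt, pcov = curve_fit(binding_model, x, y, p0=[1.0], bounds=(1e-9, np.inf))
kd_fit = popt[0]
kd_se = np.sqrt(pcov[0, 0])  # standard error from the covariance estimate
print(f"fitted Kd = {kd_fit:.3f} (standard error {kd_se:.3f})")
```

With real measurements, x and y would be the two columns of the dataset, loaded for example with numpy.loadtxt.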
Because of the apparent y-axis intercept in the second dataset, a constant denoted b was added to the equation. A practical consideration concerning the measuring instrument was also included: in the laboratory, no two measurement instruments are absolutely identical (MIT 7.QBWx). Hence, another constant denoted a, which represents an inherent property of the instrument, was added to the equation's binding term. As a result, 3 coefficients had to be determined: a, b and k. The following diagram shows the outcome produced by MATLAB:
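A three-coefficient fit of this kind can be sketched in Python as follows. The sketch assumes the model is y = b + a * x / (k + x), with a scaling the binding term and b acting as the background offset. The data are synthetic and the true values are chosen only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def fluorescence_model(x, a, b, k):
    """Background b plus instrument scale a times the single-site binding term."""
    return b + a * x / (k + x)


# Synthetic stand-in data (NOT the MIT dataset); true values are illustrative.
true_a, true_b, true_k = 2000.0, 500.0, 10.0
rng = np.random.default_rng(1)
x = np.linspace(0.0, 200.0, 40)
y = fluorescence_model(x, true_a, true_b, true_k) + rng.normal(0.0, 5.0, x.size)

# Rough initial guesses read off the data: signal range, lowest signal, mid-range x.
p0 = [y.max() - y.min(), y.min(), np.median(x)]
# Lower bounds keep a non-negative and k strictly positive; b is unconstrained.
lower = [0.0, -np.inf, 1e-9]
(a_fit, b_fit, k_fit), _ = curve_fit(
    fluorescence_model, x, y, p0=p0, bounds=(lower, np.inf)
)
print(f"a = {a_fit:.1f}, b = {b_fit:.1f}, k (Kd) = {k_fit:.2f}")
```

Reading the initial guesses from the data rather than hard-coding them tends to make the fit less sensitive to the units of the fluorescence signal.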
Kd was found to be 1.438 x 10^4, with a y-axis intercept (b) of 451.7. The instrument constant (a) was found to be 2280. The y-axis intercept represents background fluorescence: before any binding takes place, the dye emits some background fluorescence. Interestingly, the fitted curve reaches about 1600 FU at x = Kd. This is close to the result of subtracting b from a, i.e., 2280 - 451.7 ≈ 1828. More precisely, at x = Kd the binding term equals one half, so the model gives b + a/2 = 451.7 + 1140 ≈ 1592 FU.
Machine Learning Approach
There are generally 2 machine learning approaches used to predict Drug-Target interactions. One is binary classification, which determines whether an interaction exists for a given drug-target pair. The other is regression, which estimates continuous values that indicate a drug's ability to bind to the target of interest. This ability to bind is also called binding affinity. Many of these methods are structure-based and require three-dimensional (3D) structural information about the targets, which is still scarce at the time of writing. To circumvent this limitation, I have resorted to Affinity2Vec, a graph-based representation learning technique recently developed by Thafar et al. (2022) and published in Scientific Reports. The authors constructed a weighted heterogeneous graph that integrates data from several sources, including drug-drug similarity, target-target similarity, and Drug-Target binding affinities and equilibrium constants.
Data Processing for Machine Learning
The authors provided two datasets on GitHub to benchmark Affinity2Vec, i.e., the KIBA set and the Davis set. I have used the Davis set to build machine learning regression models that predict the dissociation constant (Kd) and binding affinity for Drug-Target pairs.
Several variants of each target were created as follows:
The following shows snippets of the Python code that I developed to process and assemble the data. The logic in the code largely follows that recommended by Thafar et al.
Drug IDs and Protein IDs were the first to be retrieved. There were 68 unique Drug IDs and 442 unique Protein IDs. Taking all Drug-Target combinations resulted in a total of 30,056 Drug-Target pairs.
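Enumerating every Drug-Target combination is a Cartesian product of the two ID lists. The sketch below uses placeholder identifiers, since the real Davis IDs are not reproduced here, and confirms the pair count of 68 x 442.

```python
from itertools import product

# Placeholder identifiers standing in for the real Davis drug and protein IDs.
drug_ids = [f"D{i:03d}" for i in range(68)]
protein_ids = [f"P{j:03d}" for j in range(442)]

# Cartesian product: one (drug, protein) tuple per possible combination.
pairs = list(product(drug_ids, protein_ids))
print(len(pairs))  # 68 * 442 = 30056
```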
Equilibrium Constant Kd data was read into a NumPy array:
Each unique Protein ID was indexed for later use.
This step is similar to the first one: it produces the data frame that stores the Drug-Target pairs, with the Equilibrium Constants, Kd, as labels.
Binding Affinity values were retrieved from a pickle file. Three variants of Affinity were computed: logarithm base 10, normalized and exponential.
These variants were added as new columns to the data frame.
Two variants of Equilibrium Constants, Kd were also computed and added as new columns to the data frame.
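The exact definitions of these variants are not spelled out in the text, so the following is only a sketch of what such target transformations might look like. It assumes Kd is given in nM and uses the pKd convention commonly applied to the Davis set, pKd = -log10(Kd / 1e9). The input values and all variable names are hypothetical.

```python
import math


def min_max(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale constant values")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical Kd values in nM (not taken from the Davis set).
kd_nm = [10.0, 100.0, 10000.0]

# Assumed affinity definition: pKd = -log10(Kd / 1e9), with Kd in nM.
pkd = [-math.log10(k / 1e9) for k in kd_nm]

kd_ln = [math.log(k) for k in kd_nm]       # natural-log variant of Kd
aff_log10 = [math.log10(a) for a in pkd]   # log base 10 variant of affinity
aff_norm = min_max(pkd)                    # affinity rescaled to 0-1
aff_exp = [math.exp(a) for a in pkd]       # exponential variant of affinity

print(pkd)
print(aff_norm)
```

Each list would then be attached to the data frame as its own column, one per prediction target.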
Drug-Target Combinations obtained from the first step were added to the data frame.
Targets for Predictions
A total of 5 targets were created, split into 2 groups: (1) Binding Affinity and (2) Equilibrium Constant, Kd.
An ML model was trained for each of the above targets, and its performance was measured using the Test Set.
Machine Learning Training
Before training an ML model, the final dataset was split into 3 parts: a 75% training set, a 20% test set and 5% held out as unseen data. Driverless AI, a commercially available automated ML platform from H2O.ai, was used to train the models.
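A three-way split of this kind can be sketched with the standard library alone. The function name, the seed and the row count below are illustrative assumptions. The original split may have been produced differently, for example inside Driverless AI.

```python
import random


def split_indices(n_rows, fractions=(0.75, 0.20, 0.05), seed=42):
    """Shuffle row indices and cut them into train / test / unseen parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_rows * fractions[0])
    n_test = int(n_rows * fractions[1])
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    unseen = indices[n_train + n_test:]  # whatever remains, roughly 5%
    return train, test, unseen


# Hypothetical row count, chosen only to make the proportions easy to read.
train, test, unseen = split_indices(1000)
print(len(train), len(test), len(unseen))
```

Keeping the unseen part out of both training and model selection is what lets it serve as an independent check later on.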
Performance of ML Model
Driverless AI comes bundled with a number of ML algorithms such as Decision Tree, Generalized Linear Model, LightGBM and XGBoost. During training, different combinations of features from the data were automatically used to train intermediate models. The performance of these models was evaluated internally over many iterations until the best model was found. Once the ML model was complete, the Test Set was used to evaluate its performance using the following metrics:
First, consider the predictions of the Equilibrium Constant, Kd. The model performance was abysmal, with a MAPE of over 6000% and equally poor results on the rest of the metrics. After transforming Kd using the natural logarithm, the performance improved tremendously: MAPE fell to 31%, with MSE and RMSE both close to 2.
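The three metrics named above can be written in a few lines of Python. The toy numbers below are hypothetical and are not the model's predictions. They only illustrate one plausible reason why MAPE can explode on raw Kd: when true values span several orders of magnitude, errors on the small values dominate the percentage.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if a true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical Kd values spanning three orders of magnitude, and a crude model
# that always predicts the same mid-range value.
y_true = [10.0, 100.0, 10000.0]
y_pred = [1000.0, 1000.0, 1000.0]
print(f"MAPE on raw Kd: {mape(y_true, y_pred):.0f}%")

# The same predictions compared on the natural-log scale.
log_true = [math.log(v) for v in y_true]
log_pred = [math.log(v) for v in y_pred]
print(f"MAPE on ln(Kd): {mape(log_true, log_pred):.0f}%")
```

On these toy values the raw-scale MAPE is in the thousands of percent, while the log-scale MAPE is below 100%, even though the predictions are the same.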
As for Binding Affinity, all 3 variants of this target showed comparable outcomes, with the exponentially transformed variant scoring best. The normalized variant scored poorly at 33% compared with the other 2 variants.
In conclusion, appropriate feature engineering and target transformation are crucial for training machine learning models that perform and generalize well.
Performance on Unseen Data
The regression model was used to predict Kd for a total of 752 UNSEEN data points. The following screenshot shows the performance in H2O Driverless AI:
The resulting performance on predicting Binding Affinity:
The 2 models above show results comparable to their Test Set performance. In an actual drug discovery setting, the UNSEEN data could come from experiments that identify surface proteins as disease biomarkers. Curated proteins from past experiments are also ideal candidates to be screened by ML models for potential targets.
Protein and Target Selection
Prediction results of UNSEEN data for Kd and Affinity were combined using DRUG_ID and PROTEIN_NAME as join keys.
Predicted values of Kd and Affinity were standardized to 0 – 1.
In pursuit of the optimal Drug-Target pairs, I have set a new search criterion: pairs with the lowest possible Kd and the highest possible Affinity. To achieve this, a new column, KD_AFF_DIFF, was created by subtracting the standardized Kd from the standardized Affinity. The data was then sorted in descending order of this difference.
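The join, rescaling and ranking steps can be sketched with pandas as below. DRUG_ID, PROTEIN_NAME and KD_AFF_DIFF come from the text. The other column names and the three rows of predictions are hypothetical, and min-max scaling is assumed as the way values were standardized to the range 0 to 1.

```python
import pandas as pd

# Hypothetical predictions for three Drug-Target pairs (not real model output).
kd_pred = pd.DataFrame({
    "DRUG_ID": ["D1", "D2", "D3"],
    "PROTEIN_NAME": ["P1", "P1", "P2"],
    "KD_PRED": [50.0, 5000.0, 500.0],
})
aff_pred = pd.DataFrame({
    "DRUG_ID": ["D1", "D2", "D3"],
    "PROTEIN_NAME": ["P1", "P1", "P2"],
    "AFF_PRED": [7.3, 5.3, 6.3],
})


def min_max(col: pd.Series) -> pd.Series:
    """Rescale a column linearly to the 0-1 range."""
    return (col - col.min()) / (col.max() - col.min())


# Join the two prediction tables on the shared keys.
df = kd_pred.merge(aff_pred, on=["DRUG_ID", "PROTEIN_NAME"], how="inner")
df["KD_STD"] = min_max(df["KD_PRED"])
df["AFF_STD"] = min_max(df["AFF_PRED"])

# High affinity and low Kd both push this difference towards +1.
df["KD_AFF_DIFF"] = df["AFF_STD"] - df["KD_STD"]
ranked = df.sort_values("KD_AFF_DIFF", ascending=False).reset_index(drop=True)
print(ranked[["DRUG_ID", "PROTEIN_NAME", "KD_AFF_DIFF"]])
```

Because both columns are rescaled to the same range, the difference lies between -1 and +1, and the top of the sorted table holds the pairs that best satisfy both criteria at once.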
Analysis of Top 5 Predictions
The top 5 pairs obtained were further analyzed using information from PubChem.