Application of Machine Learning algorithms in modeling the role of the Microbiome in the Colorectal Cancer diagnosis and therapy - Part 3
Miodrag Cekikj, PhD CSE
Transforming Businesses with Applied AI | R&D Lead Technical Consultant @ IWConnect | Microsoft MVP | Technical Trainer | Web3 & Blockchain Practitioner
Bioinformatics Framework design and Methodology - Machine Learning Modelling Results for understanding the colorectal cancer carcinogenesis
In one of the previous articles, I gave an overview of the design and development of a comprehensive bioinformatics framework and machine learning pipeline for deep microbiome data analysis and interpretation. So far, I have applied the methodology and elaborated on the technical results and the interpretation of the key biomarkers that can play a significant role in understanding the therapy-resistance mechanism in patients diagnosed with colorectal cancer (CRC). This article follows the identical approach for the second case study, CRC carcinogenesis, covering the samples described by the same Tubular Adenoma histology. Referring to the data demographics overview, this group consisted of 23 samples from patients with pre-operative Tubular Adenoma (Adenoma) and 21 samples from patients diagnosed with a post-operative Newly Developed Adenoma (NDA).
* Note: Because this case study follows the same design and implementation, I will elaborate only on the main modelling phase, the high-contribution features, and the statistical analysis results.
ML Modeling Results
As mentioned before, after applying the data normalization and scaling techniques, I calculated the Cronbach's alpha and Cohen's kappa coefficients. Referring to the previous definition, the Cronbach's alpha thresholds depend on the stage of the work: early stage of research (0.5 or 0.6/0.7); applied research (0.8); when making an important decision (0.9). Usually, a Cronbach's alpha value > 0.75 is considered acceptable for microbiome-related studies. Cohen's kappa coefficient, on the other hand, is interpreted as follows: < 0.4 is considered poor; 0.4 - 0.75 is considered moderate to good; > 0.75 represents excellent data agreement. The results from these calculations are presented in the table below:
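The two coefficients above can be sketched in a few lines of plain Python. This is only a minimal illustration of the standard formulas, not the pipeline used in the study: the function names, the toy table, and the toy label lists are hypothetical, and the article does not state which pairs of labelings were compared for Cohen's kappa.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a samples-by-items table (list of equal-length rows)."""
    k = len(rows[0])                               # number of items (columns)
    columns = list(zip(*rows))                     # transpose into per-item columns
    item_var = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_var / total_var)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def kappa_band(kappa):
    """Map a kappa value to the bands quoted in the text."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.75:
        return "moderate to good"
    return "excellent"


# Hypothetical toy data, for illustration only.
toy_table = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
labels_a = ["Adenoma", "Adenoma", "NDA", "NDA"]
labels_b = ["Adenoma", "NDA", "NDA", "NDA"]

alpha = cronbach_alpha(toy_table)
kappa = cohen_kappa(labels_a, labels_b)
print(alpha, kappa, kappa_band(kappa))
```

The sketch uses sample variances (n - 1 in the denominator) and does not guard against degenerate inputs such as a single item or perfect chance agreement.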
The general ML modelling performance metrics for the pre-operative Adenoma and post-operative NDA groups are presented in the following table.
I also calculated the Precision, Recall and F1-Score metrics for each of the two subgroups. The results are displayed in the following table:
As in the previous immunotherapy effect case study, I also tried the XGBoost and AdaBoost algorithms, which brought no significant improvement over the forest-based approach described above. Therefore, I identified the second-phase Python-based random forest classifier as the most performant and selected its most important features as a reference set for further statistical analysis.
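As a reminder of how the per-class metrics are defined, here is a minimal sketch in plain Python. The helper name and the toy labels are hypothetical and are not taken from the study's data or code.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy predictions, for illustration only.
y_true = ["Adenoma", "Adenoma", "Adenoma", "NDA", "NDA"]
y_pred = ["Adenoma", "Adenoma", "NDA", "NDA", "Adenoma"]

for label in ("Adenoma", "NDA"):
    print(label, per_class_metrics(y_true, y_pred, label))
```

Computing the metrics once per class, with that class treated as the positive one, is what makes it possible to report separate values for the Adenoma and NDA subgroups.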
Statistical Analysis and Highly Contributing Features Results
The comparison of the Adenoma and NDA sample groups yielded a total of 86 unique genera. Of these, the ML algorithm singled out 28 (32.6%) as the most important features, with Benjamini-Hochberg adjusted p-values between the groups ranging from 0.002 to 0.048. In the pre-operative Adenoma group, I found Oscillospiraceae-UCG-002**, Anaerovoracaceae group, Ruminococcus, Prevotella, Lachnospiraceae FCS020 group and Blautia to be the genera that are biologically interesting for further analysis and interpretation. The most significant genera among the post-operative NDA samples were Tyzzerella, Bifidobacterium and Lachnoclostridium.
** Note: The designed bioinformatics framework and pipelines identified some unclassified genome sequences (UCG) that need to be additionally investigated. This could potentially result from the applied taxonomic analysis and reannotation of the raw reads against updated bacterial references, using the SILVA 138.1 16S reference database (latest reference database update on 27 August 2020).
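The Benjamini-Hochberg adjustment mentioned in the statistical analysis above can be sketched as follows. This is a minimal plain-Python illustration of the standard step-up procedure; the function name and the toy p-values are hypothetical, and the statistical test that produced the raw p-values in the study is not shown here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genera, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)

# Keep the features whose adjusted p-value is below a 0.05 threshold (assumed here).
significant = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, significant)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values grow.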
To complete the general picture, the following diagram visualizes the statistical analysis results for genera abundances in the resistant and non-resistant groups:
Biological analysis and interpretation
The most compelling genus detected as an important feature between the samples of patients with newly developed adenoma and patients diagnosed with tubular adenoma before clinical treatment was Prevotella. Prevotella is primarily reported in the oral microbiome, yet it is found in relatively high abundance in proximal colon cancer, which, according to research, appears to be associated with elevated numbers of IL17-producing cells in the mucosa of patients with CRC. In addition, as mentioned in the original publication, one study on Prevotella in a transgenic mouse model showed that this genus promotes the differentiation of Th17 cells that primarily colonize the gut and migrate to the bone marrow, where they support the progression of multiple myeloma.
Conclusion
The study documented in this series of articles introduced a multidisciplinary, systematic approach and a methodology for observing the CRC drug-resistance mechanism and carcinogenesis using the microbial composition specified at the genus level. Building on concepts from bioinformatics studies, I developed several highly performant machine learning models to assist clinicians in efficiently analyzing the microbiome diversity of resistant patients in order to address and counter tumor proliferation, newly developed adenoma, inflammation promotion, and potential DNA damage. I identified the Random Forest Classifier as the most suitable algorithm for supporting follow-up techniques for interpreting feature significance. I further examined the relevance of the significant features obtained from the models by exploiting the stochastic nature of the algorithm, which yielded additional data insights and variable importance ranks. Additionally, I incorporated a symbiotic bacteria analysis to investigate the features' correlation and interaction (the joint contribution of features to the specific resistance or adenoma class).
Thus far, many studies have pointed out the importance of individual genera present in the microbiome and tend to treat them separately. This work contributes to the field of predictive modeling in healthcare and offers a different perspective on treatment: our aggregate analysis gives clear results for the genera that are often found together in a resistant group of patients, suggesting that resistance is not due to the presence of a single pathogenic genus in the patient microbiome, but to several bacterial genera that live in symbiosis. Our findings are also complementary to other microbiome-related studies published in the literature, which supports the potential and justification of the applied approach.
The established methodology can also be applied to unseen microbiome data to help oncologists decide on treatment and post-treatment strategies and to better understand immunotherapy response and drug resistance. As a further action point, I would like to emphasize the potential for improving the designed symbiotic bacterial analysis to provide a combined overview of the model's predictiveness and to uncover additional deep data correlations and knowledge.
Thank you for reading this article and the whole series in general. I believe it is clear and comprehensive in covering the core concepts of the proposed methodology and technical pipeline.
Thank you for being so supportive as well. I would be grateful if you took the time to comment, share the article, and connect for further discussion and collaboration. Feel free to share your thoughts and experience in this regard.
Towards Data Science publication available at the following URL: https://towardsdatascience.com/application-of-machine-learning-algorithms-in-modeling-the-role-of-the-microbiome-in-the-colorectal-2c222ea6ba0