DATA SCIENCE AND CHEMICAL SCIENCE. PART 5. TEM Image-Based Predictive Modeling for Differently Shaped Rh Nanoparticles in a Hybrid Photocatalyst


Introduction

On the basis of the article “Rhodium nanoparticles impregnated on TiO2: strong morphological effects on hydrogen production” (https://doi.org/10.1039/D0NJ02419H), authored by B. Albuquerque, G. Chacón, M. Nazarkovsky and J. Dupont, a data analysis of the size distribution profiles for all three subject samples (NP, NC and Oh) was performed (classification of the distributions, analysis of variances, discriminant analysis). The samples differ in the shape of the Rh nanoparticles (NP – spherical, NC – cubic, Oh – octahedral) and, as a result, their photocatalytic activity was proven to depend on the morphology. The results allowed us to distinguish the types of the samples by their profiles and to build predictive machine learning models that classify the samples by their particle size distributions. The developed algorithms can be deployed to identify the sample type from TEM images. To this end, machine learning approaches such as the non-parametric K-Nearest Neighbors method (maximal K = 100, metric: Euclidean distance with uniform weights), Naïve Bayes, Logistic Regression, Bootstrap Forest (one term sampled per split, learning rate 0.1, 35 trees) and Classification Tree (9 splits) were trained and validated, and their effectiveness in predicting the samples from their size distribution profiles was compared. A learning rate of 0.1 is recommended in order to avoid overfitting the real experimental data. The most effective model turned out to be Logistic Regression, whose error, the misclassification rate (MR), at the validation stage is less than 8% (7.64%). This model’s other metrics – Generalized R2 and Entropy R2 – are the highest among the models with the same MR (Naïve Bayes and Bootstrap Forest). As a result, an offline calculator for the Logistic Regression model was developed and provided on github.com to predict the sample type. The project was coded in JSL (the SAS JMP scripting language).

Methods for the analysis of variances

The analysis of variances of the particle size by sample type (NP, NC or Oh) was undertaken with a homo-/heteroscedasticity comparison (equality-of-variances test) using the Brown-Forsythe, Bartlett, Levene and O’Brien tests, assuming the null hypothesis (H0, p >> 0.05) that the variances are equal. Each approach uses a different method for measuring variability. Levene’s test estimates, for each group, the mean of the absolute differences from the group mean and then runs a t-test (or, equivalently, an F-test) on these estimates. The Brown-Forsythe test measures the differences from the median instead of the mean (as opposed to Levene’s test) and tests these differences. O’Brien’s test “tricks” the t-test by telling it that the means are really variances. Bartlett’s test derives the test statistic mathematically under the assumption that the data are normal; despite its power, it is not robust to non-normality.

The non-parametric Kruskal-Wallis test, together with the Tukey-Kramer means-comparison test, was also applied in this study. The α-criterion for acceptance/rejection of H0 is 0.05.
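For readers who prefer open-source tools, a minimal Python/SciPy sketch of the equality-of-variances and Kruskal-Wallis tests is given below. The original analysis was performed in JSL (JMP); the three size arrays here are synthetic placeholders standing in for the measured TEM data, and O’Brien’s test is not available in SciPy.

```python
# Minimal sketch with SciPy (the original study used JSL/JMP).
# The arrays are synthetic placeholders for the particle sizes (nm) of NP, NC and Oh.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
np_sizes = rng.normal(3.2, 0.3, 150)   # placeholder: NP (spherical)
nc_sizes = rng.normal(6.0, 1.0, 150)   # placeholder: NC (cubic)
oh_sizes = rng.normal(7.0, 1.5, 225)   # placeholder: Oh (octahedral)
groups = [np_sizes, nc_sizes, oh_sizes]

# Levene's test: absolute deviations from the group means
print("Levene        :", stats.levene(*groups, center="mean"))
# Brown-Forsythe test: deviations from the group medians
print("Brown-Forsythe:", stats.levene(*groups, center="median"))
# Bartlett's test: assumes normality, not robust to non-normality
print("Bartlett      :", stats.bartlett(*groups))
# Non-parametric Kruskal-Wallis test on particle size by sample type
print("Kruskal-Wallis:", stats.kruskal(*groups))
```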


Results

According to the Anderson-Darling criterion, the samples NC and Oh are characterized by normal size-distribution profiles, whereas NP is not: its nanoparticle size range is the narrowest and its particles are, in general, the smallest (Fig. 1). The p-value for NP is << 0.05, justifying the rejection of H0 for a normal distribution.

[Figure 1]
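The Anderson-Darling check can be sketched in the same spirit (placeholder data again; SciPy reports critical values rather than a p-value for the normality test):

```python
# Sketch: Anderson-Darling normality check per sample (placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = {
    "NP": rng.normal(3.2, 0.3, 150),   # placeholder sizes, nm
    "NC": rng.normal(6.0, 1.0, 150),
    "Oh": rng.normal(7.0, 1.5, 225),
}
for name, sizes in samples.items():
    res = stats.anderson(sizes, dist="norm")
    # H0 (normality) is rejected when the statistic exceeds the critical
    # value at the chosen significance level (5% here).
    crit_5 = res.critical_values[list(res.significance_level).index(5.0)]
    print(f"{name}: A2 = {res.statistic:.3f}, 5% critical value = {crit_5:.3f}")
```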

The variances are revealed to be heteroscedastic by all four tests and by Welch’s method (p << 0.05). The non-parametric Kruskal-Wallis test rejected the null hypothesis of equality for the three samples, and the Tukey-Kramer means comparison confirmed that all three samples are non-identical in nanoparticle size: the means are significantly different and the HSD threshold matrix entries are positive (Fig. 2).

[Figure 2]
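The Tukey-Kramer comparison (Tukey HSD generalized to unequal group sizes) can be sketched with statsmodels; the arrays are placeholders, not the measured data:

```python
# Sketch: pairwise Tukey-Kramer comparison of mean particle size by sample type
# (statsmodels handles the unequal group sizes automatically).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
sizes = np.concatenate([rng.normal(3.2, 0.3, 150),
                        rng.normal(6.0, 1.0, 150),
                        rng.normal(7.0, 1.5, 225)])
labels = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)

print(pairwise_tukeyhsd(endog=sizes, groups=labels, alpha=0.05))
```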

The discriminant analysis, with an overall misclassification rate of 12.38%, allowed us to test whether the samples can be practically recognized digitally by size. Moreover, the fully correct classification of all NP sizes is promising for the development of the machine learning models.

[Figure 3]
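As an open-source analogue, a linear discriminant analysis on the single size feature can be sketched with scikit-learn (the actual 12.38% figure comes from JMP’s discriminant platform; the data below are placeholders):

```python
# Sketch: linear discriminant analysis on particle size alone (placeholder data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)

lda = LinearDiscriminantAnalysis().fit(X, y)
mr = 1.0 - lda.score(X, y)          # overall misclassification rate
print(f"Discriminant analysis misclassification rate: {mr:.2%}")
```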

To train and validate (test) the models, the full set of 525 data points (150 for NC, 150 for NP and 225 for Oh) was divided into a training set (70% of each sample: 105 from NC or NP, and 158 from Oh) and a validation set (30% of each sample: 45 from NC or NP, and 67 from Oh) – Fig. 4.

[Figure 4]
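A stratified 70/30 split reproducing these counts can be sketched with scikit-learn (placeholder sizes; the class counts 150/150/225 follow the article):

```python
# Sketch: stratified 70/30 training/validation split (placeholder sizes).
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),               # NP
                    rng.normal(6.0, 1.0, 150),               # NC
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)  # Oh
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

print("training  :", Counter(y_train))   # expected ~105/105/157-158
print("validation:", Counter(y_val))     # expected ~45/45/67-68
```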

The Classification Tree model after 9 splits misclassified with an overall error of 8.28%; hence, ca. 92% of the samples were attributed correctly by this model. More precisely, NP and NC were classified correctly in 100% of cases, and the only error concerns the 67 Oh samples, of which 13 (19.4%) were mistakenly attributed to NC (Fig. 5). Nevertheless, the resulting tree gives a simple determination of each sample by the probability associated with the particle size, “cutting off” NP at a size of 3.8 nm (Fig. 6).

[Figure 5]
[Figure 6]
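A hedged scikit-learn analogue of the Classification Tree (the original model was built in JMP): limiting the tree to 10 leaves corresponds to at most 9 splits, and the printed rules expose the size threshold that cuts off NP.

```python
# Sketch: classification tree on particle size (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=1).fit(X_tr, y_tr)  # <= 9 splits
print(export_text(tree, feature_names=["size_nm"]))            # human-readable split rules
print(confusion_matrix(y_val, tree.predict(X_val), labels=["NP", "NC", "Oh"]))
print(f"validation MR: {1.0 - tree.score(X_val, y_val):.2%}")
```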

The Naïve Bayes model recognizes NP and NC in 100% of cases with no erroneous attribution to Oh, and its error for Oh, 17.9% (12 incorrect out of 67), is lower than that of the Classification Tree (Fig. 7). Thus, the model is stronger than the Classification Tree. Despite the correct classification of NC, its AUC is not 100%, as opposed to NP (Fig. 8).

[Figure 7]
[Figure 8]
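A Gaussian Naïve Bayes sketch with scikit-learn (placeholder data; JMP’s Naïve Bayes implementation may differ in details):

```python
# Sketch: Gaussian Naive Bayes classifier on particle size (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

nb = GaussianNB().fit(X_tr, y_tr)
print(confusion_matrix(y_val, nb.predict(X_val), labels=["NP", "NC", "Oh"]))
print(f"validation MR: {1.0 - nb.score(X_val, y_val):.2%}")
```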

The Logistic Regression model is balanced, with 1 incorrect attribution out of 45 for NC and 11 out of 67 for Oh, giving an overall misclassification rate of 7.64% at the validation stage (Fig. 9). The highly competitive precision of Naïve Bayes and Logistic Regression makes it necessary to compare the models by other metrics, since their misclassification rates are identical. Logistic Regression has a slightly higher Generalized R2 and Entropy R2 and a lower −log p than Naïve Bayes and is therefore more effective. Additionally, the AUCs in Logistic Regression for NC and Oh are slightly higher (Fig. 10). However, both models can be deployed for digital recognition of the samples by particle size.

[Figure 9]
[Figure 10]
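A multinomial Logistic Regression sketch with scikit-learn; the AUC here is computed one-vs-rest (macro average), and the data remain placeholders:

```python
# Sketch: multinomial logistic regression on particle size (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # multinomial for 3 classes
print(confusion_matrix(y_val, logit.predict(X_val), labels=["NP", "NC", "Oh"]))
print(f"validation MR: {1.0 - logit.score(X_val, y_val):.2%}")
# Macro one-vs-rest AUC, analogous in spirit to the per-class AUCs in Fig. 10
auc = roc_auc_score(y_val, logit.predict_proba(X_val),
                    multi_class="ovr", labels=logit.classes_)
print(f"macro one-vs-rest AUC: {auc:.4f}")
```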

Interestingly, the K-NN (K-Nearest Neighbors) model reaches the same minimal MR at K = 15 (Fig. 11). The NP sample was classified without any error, while 16.4% of Oh was “recognized” by this simple algorithm as NC; the latter was classified almost correctly – only 1 point out of 45 was assigned to Oh. From the bar plot of Euclidean distances by size for each sample, it is clearly seen that NC and Oh have similar values in the respective size regions, which may be the reason for the partial misclassification within the NC-Oh pair (Fig. 12).

[Figure 11]
[Figure 12]
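A K-NN sketch with the same settings (K = 15, Euclidean distance, uniform weights) plus a scan over K up to 100, on placeholder data:

```python
# Sketch: K-Nearest Neighbors (K = 15, Euclidean, uniform weights) on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=15, metric="euclidean", weights="uniform").fit(X_tr, y_tr)
print(confusion_matrix(y_val, knn.predict(X_val), labels=["NP", "NC", "Oh"]))
print(f"validation MR: {1.0 - knn.score(X_val, y_val):.2%}")

# Scan K = 1..100 to locate the minimal validation MR (the article reports K = 15)
best = min(range(1, 101),
           key=lambda k: 1.0 - KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val))
print("best K on this placeholder data:", best)
```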

The only model that gave the minimal error in Oh attribution (13.4%) is Bootstrap Forest with the number of trees = 35 and a single term sampled per split (Fig. 13). The MR stops changing after the 31st tree and equals 7.64% (Fig. 14), owing to the misclassification of NC – 1 point to NP and 2 to Oh. This is why the AUC for NC, 94.91%, is smaller than for Logistic Regression and Naïve Bayes (Fig. 15), and the metrics table supports the conclusion that this model is weaker for the overall prediction.

[Figure 13]
[Figure 14]
[Figure 15]
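JMP’s Bootstrap Forest is a bagged-tree (random forest) model; a scikit-learn analogue with 35 trees and one term sampled per split, on placeholder data, might look as follows:

```python
# Sketch: random forest analogue of JMP's Bootstrap Forest (35 trees,
# one feature sampled per split), on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(3.2, 0.3, 150),
                    rng.normal(6.0, 1.0, 150),
                    rng.normal(7.0, 1.5, 225)]).reshape(-1, 1)
y = np.array(["NP"] * 150 + ["NC"] * 150 + ["Oh"] * 225)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

forest = RandomForestClassifier(n_estimators=35, max_features=1,   # 1 term per split
                                bootstrap=True, random_state=1).fit(X_tr, y_tr)
print(confusion_matrix(y_val, forest.predict(X_val), labels=["NP", "NC", "Oh"]))
print(f"validation MR: {1.0 - forest.score(X_val, y_val):.2%}")
```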


Conclusions

Finally, Logistic Regression is concluded to be the most accurate predictive model to distinguish the three Rh-TiO2 samples directly from the microphotographs after particle size distribution analysis. The proposed equations are as follows:

Lin(NC) = 10.24 - 1.57 * Size (nm)

Lin(NP) = 93.47 - 25.08 * Size (nm)

Probability for NC = 1 / (1 + exp(-Lin(NC)) + exp(Lin(NP) - Lin(NC)))

Probability for NP = 1 / (1 + exp(Lin(NC) - Lin(NP)) + exp(-Lin(NP)))

Probability for Oh = 1 / (1 + exp(Lin(NC)) + exp(Lin(NP)))

Probability for Oh + Probability for NC + Probability for NP = 1
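These formulas can be wrapped into a small stand-alone calculator. The Python function below is an illustrative re-implementation of the published coefficients (the official tool is the HTML profiler linked below); the form used for the NC and NP probabilities is algebraically equivalent to the expressions above, with Oh as the reference class.

```python
# Illustrative re-implementation of the logistic-regression probabilities above.
import math

def rh_tio2_probabilities(size_nm: float) -> dict:
    """Return P(NC), P(NP), P(Oh) for a given Rh particle size (nm)."""
    lin_nc = 10.24 - 1.57 * size_nm
    lin_np = 93.47 - 25.08 * size_nm
    # Oh is the reference class; the three probabilities sum to 1.
    p_oh = 1.0 / (1.0 + math.exp(lin_nc) + math.exp(lin_np))
    return {"NC": math.exp(lin_nc) * p_oh,
            "NP": math.exp(lin_np) * p_oh,
            "Oh": p_oh}

for size in (3.0, 5.0, 8.0):
    probs = rh_tio2_probabilities(size)
    print(size, "nm ->", {k: round(v, 3) for k, v in probs.items()})
```

On these coefficients, small particles are assigned to NP, intermediate sizes to NC and the largest to Oh, which is consistent with the distributions discussed above.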

The interactive open-source HTML profiler for offline calculation of the probability of each sample type is available at https://github.com/Nazarkovsky/Rh-TiO2-classificator

To use it, download the archive, unzip it and open the file in your browser.

Let’s give ourselves over to data science in chemistry and the chemical industry!

To be continued

Acknowledgements

The author of the present communication is grateful to Dr. B. Albuquerque, Dr. G. Chacón and Prof. J. Dupont for the invitation to coauthor the original research article. Special thanks to Dr. David Kirmayer (The Hebrew University of Jerusalem, Israel) for consulting.

The cover image was reproduced by permission of The Royal Society of Chemistry (RSC) on behalf of the European Society for Photobiology, the European Photochemistry Association, and RSC.
