Gaussian mixture Machine Learning models and clustering. A perspective by Darko Medin.
Darko Medin
Data Scientist and Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with digital products, Artificial Intelligence and Machine Learning.
Clustering is a group of Machine Learning methods primarily used in Unsupervised analysis; however, today it is increasingly used in Supervised analysis and even in models such as Recommender systems. For their ability to efficiently identify groups in data, these methods have become very useful tools in today's Molecular Biology, Biostatistics and Bioinformatics, and I will mainly focus on this area in the article.
These methods rely strongly on computation, as they generally take algorithmic form, so knowing programming languages is essential for performing these tasks. The statistical basis behind them, however, tends to receive less attention in the scientific community.
This is an example of identifying clusters in the Iris dataset [1], with some randomization introduced by selecting 121 flowers. The red cluster, Iris setosa, is very well defined; the green one corresponds mostly to Iris versicolor and the blue one mostly to Iris virginica. The model was created in R [2] using a Gaussian Mixture Model algorithm (GMM, mclust) [3,4]. The algorithm performs well on this dataset, since the data are quite continuous and GMM is an advanced parametric approach. The cluster centers are means, which is good from a generalization perspective: Gaussian probability theory states that as samples grow infinitely large, the highest probabilities and data densities concentrate at those means and in the predicted expectation areas around them. Generalization is very important in clustering statistics, especially when analyzing smaller samples, and accurately assessing these confidence areas is an essential step, perhaps even more important than finding the centers in the first place.
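The fitting idea behind a GMM can be sketched with a minimal expectation-maximization (EM) loop. The toy below is a one-dimensional, two-component version in Python, purely for illustration; the model in the article was fitted with mclust in R, which handles multivariate data, covariance structures and model selection. The function name `em_gmm_1d` and all constants here are my own illustrative choices, not part of any library.

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """Minimal EM for a two-component 1D Gaussian mixture (toy sketch).

    Not the article's mclust model: this version is univariate, fixes
    the number of components at two, and does no model selection.
    """
    # Deterministic initialization: place the two means inside the data range
    lo, hi = min(data), max(data)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var = [((hi - lo) / 4) ** 2] * 2
    w = [0.5, 0.5]  # equal mixing weights to start

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, w
```

On two well-separated simulated groups the recovered means land close to the true group means, which is the "centers as means" behaviour described above.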
Inter-cluster distance is another very important parameter in parametric clustering. Finding the center based on means and variability is the statistical foundation, and if the means are relevant, the Euclidean distances between those means will be relevant as well. Uncertainty in classification depends on the inter-cluster distance, but also on the intra-cluster distance.
The general statistical logic is that decreasing inter-cluster distance, or increasing intra-cluster distance (variability), will increase the level of uncertainty in classification. So clusters that are close to each other, with enough variability in the data, can produce uncertain predicted classes, generally near the border line between them.
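This inter- vs intra-cluster trade-off can be made concrete with a toy posterior for two equal-weight, equal-variance 1D Gaussian clusters: a point exactly midway between the means is maximally uncertain (posterior 0.5), and the same off-center point becomes far less certain as the intra-cluster spread grows relative to the distance between the means. The function name `posterior_c1` and the numbers below are my own illustrative choices.

```python
import math

def posterior_c1(x, mu1, mu2, sigma):
    """Posterior probability that x belongs to cluster 1, assuming two
    equal-weight Gaussian clusters with means mu1, mu2 and the same
    standard deviation sigma (toy 1D illustration, not a library API)."""
    def kernel(m):
        # Unnormalized Gaussian density; the shared normalizer cancels out
        return math.exp(-(x - m) ** 2 / (2 * sigma ** 2))
    p1, p2 = kernel(mu1), kernel(mu2)
    return p1 / (p1 + p2)
```

For means at 0 and 1, a point at x = 0.2 is classified almost certainly into cluster 1 when sigma = 0.1 (tight clusters), but with sigma = 1.0 (spread comparable to the inter-cluster distance) the posterior drops toward 0.5, exactly the borderline uncertainty described above.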
Sensitivity to outliers is often an issue when using parametric approaches to estimate these more generalized expectations. In molecular biology it is generally hard to define an outlier standard: sometimes extreme values cannot be disregarded, because they may be biologically important, so a non-parametric approach might be better in such cases.
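The core of that sensitivity is easy to demonstrate: a mean-based center (as in a GMM) moves substantially when a single extreme value is added, while a median, a typical non-parametric summary, barely moves. The helper `center_shift` and the sample values below are my own toy illustration.

```python
import statistics as st

def center_shift(values, outlier):
    """Return how far the mean and the median each move when a single
    extreme value is appended (toy illustration of outlier sensitivity)."""
    mean_shift = abs(st.mean(values + [outlier]) - st.mean(values))
    median_shift = abs(st.median(values + [outlier]) - st.median(values))
    return mean_shift, median_shift
```

For example, sepal-length-like values [5.1, 5.0, 4.9, 5.2, 5.0] with an extreme reading of 12.0 shift the mean by about 1.16 but the median by only 0.05, which is why a non-parametric center may be preferable when extreme values could be biologically real.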
As in this example, there are 3 Iris species here: I. setosa, I. versicolor and I. virginica. Many variables could coexist there and act as confounders: each flower has a different flowering time and different amounts of water and other nutrients, so Sepals and Petals could behave differently as a result. These additional effects could easily affect a parametric model, but the GMM algorithm will stay close to an accurate expectation in general because of the Bayesian estimation [3] included in the algorithm. Even though I am a Biostatistician, I also have a Biological background, so I know the importance of these fine details in any biological assessment.
A lot of data in molecular biology and bioinformatics is also non-parametric, yet there are far fewer fully non-parametric than parametric approaches in clustering. It is also a good idea to include more statistics-based approaches in clustering, as that makes it easier to analyze data distributions and find suitable algorithms for them.
Most generalized parametric model-based clustering algorithms perform very well in classification tasks, given that good variables (normal, continuous data) are fed into the model. However, some aspects of the data are known to be sensitive to outliers and confounders, and non-parametric approaches are more appropriate in many cases. Inter-cluster and intra-cluster distances are very important for classification uncertainty analysis in general, but especially in parametric approaches. In the end, biological intuition and statistical reasoning both play a role in the final interpretation of the results.
by Darko Medin, a Biostatistics and Data Science expert
References
1. R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics, 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. (Iris dataset)
2. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
3. Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016). "mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models". The R Journal, 8/1, pp. 205–233.
4. Breslow, N. E.; Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models", Journal of the American Statistical Association, 88 (421): 9–25, doi:10.2307/2290687, JSTOR 2290687