Gaussian mixture Machine Learning models and clustering. A perspective by Darko Medin.
Darko Medin
Data Scientist and Biostatistician. Developer of ML/AI models. Researcher in the fields of Biology and Clinical Research. Helping companies with digital products, Artificial Intelligence and Machine Learning.
Clustering is a group of Machine Learning methods primarily used in Unsupervised analysis; however, today it is increasingly used in Supervised analysis and even in models such as Recommender systems. For their ability to efficiently identify groups in data, these methods have become very useful tools in today's Molecular Biology, Biostatistics and Bioinformatics, and I will mainly focus on this area in the article.
These methods rely strongly on computation, as they generally take algorithmic form, so knowing programming languages is essential for performing these tasks. The statistical basis behind them, however, tends to receive less attention in the scientific community.
This is an example of identifying clusters in the Iris dataset [1], with some randomization introduced by selecting 121 flowers. The red cluster, Iris setosa, is very well defined; the green one corresponds mostly to Iris versicolor and the blue one mostly to Iris virginica. The model was created in R [2] using a Gaussian Mixture Model algorithm (GMM, mclust) [3,4]. The algorithm performs well on this dataset, since the data are quite continuous and GMM is an advanced parametric approach. The cluster centers are means, which is good from a generalization perspective: Gaussian probability theory states that as samples grow infinitely large, the highest probabilities and data densities concentrate at those means and in the predicted expectation areas around them. Generalization is very important in clustering statistics, especially when analyzing smaller samples, and accurately assessing these confidence areas is an essential step, perhaps even more important than finding the centers in the first place.
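The fitting idea behind a GMM can be sketched with a minimal expectation-maximization (EM) loop. The toy below is a one-dimensional, two-component version in Python, purely for illustration; the model in the article was fitted with mclust in R, which handles multivariate data, covariance structures and model selection. The function name `em_gmm_1d` and all constants here are my own illustrative choices, not part of any library.

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    """Minimal EM for a two-component 1D Gaussian mixture (toy sketch).

    Not the article's mclust model: this version is univariate, fixes
    the number of components at two, and does no model selection.
    """
    # Deterministic initialization: place the two means inside the data range
    lo, hi = min(data), max(data)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var = [((hi - lo) / 4) ** 2] * 2
    w = [0.5, 0.5]  # equal mixing weights to start

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return mu, var, w
```

On two well-separated simulated groups the recovered means land close to the true group means, which is the "centers as means" behaviour described above.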
Inter-cluster distance is another very important parameter in parametric clustering. Finding the center based on means and variability is the statistical foundation, and if the means are relevant, the Euclidean distances between those means will be relevant as well. Uncertainty in classification depends on the inter-cluster distance, but also on the intra-cluster distance.
The general statistical logic is that decreasing inter-cluster distance, or increasing intra-cluster distance (variability), will increase the level of uncertainty in classification. So clusters that are close to each other, with enough variability in the data, can produce uncertain predicted classes, generally near the border line between them.
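This inter- vs intra-cluster trade-off can be made concrete with a toy posterior for two equal-weight, equal-variance 1D Gaussian clusters: a point exactly midway between the means is maximally uncertain (posterior 0.5), and the same off-center point becomes far less certain as the intra-cluster spread grows relative to the distance between the means. The function name `posterior_c1` and the numbers below are my own illustrative choices.

```python
import math

def posterior_c1(x, mu1, mu2, sigma):
    """Posterior probability that x belongs to cluster 1, assuming two
    equal-weight Gaussian clusters with means mu1, mu2 and the same
    standard deviation sigma (toy 1D illustration, not a library API)."""
    def kernel(m):
        # Unnormalized Gaussian density; the shared normalizer cancels out
        return math.exp(-(x - m) ** 2 / (2 * sigma ** 2))
    p1, p2 = kernel(mu1), kernel(mu2)
    return p1 / (p1 + p2)
```

For means at 0 and 1, a point at x = 0.2 is classified almost certainly into cluster 1 when sigma = 0.1 (tight clusters), but with sigma = 1.0 (spread comparable to the inter-cluster distance) the posterior drops toward 0.5, exactly the borderline uncertainty described above.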
Sensitivity to outliers is often an issue when using parametric approaches to estimate these more generalized expectations. In molecular biology it is generally hard to define an outlier standard: sometimes extreme values cannot be disregarded, because they may be biologically important, so a non-parametric approach might be better in such cases.
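The core of that sensitivity is easy to demonstrate: a mean-based center (as in a GMM) moves substantially when a single extreme value is added, while a median, a typical non-parametric summary, barely moves. The helper `center_shift` and the sample values below are my own toy illustration.

```python
import statistics as st

def center_shift(values, outlier):
    """Return how far the mean and the median each move when a single
    extreme value is appended (toy illustration of outlier sensitivity)."""
    mean_shift = abs(st.mean(values + [outlier]) - st.mean(values))
    median_shift = abs(st.median(values + [outlier]) - st.median(values))
    return mean_shift, median_shift
```

For example, sepal-length-like values [5.1, 5.0, 4.9, 5.2, 5.0] with an extreme reading of 12.0 shift the mean by about 1.16 but the median by only 0.05, which is why a non-parametric center may be preferable when extreme values could be biologically real.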
As in this example, there are 3 Iris species here: I. setosa, I. versicolor and I. virginica. Many variables could coexist there and act as confounders: each flower has a different flowering time and different amounts of water and other nutrients, so Sepals and Petals could behave differently as a result. These additional effects could easily affect a parametric model, but the GMM algorithm will stay close to an accurate expectation in general because of the Bayesian estimation [3] included in the algorithm. Even though I am a Biostatistician, I also have a Biological background, so I know the importance of these fine details in any biological assessment.
A lot of data in molecular biology and bioinformatics is also non-parametric, yet there are far fewer fully non-parametric than parametric approaches in clustering. It is also a good idea to include more statistics-based approaches in clustering, as that makes it easier to analyze data distributions and find suitable algorithms for them.
Most generalized parametric model-based clustering algorithms perform very well in classification tasks, given that good variables (normal, continuous data) are fed into the model. However, some aspects of the data are known to be sensitive to outliers and confounders, and non-parametric approaches are more appropriate in many cases. Inter-cluster and intra-cluster distances are very important for classification uncertainty analysis in general, but especially in parametric approaches. In the end, biological intuition and statistical reasoning both play a role in the final interpretation of the results.
by Darko Medin, a Biostatistics and Data Science expert
References
1. R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics, 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. (Iris dataset)
2. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
3. Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016). "mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models". The R Journal, 8/1, pp. 205–233.
4. Breslow, N. E.; Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models", Journal of the American Statistical Association, 88 (421): 9–25, doi:10.2307/2290687, JSTOR 2290687