Machine Learning 8: 'Clustering Algorithms'
Last week, we explored classification and the Random Forest algorithm. Both are part of supervised machine learning, which also includes regression analysis and predictive modelling. There is another type of machine learning, known as unsupervised machine learning. This week, we will explore unsupervised machine learning algorithms such as clustering.
Supervised Learning
Machine learning can be categorized as supervised or unsupervised. Some of the well-known supervised machine learning algorithms are SVM (Support Vector Machine), Linear Regression, Neural Networks and Naive Bayes. In supervised learning, the training data is labelled, which means we already know the correct value of the target variable for every training example, and the model learns to predict that value for new data.
Unsupervised Learning
In unsupervised learning, the training data is unlabeled and the system tries to learn without a teacher. Some of the most important unsupervised techniques are clustering (for example, k-means) and association rule learning.
What Is Clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Clustering is widely used in marketing to find naturally occurring groups of customers with similar characteristics, resulting in customer segmentation that more accurately depicts and predicts customer behavior, leading to more personalized sales and customer service efforts.
There are many clustering algorithms, each serving a specific purpose and having its own use cases. To look at clustering and its definition in more depth, here are a few links that you can go through as well.
What is Clustering in Data Mining?
Data Mining - Cluster Analysis
How Businesses Can Use Clustering in Data Mining
Different clustering techniques work best for different types of data. Let’s assume that your data is numeric, continuous and two-dimensional, as shown in the scatter plot below.
This second scatter plot, created from several "blobs" of different sizes and shapes, shows the clusters that exist in the data.
We will discuss two clustering algorithms: K-means and hierarchical clustering.
K-means
K-means needs the number of clusters, K, to be chosen before the algorithm runs, so you might be wondering how to decide the value of K in the first step.
One method, called the Elbow method, can be used to decide on a suitable number of clusters. Here you run K-means clustering for a range of K values and plot the “percentage of variance explained” on the Y-axis against “K” on the X-axis, as shown in the figure below. In this example, adding more clusters beyond K = 3 barely changes the variance explained, so K = 3 is a reasonable choice.
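To make the Elbow method concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration, not something from the original figure: the toy data from `make_toy_blobs`, the deterministic farthest-first starting centroids, and the helper names. The "variance explained" is computed as 1 minus the within-cluster sum of squares divided by the total sum of squares.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def farthest_first_init(points, k):
    """Assumed initialisation: start from the first point, then keep adding
    the point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm for K-means; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, labels


def variance_explained(points, centroids, labels):
    """1 - (within-cluster sum of squares / total sum of squares)."""
    overall = mean_point(points)
    total = sum(squared_distance(p, overall) for p in points)
    within = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return 1.0 - within / total


def elbow_scores(points, k_values):
    """Run K-means for each K and record the variance explained."""
    scores = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        scores[k] = variance_explained(points, centroids, labels)
    return scores


def make_toy_blobs(centers, per_blob=40, spread=0.5, seed=42):
    """Hypothetical toy data: Gaussian blobs around the given 2-D centers."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in centers
        for _ in range(per_blob)
    ]


if __name__ == "__main__":
    points = make_toy_blobs([(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)])
    for k, score in elbow_scores(points, range(1, 7)).items():
        print(f"K={k}: {score:.1%} of variance explained")
```

With three well-separated blobs like these, the score should rise sharply up to K = 3 and then flatten, which is the "elbow". In practice you would more likely use Sklearn's `KMeans` and read its `inertia_` attribute (the within-cluster sum of squares) for each K instead of writing the loop by hand.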
Here is another link for you to explore the same.
Hierarchical Clustering
Unlike K-means clustering, hierarchical clustering starts by treating every data point as its own cluster. It then builds the hierarchy by repeatedly finding the two nearest clusters and merging them into one, as shown in the dendrogram below.
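The merge-the-nearest-pair idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible linkage rules, and the function names and the five sample points are made up for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def single_linkage(a, b, points):
    """Squared distance between the closest pair of members of clusters a and b."""
    return min(squared_distance(points[i], points[j]) for i in a for j in b)


def agglomerative(points, n_clusters=1):
    """Bottom-up clustering; returns (clusters, merges).

    clusters: lists of point indices left after merging down to n_clusters.
    merges:   (cluster_a, cluster_b, distance) for every merge, in order.
              This is the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are nearest to each other.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d ** 0.5))
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters, merges


if __name__ == "__main__":
    # Hypothetical sample: two tight pairs and one far-away point.
    sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
    clusters, merges = agglomerative(sample, n_clusters=3)
    for a, b, dist in merges:
        print(f"merged {a} and {b} at distance {dist:.2f}")
    print("clusters:", clusters)
```

Stopping at `n_clusters` is the same as cutting the dendrogram at a chosen height. This brute-force version is slow on large data, so for real work a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` (with `dendrogram` for plotting) or Sklearn's `AgglomerativeClustering` is the more practical choice.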
More Algorithms to Learn
§ Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
§ Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
More resources for this week:
§ The 5 Clustering Algorithms Data Scientists Need to Know
§ For practice this week, implement the clustering algorithms available in Sklearn on these two Kaggle datasets.
§ Breast Cancer Wisconsin (Diagnostic) Data Set
Special thanks to Anuja Nagpal: Link - https://towardsdatascience.com/clustering-unsupervised-learning-788b215b074b