CLUSTER ANALYSIS
ABSTRACT:
Clustering is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups or, more precisely, the partitioning of a data set into subsets so that the data in each subset are similar according to some defined distance measure. This paper covers clustering algorithms, their benefits and applications, and concludes by discussing some limitations.
INTRODUCTION:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
The goal of performing a cluster analysis is to sort different objects or data points into groups in such a manner that the degree of association between two objects is high if they belong to the same group, and low if they belong to different groups.
Cluster analysis differs from many other statistical methods because it is mostly used when researchers do not have an assumed principle or fact that they are using as the foundation of their research.
This analysis technique is typically performed during the exploratory phase of research, since unlike techniques such as factor analysis, it doesn't make any distinction between dependent and independent variables. Instead, cluster analysis is leveraged mostly to discover structures in data without providing an explanation or interpretation.
Put simply, cluster analysis discovers structures in data without explaining why those structures exist.
For example, when cluster analysis is performed as part of market research, specific groups can be identified within a population. The analysis of these groups can then determine how likely a population cluster is to purchase products or services. If these groups are defined clearly, a marketing team can then target each cluster with tailored, targeted communications.
There are three primary methods used to perform cluster analysis:
● Hierarchical Cluster
● K-Means Cluster
● Two-Step Cluster
PROCEDURE AND DISCUSSION:
Hierarchical clustering
This is the most common method of clustering. It creates a series of models with cluster
solutions from 1 (all cases in one cluster) to n (each case is an individual cluster).
Strategies for hierarchical clustering generally fall into two types:
● Agglomerative: This is a "bottom-up" approach: each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
● Divisive: This is a "top-down" approach: all observations start in one cluster,
and splits are performed recursively as one moves down the hierarchy.
Algorithm
The algorithm for Agglomerative Hierarchical Clustering is (a code sketch follows the steps):
● Consider every data point as an individual cluster.
● Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
● Merge the clusters that are most similar, i.e., closest, to each other.
● Recalculate the proximity matrix for the merged clusters.
● Repeat Steps 3 and 4 until only a single cluster remains.
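A minimal sketch of these steps, assuming a small NumPy array of points, plain Euclidean distance and single-linkage merging (all names here are illustrative, not from any particular library):

import numpy as np

def agglomerative(points, target_clusters=1):
    # Step 1: every data point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        # Steps 2 and 4: compute pairwise cluster proximities and find the closest pair.
        best = (0, 1, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Step 3: merge the two closest clusters.
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])
print(agglomerative(pts, target_clusters=2))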
Steps of Divisive Clustering or DIANA Hierarchical Clustering
● Initially, all points in the dataset belong to one single cluster.
● Partition the cluster into the two least similar clusters.
● Proceed recursively to form new clusters until the desired number of clusters is obtained (see the sketch after this list).
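DIANA proper repeatedly splits the most heterogeneous cluster by moving out its most dissimilar members; as a simpler stand-in for the same top-down idea, the sketch below splits the largest cluster with 2-means at each step. It assumes scikit-learn is available, and the function and variable names are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_split(points, n_clusters=3, seed=0):
    # Start with a single cluster holding every point.
    clusters = [points]
    while len(clusters) < n_clusters:
        # Pick the largest remaining cluster and split it into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        big = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(big)
        clusters += [big[labels == 0], big[labels == 1]]
    return clusters

pts = np.random.RandomState(0).rand(30, 2)
for c in bisecting_split(pts):
    print(len(c), "points")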
In order to decide which clusters should be combined (for agglomerative clustering), or where a cluster should be split (for divisive clustering), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
Metric
Some commonly used metrics for hierarchical clustering are:
● Euclidean distance: d(a, b) = sqrt(Σi (ai − bi)²)
● Squared Euclidean distance: d(a, b) = Σi (ai − bi)²
● Manhattan distance: d(a, b) = Σi |ai − bi|
● Maximum distance: d(a, b) = maxi |ai − bi|
Dendrogram
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering, and its main use is to work out the best way to allocate objects to clusters. For example, a dendrogram of six observations records, at each merge height, which clusters were joined and how dissimilar they were at the moment of merging.
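Assuming SciPy and Matplotlib are available, a dendrogram like the one just described can be drawn directly from the merge history returned by hierarchical clustering (the six coordinates below are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six observations in the plane (illustrative values).
obs = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1], [9.2, 1.3]])

Z = linkage(obs, method="single")   # one row per merge: clusters joined + distance
dendrogram(Z, labels=["a", "b", "c", "d", "e", "f"])
plt.ylabel("merge distance")
plt.show()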
Types of Linkages in Clustering
The process of Hierarchical Clustering involves either merging sub-clusters (single data points in the first iteration) into larger clusters in a bottom-up manner, or dividing a larger cluster into smaller sub-clusters in a top-down manner. In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed, and the different types of linkages describe the different approaches to measuring that distance. The main linkages are (compared in code after this list):
● Single Linkage: For two clusters R and S, single linkage returns the minimum distance between two points i and j such that i belongs to R and j belongs to S.
L(R, S) = min D(i, j), i ∈ R, j ∈ S
● Complete Linkage: returns the maximum distance between two points i and j such that i belongs to R and j belongs to S.
L(R, S) = max D(i, j), i ∈ R, j ∈ S
● Average Linkage: returns the average of the distances D(i, j) over all pairs with i in R and j in S.
L(R, S) = (1 / (|R| · |S|)) Σ D(i, j), i ∈ R, j ∈ S
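Under the same SciPy assumption as above, switching between these linkage criteria is a one-argument change:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.random.RandomState(1).rand(20, 2)

for method in ("single", "complete", "average"):
    Z = linkage(pts, method=method)                   # same data, different linkage
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)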
K-Means Clustering
This method is used to quickly cluster large datasets. Here, researchers define the number of clusters prior to performing the analysis. This approach is useful when testing different models, each with a different assumed number of clusters.
Algorithm
K-means is an iterative algorithm that divides an unlabelled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties.
The k-means clustering algorithm mainly performs two tasks:
● Determines the best positions for the K centre points, or centroids, by an iterative process.
● Assigns each data point to its closest centroid; the data points nearest a given centroid form a cluster.
Hence each cluster contains data points with some commonalities and is distinct from the other clusters.
Working of the K-Means algorithm
● Step-1: Select the number K to decide the number of clusters.
● Step-2: Select K random points as initial centroids. (They need not come from the input dataset.)
● Step-3: Assign each data point to its closest centroid; this forms the predefined K clusters.
● Step-4: Recompute the centroid of each cluster as the mean of its assigned points.
● Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
● Step-6: If any reassignment occurred, go back to Step-4; otherwise stop.
● Step-7: The model is ready (a code sketch of these steps follows).
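A minimal NumPy sketch of these steps (initial centroids drawn from the data for simplicity; all names are illustrative, and the empty-cluster edge case is ignored):

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step-3: assign every point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop once no centroid (and hence no assignment) changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.vstack([np.random.RandomState(0).randn(50, 2),
                 np.random.RandomState(1).randn(50, 2) + 6])
labels, centres = kmeans(pts, k=2)
print(centres)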
Choosing the value of K (the number of clusters) in K-Means Clustering
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. It uses the WCSS value, where WCSS stands for Within-Cluster Sum of Squares and measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula, the term Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster1 and its centroid C1; the other two terms are defined in the same way.
To measure the distance between a data point and a centroid, we can use any standard metric, such as Euclidean distance or Manhattan distance.
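For instance, both distances are one-liners in NumPy:

import numpy as np

p, c = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.sqrt(((p - c) ** 2).sum()))   # Euclidean distance: 5.0
print(np.abs(p - c).sum())             # Manhattan distance: 7.0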
To find the optimal number of clusters, the elbow method follows the steps below:
● It executes K-means clustering on a given dataset for different values of K (for example, K ranging from 1 to 10).
● For each value of K, it calculates the WCSS value.
● It plots a curve of the calculated WCSS values against the number of clusters K.
● The sharp point of bend in the plot, where the curve starts to flatten like the joint of an arm, is taken as the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow method. A typical elbow curve can be generated as sketched below.
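Assuming scikit-learn (whose KMeans exposes the WCSS of a fitted model as the inertia_ attribute) and Matplotlib, the elbow curve can be produced like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

pts = np.random.RandomState(0).rand(200, 2)

wcss = []
for k in range(1, 11):                                   # K from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    wcss.append(km.inertia_)                             # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS")
plt.show()                                               # look for the elbow-shaped bend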
Relationship between Machine Learning, A.I. and Deep Learning
Artificial Intelligence is the concept of creating smart intelligent machines.
Machine Learning is a subset of artificial intelligence that helps you build AI-driven
applications.
Deep Learning is a subset of machine learning that uses vast volumes of data and
complex algorithms to train a model.
Types of Artificial Intelligence
● Reactive Machines
● Limited Memory
● Theory of Mind
● Self-awareness
Applications of Artificial Intelligence
● Machine Translation such as Google Translate
● Self-Driving Vehicles such as Google’s Waymo
● AI Robots such as Sophia and Aibo
● Speech Recognition applications like Apple’s Siri or OK Google
Types of Machine Learning
● Supervised Learning
● Unsupervised Learning
● Reinforcement Learning
Machine Learning Applications
● Sales forecasting for different products
● Fraud analysis in banking
● Product recommendations
● Stock price prediction
Types of Deep Neural Networks
● Convolutional Neural Network (CNN)
● Recurrent Neural Network (RNN)
● Generative Adversarial Network (GAN)
● Deep Belief Network (DBN)
Deep Learning Applications
● Cancer tumor detection
● Music generation
● Image colouring
● Object detection
CONCLUSION:
● Cluster analysis groups objects based on their similarity and has a wide range of applications. Measures of similarity can be computed for various types of data.
● Clustering algorithms can be categorized into partitioning methods, hierarchical methods, and others.
● The k-means algorithm is a popular partitioning-based clustering algorithm.
● The elbow method and its WCSS formula have been explained.
● Cluster analysis provides a great tool for classifying data sets into groups where the elements in each group share the same characteristics.
● Cluster analysis can be combined with other multivariate techniques such as factor analysis and principal component analysis to provide better analysis of the data.