CLUSTER ANALYSIS

ABSTRACT:

Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups, or more precisely, the partitioning of a data set into subsets, so that the data in each subset are similar according to some defined distance measure. This paper covers clustering algorithms, their benefits and their applications, and concludes by discussing some limitations.

INTRODUCTION:

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

The goal of performing a cluster analysis is to sort different objects or data points into groups in such a manner that the degree of association between two objects is high if they belong to the same group, and low if they belong to different groups.

Cluster analysis differs from many other statistical methods because it is mostly used when researchers do not have an assumed principle or fact serving as the foundation of their research.

This analysis technique is typically performed during the exploratory phase of research, since unlike techniques such as factor analysis, it doesn’t make any distinction between dependent and independent variables. Instead, cluster analysis is leveraged mostly to discover structures in data without providing an explanation or interpretation.

Put simply, cluster analysis discovers structures in data without explaining why those structures exist.

For example, when cluster analysis is performed as part of market research, specific groups can be identified within a population. The analysis of these groups can then determine how likely a population cluster is to purchase products or services. If these groups are defined clearly, a marketing team can then target each cluster with tailored, targeted communication.

There are three primary methods used to perform cluster analysis:

● Hierarchical Cluster

● K-Means Cluster

● Two-Step Cluster


PROCEDURE AND DISCUSSION:

Hierarchical clustering

This is the most common method of clustering. It creates a series of models with cluster solutions ranging from 1 (all cases in one cluster) to n (each case is an individual cluster).

Strategies for hierarchical clustering generally fall into two types:

● Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

● Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Algorithm

The algorithm for Agglomerative Hierarchical Clustering is:

● Consider every data point as an individual cluster.

● Calculate the similarity of each cluster with all the other clusters (the proximity matrix).

● Merge the clusters that are most similar or closest to each other.

● Recalculate the proximity matrix for the new clusters.

● Repeat the previous two steps until only a single cluster remains.
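The steps above can be sketched in plain Python; this is a minimal illustration assuming 2-D points, Euclidean distance, and single linkage (minimum pairwise distance) as the merge rule:

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Step 1: every data point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Step 2: find the pair of clusters with the smallest
        # single-linkage (minimum pairwise) distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 3: merge that pair and repeat.
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
print(agglomerative(points, 2))  # two well-separated groups
```

Stopping at k clusters (instead of 1) is the usual way to read a flat clustering out of the hierarchy.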


Steps of Divisive Clustering or DIANA Hierarchical Clustering

● Initially, all points in the dataset belong to one single cluster.

● Partition the cluster into the two least similar sub-clusters.

● Proceed recursively, forming new clusters until the desired number of clusters is obtained.
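One split step can be sketched as follows. Note this is a simplified heuristic rather than the full DIANA procedure: it seeds the two sub-clusters with the most distant pair of points and assigns every point to the nearer seed.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def split(cluster):
    """Partition one cluster into two dissimilar sub-clusters."""
    # Pick the two points farthest apart as seeds for the sub-clusters.
    a, b = max(((p, q) for p in cluster for q in cluster),
               key=lambda pair: dist(*pair))
    left, right = [], []
    for p in cluster:
        (left if dist(p, a) <= dist(p, b) else right).append(p)
    return left, right

data = [(0, 0), (1, 0), (9, 9), (10, 9)]
print(split(data))  # the two far-apart pairs end up in separate sub-clusters
```

Applying `split` recursively to the largest (or most heterogeneous) remaining cluster yields the top-down hierarchy described above.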

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Metric

Some commonly used metrics for hierarchical clustering are the Euclidean distance, the squared Euclidean distance, the Manhattan distance and the maximum (Chebyshev) distance.

Dendrogram

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of six observations shown on the scatterplot to the left.


Types of Linkages in Clustering

The process of Hierarchical Clustering involves either clustering sub-clusters (data points in the first iteration) into larger clusters in a bottom-up manner or dividing a larger cluster into smaller sub-clusters in a top-down manner. In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed. The different types of linkages describe the different approaches to measure the distance between two sub-clusters of data points. The different types of linkages are:

Single Linkage: For two clusters R and S, the single linkage returns the minimum distance between two points i and j such that i belongs to R and j belongs to S.

L(R, S) = min(D(i, j)), i ∈ R, j ∈ S
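The formula can be written out directly; a small sketch assuming 2-D points and Euclidean distance for D:

```python
import math

def D(i, j):
    """Euclidean distance between two 2-D points."""
    return math.hypot(i[0] - j[0], i[1] - j[1])

def single_linkage(R, S):
    """L(R, S) = min(D(i, j)) over i in R, j in S."""
    return min(D(i, j) for i in R for j in S)

R = [(0, 0), (0, 2)]
S = [(3, 2), (5, 5)]
print(single_linkage(R, S))  # → 3.0, between (0, 2) and (3, 2)
```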


K-Means Clustering

This method is used to quickly cluster large datasets. Here, researchers define the

number of clusters prior to performing the actual study. This approach is useful when

testing different models with a different assumed number of clusters

Algorithm

K-means is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group whose members have similar properties.

The k-means clustering algorithm mainly performs two tasks:

● Determines the best value for the K centre points, or centroids, by an iterative process.

● Assigns each data point to its closest centre; the data points near a given centre form a cluster.

Hence each cluster contains data points with some commonalities, and it is away from other clusters.


The below diagram explains the working of the K-means Clustering Algorithm:


Working of the K-Means algorithm

● Step-1: Select the number K to decide the number of clusters.

● Step-2: Select K random points as centroids (these need not be points from the input dataset).

● Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

● Step-4: Calculate the mean of each cluster and place a new centroid there.

● Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.

● Step-6: If any reassignment occurs, go to Step-4; else go to FINISH.

● Step-7: The model is ready.
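The steps above can be sketched in plain Python; this is a minimal illustration assuming 2-D points, Euclidean distance, and the first K points as initial centroids (Step-2 normally picks them at random):

```python
import math

def kmeans(points, k, iters=100):
    """Minimal K-means: returns the final centroids and clusters."""
    centroids = points[:k]                        # Step-2: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step-3 / Step-5: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step-4: move each centroid to the mean of its cluster.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                      # Step-6: no change -> finish
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, 2)
print(centroids)
```

With these two well-separated groups, the loop converges after a couple of iterations; each centroid ends up at the mean of its group.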

Choosing the value of K, the number of clusters, in K-means Clustering

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of

clusters. This method uses the concept of WCSS value. WCSS stands for Within Cluster

Sum of Squares, which defines the total variations within a cluster. The formula to

calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS, ∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within Cluster1, and the same applies for the other two terms.

To measure the distance between data points and centroids, we can use any method such as Euclidean distance or Manhattan distance.
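The WCSS formula can be computed directly; a small sketch assuming Euclidean distance, with each cluster given as a list of 2-D points paired with its centroid:

```python
import math

def wcss(clusters, centroids):
    """Sum, over all clusters, of the squared distance of each point to its centroid."""
    return sum(math.dist(p, c) ** 2
               for cl, c in zip(clusters, centroids)
               for p in cl)

clusters = [[(0, 0), (0, 2)], [(4, 4), (6, 4)]]
centroids = [(0, 1), (5, 4)]
print(wcss(clusters, centroids))  # → 4.0 (each point is at distance 1)
```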

To find the optimal number of clusters, the elbow method follows the below steps:

● It executes K-means clustering on a given dataset for different K values (e.g., ranging from 1 to 10).

● For each value of K, it calculates the WCSS value.

● It plots a curve of the calculated WCSS values against the number of clusters K.

● The sharp point of bend, where the plot looks like an arm, is considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, the approach is known as the elbow method. The graph for the elbow method looks like the below image:

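The procedure can be sketched in plain Python. The 1-D dataset, the evenly spaced seeding, and the iteration cap below are illustrative assumptions; in practice, the WCSS values would be plotted against K and the bend read off the curve.

```python
def kmeans_wcss(xs, k, iters=100):
    """Run a minimal 1-D K-means and return the final WCSS."""
    centroids = [xs[i * len(xs) // k] for i in range(k)]  # evenly spaced seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda i: abs(x - centroids[i]))].append(x)
        # Update step: move each centroid to the mean of its cluster.
        new = [sum(cl) / len(cl) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:      # converged
            break
        centroids = new
    # WCSS: total squared distance of every point to its centroid.
    return sum((x - c) ** 2 for cl, c in zip(clusters, centroids) for x in cl)

xs = [1, 2, 3, 20, 21, 22, 40, 41, 42]   # three obvious groups
for k in range(1, 7):
    print(k, round(kmeans_wcss(xs, k), 1))
```

On this data the WCSS drops sharply up to K = 3 and only slightly afterwards, so the elbow sits at K = 3, matching the three groups.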

Relationship between machine learning, AI and deep learning

Artificial Intelligence is the concept of creating smart, intelligent machines.

Machine Learning is a subset of artificial intelligence that helps you build AI-driven applications.

Deep Learning is a subset of machine learning that uses vast volumes of data and complex algorithms to train a model.


Types of Artificial Intelligence

● Reactive Machines

● Limited Memory

● Theory of Mind

● Self-awareness

Applications of Artificial Intelligence

● Machine Translation such as Google Translate

● Self-Driving Vehicles such as Google’s Waymo

● AI Robots such as Sophia and Aibo

● Speech Recognition applications like Apple’s Siri or OK Google

Types of Machine Learning

● Supervised Learning

● Unsupervised Learning

● Reinforcement Learning

Machine Learning Applications

● Sales forecasting for different products

● Fraud analysis in banking

● Product recommendations

● Stock price prediction

Types of Deep Neural Networks

● Convolutional Neural Network (CNN)

● Recurrent Neural Network (RNN)

● Generative Adversarial Network (GAN)

● Deep Belief Network (DBN)

Deep Learning Applications

● Cancer tumors detection

● Music generation

● Image colouring

● Object detection

CONCLUSION:

● Cluster analysis groups objects based on their similarity and has wide application. Measures of similarity can be computed for various types of data.

● Clustering algorithms can be categorized into partitioning methods, hierarchical methods, and others.

● k-means algorithms are popular partitioning-based clustering algorithms.

● The elbow method and its formula have been explained.

● Cluster analysis provides a great tool for classifying data sets into groups where the elements in each group share the same characteristics.

● Cluster analysis can be combined with other multivariate techniques such as factor analysis and principal component analysis to provide better analysis of the data.


