CLUSTER ANALYSIS
ABSTRACT:
Clustering is a common technique for statistical data analysis, used in many fields including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups or, more precisely, the partitioning of a data set into subsets so that the data in each subset are similar according to some defined distance measure. This paper covers clustering algorithms, their benefits and applications, and concludes by discussing some limitations.
INTRODUCTION:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
The goal of performing a cluster analysis is to sort different objects or data points into groups in such a manner that the degree of association between two objects is high if they belong to the same group, and low if they belong to different groups.
Cluster analysis differs from many other statistical methods because it is mostly used when researchers do not have an assumed principle or fact that they are using as the foundation of their research.
This analysis technique is typically performed during the exploratory phase of research, since unlike techniques such as factor analysis, it doesn't make any distinction between dependent and independent variables. Instead, cluster analysis is leveraged mostly to discover structures in data without providing an explanation or interpretation.
Put simply, cluster analysis discovers structures in data without explaining why those structures exist.
For example, when cluster analysis is performed as part of market research, specific groups can be identified within a population. The analysis of these groups can then determine how likely a population cluster is to purchase products or services. If these groups are defined clearly, a marketing team can then target each cluster with tailored, targeted communications.
There are three primary methods used to perform cluster analysis:
● Hierarchical Cluster
● K-Means Cluster
● Two-Step Cluster
PROCEDURE AND DISCUSSION:
Hierarchical clustering
This is the most common method of clustering. It creates a series of models with cluster
solutions from 1 (all cases in one cluster) to n (each case is an individual cluster).
Strategies for hierarchical clustering generally fall into two types:
● Agglomerative: This is a "bottom-up" approach: each observation starts in its
own cluster, and pairs of clusters are merged as one moves up the hierarchy.
● Divisive: This is a "top-down" approach: all observations start in one cluster,
and splits are performed recursively as one moves down the hierarchy.
Algorithm
The algorithm for Agglomerative Hierarchical Clustering is (a code sketch follows the steps):
● Consider every data point as an individual cluster.
● Calculate the similarity of each cluster with all the other clusters (compute the proximity matrix).
● Merge the clusters that are most similar, i.e., closest, to each other.
● Recalculate the proximity matrix for the merged clusters.
● Repeat Steps 3 and 4 until only a single cluster remains.
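A minimal sketch of these steps, assuming a small NumPy array of points, plain Euclidean distance and single-linkage merging (all names here are illustrative, not from any particular library):

import numpy as np

def agglomerative(points, target_clusters=1):
    # Step 1: every data point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_clusters:
        # Steps 2 and 4: compute pairwise cluster proximities and find the closest pair.
        best = (0, 1, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Step 3: merge the two closest clusters.
        clusters[a] += clusters.pop(b)
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])
print(agglomerative(pts, target_clusters=2))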
Steps of Divisive Clustering or DIANA Hierarchical Clustering
● Initially, all points in the dataset belong to one single cluster.
● Partition the cluster into the two least similar clusters.
● Proceed recursively to form new clusters until the desired number of clusters is obtained (see the sketch after this list).
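DIANA proper repeatedly splits the most heterogeneous cluster by moving out its most dissimilar members; as a simpler stand-in for the same top-down idea, the sketch below splits the largest cluster with 2-means at each step. It assumes scikit-learn is available, and the function and variable names are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_split(points, n_clusters=3, seed=0):
    # Start with a single cluster holding every point.
    clusters = [points]
    while len(clusters) < n_clusters:
        # Pick the largest remaining cluster and split it into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        big = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(big)
        clusters += [big[labels == 0], big[labels == 1]]
    return clusters

pts = np.random.RandomState(0).rand(30, 2)
for c in bisecting_split(pts):
    print(len(c), "points")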
In order to decide which clusters should be combined (for agglomerative clustering), or where a cluster should be split (for divisive clustering), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
Metric
Some commonly used metrics for hierarchical clustering are:
● Euclidean distance: d(a, b) = sqrt(Σi (ai − bi)²)
● Squared Euclidean distance: d(a, b) = Σi (ai − bi)²
● Manhattan distance: d(a, b) = Σi |ai − bi|
● Maximum distance: d(a, b) = maxi |ai − bi|
Dendrogram
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering, and its main use is to work out the best way to allocate objects to clusters. For example, a dendrogram of six observations records, at each merge height, which clusters were joined and how dissimilar they were at the moment of merging.
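Assuming SciPy and Matplotlib are available, a dendrogram like the one just described can be drawn directly from the merge history returned by hierarchical clustering (the six coordinates below are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six observations in the plane (illustrative values).
obs = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1], [9.2, 1.3]])

Z = linkage(obs, method="single")   # one row per merge: clusters joined + distance
dendrogram(Z, labels=["a", "b", "c", "d", "e", "f"])
plt.ylabel("merge distance")
plt.show()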
Types of Linkages in Clustering
The process of Hierarchical Clustering involves either merging sub-clusters (single data points in the first iteration) into larger clusters in a bottom-up manner, or dividing a larger cluster into smaller sub-clusters in a top-down manner. In both types of hierarchical clustering, the distance between two sub-clusters needs to be computed, and the different types of linkages describe the different approaches to measuring that distance. The main linkages are (compared in code after this list):
● Single Linkage: For two clusters R and S, single linkage returns the minimum distance between two points i and j such that i belongs to R and j belongs to S.
L(R, S) = min D(i, j), i ∈ R, j ∈ S
● Complete Linkage: returns the maximum distance between two points i and j such that i belongs to R and j belongs to S.
L(R, S) = max D(i, j), i ∈ R, j ∈ S
● Average Linkage: returns the average of the distances D(i, j) over all pairs with i in R and j in S.
L(R, S) = (1 / (|R| · |S|)) Σ D(i, j), i ∈ R, j ∈ S
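Under the same SciPy assumption as above, switching between these linkage criteria is a one-argument change:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.random.RandomState(1).rand(20, 2)

for method in ("single", "complete", "average"):
    Z = linkage(pts, method=method)                   # same data, different linkage
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)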
K-Means Clustering
This method is used to quickly cluster large datasets. Here, researchers define the number of clusters prior to performing the analysis. This approach is useful when testing different models, each with a different assumed number of clusters.
Algorithm
K-means is an iterative algorithm that divides an unlabelled dataset into K different clusters in such a way that each data point belongs to only one group of points with similar properties.
The k-means clustering algorithm mainly performs two tasks:
● Determines the best positions for the K centre points, or centroids, by an iterative process.
● Assigns each data point to its closest centroid; the data points nearest a given centroid form a cluster.
Hence each cluster contains data points with some commonalities and is distinct from the other clusters.
Working of the K-Means algorithm
● Step-1: Select the number K to decide the number of clusters.
● Step-2: Select K random points as initial centroids. (They need not come from the input dataset.)
● Step-3: Assign each data point to its closest centroid; this forms the predefined K clusters.
● Step-4: Recompute the centroid of each cluster as the mean of its assigned points.
● Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
● Step-6: If any reassignment occurred, go back to Step-4; otherwise stop.
● Step-7: The model is ready (a code sketch of these steps follows).
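A minimal NumPy sketch of these steps (initial centroids drawn from the data for simplicity; all names are illustrative, and the empty-cluster edge case is ignored):

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step-3: assign every point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop once no centroid (and hence no assignment) changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.vstack([np.random.RandomState(0).randn(50, 2),
                 np.random.RandomState(1).randn(50, 2) + 6])
labels, centres = kmeans(pts, k=2)
print(centres)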
Choosing the value of K (the number of clusters) in K-Means Clustering
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. It uses the WCSS value, where WCSS stands for Within-Cluster Sum of Squares and measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula, the term Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster1 and its centroid C1; the other two terms are defined in the same way.
To measure the distance between a data point and a centroid, we can use any standard metric, such as Euclidean distance or Manhattan distance.
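For instance, both distances are one-liners in NumPy:

import numpy as np

p, c = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.sqrt(((p - c) ** 2).sum()))   # Euclidean distance: 5.0
print(np.abs(p - c).sum())             # Manhattan distance: 7.0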
To find the optimal number of clusters, the elbow method follows the steps below:
● It executes K-means clustering on a given dataset for different values of K (for example, K ranging from 1 to 10).
● For each value of K, it calculates the WCSS value.
● It plots a curve of the calculated WCSS values against the number of clusters K.
● The sharp point of bend in the plot, where the curve starts to flatten like the joint of an arm, is taken as the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow method. A typical elbow curve can be generated as sketched below.
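Assuming scikit-learn (whose KMeans exposes the WCSS of a fitted model as the inertia_ attribute) and Matplotlib, the elbow curve can be produced like this:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

pts = np.random.RandomState(0).rand(200, 2)

wcss = []
for k in range(1, 11):                                   # K from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)
    wcss.append(km.inertia_)                             # within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS")
plt.show()                                               # look for the elbow-shaped bend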
Relationship between Machine Learning, A.I. and Deep Learning
Artificial Intelligence is the concept of creating smart intelligent machines.
Machine Learning is a subset of artificial intelligence that helps you build AI-driven
applications.
Deep Learning is a subset of machine learning that uses vast volumes of data and
complex algorithms to train a model.
Types of Artificial Intelligence
● Reactive Machines
● Limited Memory
● Theory of Mind
● Self-awareness
Applications of Artificial Intelligence
● Machine Translation such as Google Translate
● Self-Driving Vehicles such as Google’s Waymo
● AI Robots such as Sophia and Aibo
● Speech Recognition applications like Apple’s Siri or OK Google
Types of Machine Learning
● Supervised Learning
● Unsupervised Learning
● Reinforcement Learning
Machine Learning Applications
● Sales forecasting for different products
● Fraud analysis in banking
● Product recommendations
● Stock price prediction
Types of Deep Neural Networks
● Convolutional Neural Network (CNN)
● Recurrent Neural Network (RNN)
● Generative Adversarial Network (GAN)
● Deep Belief Network (DBN)
Deep Learning Applications
● Cancer tumor detection
● Music generation
● Image colouring
● Object detection
CONCLUSION:
● Cluster analysis groups objects based on their similarity and has a wide range of applications. Measures of similarity can be computed for various types of data.
● Clustering algorithms can be categorized into partitioning methods, hierarchical methods, and others.
● The k-means algorithm is a popular partitioning-based clustering algorithm.
● The elbow method and its WCSS formula have been explained.
● Cluster analysis provides a great tool for classifying data sets into groups where the elements in each group share the same characteristics.
● Cluster analysis can be combined with other multivariate techniques such as factor analysis and principal component analysis to provide better analysis of the data.