Data Mining with Cluster Analysis
José Jaime Comé
Information Management Associate @ UNHCR | Data Specialist/Statistician (Python||R||SQL||PowerBI||Excel) | YouTube: 15K+ subscribers
Cluster analysis is a statistical technique and one of the methods of data mining. It consists of dividing unlabeled data points into different clusters by identifying patterns and relationships within the dataset. Areas that rely on cluster analysis to gain insight from data include customer segmentation, credit risk management, fraud detection, spam filtering, intrusion detection, and others.
Methodologies
Data are grouped into subsets so that the objects within each subset share a common pattern, generally defined by some measure of distance. In other words, if you have n objects, each measured on m variables, the goal is to group the objects into c classes.
This technique can be applied to two kinds of problems:
· Data partition: given a set of data, we want to group it into subsets that are homogeneous within each subset and heterogeneous between subsets.
· Classification of variables: variables are grouped in order to reduce their number.
We start from an n × m matrix (n objects, m variables), or from an n × n matrix containing the distances between objects.
Let Ξ be a sample of n objects. Our goal is to generate c groups or regions (clusters) A_1, …, A_c such that A_i ∩ A_j = ∅ for every i ≠ j and A_1 ∪ A_2 ∪ … ∪ A_c = Ξ.
The matrix X provides the values of each variable for each object.
The jth row of the matrix X contains the values of all the variables for the jth individual, while the ith column shows the values of the ith variable across all the individuals in the sample. Techniques that group individuals operate on this matrix, while those that group variables are applied to its transpose. Such techniques are necessary because an exhaustive search over all possible partitions is prohibitively expensive: the number of partitions is given by a Stirling number of the second kind. If we want to classify n observations into c groups, the number of possible partitions is

$$ S(n, c) = \frac{1}{c!} \sum_{k=0}^{c} (-1)^k \binom{c}{k} (c - k)^n $$
If the number of clusters is not known beforehand, the number of possible partitions of the n observations is obtained by summing over every possible value of c:

$$ B_n = \sum_{c=1}^{n} S(n, c) $$
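As an illustration of how quickly this number grows, here is a minimal sketch in plain Python evaluating the Stirling number of the second kind with the formula above:

```python
from math import comb, factorial

def stirling2(n, c):
    """Stirling number of the second kind: number of ways to partition
    n objects into c non-empty groups."""
    return sum((-1) ** k * comb(c, k) * (c - k) ** n
               for k in range(c + 1)) // factorial(c)

# Even for modest sizes an exhaustive search over all partitions is hopeless:
print(stirling2(25, 4))  # 46,771,289,738,810 possible partitions
```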
Association Measures
Since cluster analysis groups data based on distances and similarities, we need to define the measures and conditions used to group objects:
The greater the distance, the less similar the individuals are. The higher the proximity value, the more similar the individuals are.
Gower showed that similarities can be converted into distances (for example, via d_ij = √(1 − s_ij)) provided that the matrix of similarity coefficients, S, is positive semi-definite.
Binary Variables
The most common type of multivariate categorical data is binary data, where each variable indicates the presence or absence of a characteristic or condition (generally represented by 1 and 0, respectively). Many similarity measures have been proposed for such data. When comparing two individuals i and j across a set of binary variables, there are four possible outcomes:
a is the number of variables on which individuals i and j both take the value 1.
b is the number of variables on which individual i takes the value 1 and individual j takes the value 0.
c is the number of variables on which individual i takes the value 0 and individual j takes the value 1.
d is the number of variables on which individuals i and j both take the value 0.
Similarity Measures Formula
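Many coefficients can be built from these counts. As an illustration, here is a minimal sketch computing two commonly used ones, the simple matching coefficient (a + d) / (a + b + c + d) and the Jaccard coefficient a / (a + b + c):

```python
import numpy as np

def binary_similarities(x_i, x_j):
    """Similarity between two binary vectors via the a, b, c, d counts."""
    x_i, x_j = np.asarray(x_i, dtype=bool), np.asarray(x_j, dtype=bool)
    a = np.sum(x_i & x_j)    # both take the value 1
    b = np.sum(x_i & ~x_j)   # i takes 1, j takes 0
    c = np.sum(~x_i & x_j)   # i takes 0, j takes 1
    d = np.sum(~x_i & ~x_j)  # both take the value 0
    simple_matching = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c) if (a + b + c) > 0 else 0.0
    return simple_matching, jaccard

print(binary_similarities([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # (0.6, 0.5)
```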
Case of Mixed Variables
Gower (1971) proposed that, instead of computing a separate measure for each type of variable and combining them afterwards, a single coefficient be defined that handles all variable types at once.
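In its standard form, Gower's general coefficient combines a per-variable similarity s_ijh (defined according to the type of variable h) with a weight w_ijh that equals 1 when the comparison between i and j on variable h is valid and 0 otherwise; the associated distance can then be taken as the square root of one minus the similarity:

$$ s_{ij} = \frac{\sum_{h=1}^{m} w_{ijh}\, s_{ijh}}{\sum_{h=1}^{m} w_{ijh}}, \qquad d_{ij} = \sqrt{1 - s_{ij}} $$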
Types of Clustering Methods
There are hierarchical and non-hierarchical methods.
Hierarchical Methods
The objective of these methods is either to merge clusters to form a new one or to split an existing cluster into two, so that a distance function is minimized or some similarity measure is maximized. Hierarchical methods can be divided into agglomerative and divisive methods.
Agglomerative hierarchical clustering starts with the individual observations as leaves and successively merges clusters together; it is a bottom-up approach. Divisive hierarchical clustering starts at the root and recursively splits the clusters; it is a top-down approach.
In agglomerative hierarchical clustering, the first step has as many clusters as there are points, and clusters are gradually merged until all points belong to a single cluster. At each step, the two closest clusters are merged.
To determine how close points are, we need to calculate the distance between two points using one of these metrics: Euclidean distance, Minkowski distance, Manhattan (city-block) distance, Chebyshev (maximum) distance, or Mahalanobis distance.
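As an illustration, a minimal sketch using scipy.spatial.distance for the metrics listed above (the Mahalanobis distance additionally needs the inverse covariance matrix of the data):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 5.0])

print(distance.euclidean(x, y))        # Euclidean distance
print(distance.minkowski(x, y, p=3))   # Minkowski distance with p = 3
print(distance.cityblock(x, y))        # Manhattan / city-block distance
print(distance.chebyshev(x, y))        # Chebyshev / maximum distance

# Mahalanobis distance: uses the inverse covariance matrix of a data sample
X = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```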
We also need to link clusters using one of the following linkage methods: single linkage (nearest neighbour), complete linkage (furthest neighbour), or average linkage. Other methods include centroid-based methods, Ward's method, and divisive (dissociative) methods.
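Putting the two choices together, here is a minimal sketch of agglomerative clustering with scipy on synthetic data; the metric argument selects the distance and the method argument selects the linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two synthetic groups of points in 2D
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(4, 0.5, size=(10, 2))])

# Bottom-up merging: each step joins the two closest clusters
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting tree (dendrogram) to obtain a chosen number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```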
Non-hierarchical methods
In these methods, observations are iteratively reassigned to clusters until a stable solution is reached. In many cases, the number of clusters is known beforehand. Compared with hierarchical methods, the main differences are: individuals can move between clusters, so an individual misclassified at an earlier stage can be reclassified later; and if the data structure is known in advance there is no problem in choosing the number of clusters, while if it is not, you can try different partitions or first use a hierarchical technique to help decide the number of clusters.
Types of non-hierarchical methods:
· Relocation or partition methods:
a) K-Means (a minimal sketch follows this list)
b) K-Medoids
· Density methods:
a) Based on topological density: Mean Shift and DBSCAN
b) Based on probability density: EM-GMM
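A minimal K-Means sketch with scikit-learn on synthetic data; the number of clusters is fixed in advance, and each iteration reassigns observations to the nearest centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic groups of points in 2D
X = np.vstack([rng.normal((0, 0), 0.5, size=(50, 2)),
               rng.normal((5, 0), 0.5, size=(50, 2)),
               rng.normal((0, 5), 0.5, size=(50, 2))])

# Observations can move between clusters from one iteration to the next
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```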
Cluster evaluation statistics
It is important to evaluate the quality of the results produced by different algorithms, and also to determine the optimal number of clusters. Internal validation measures generally reflect compactness (how close objects are within the same cluster) and separation (how well separated a cluster is from the other clusters).
Generally, most indices used for internal cluster validation combine compactness (C) and separation (S) measures in a form such as

$$ \text{Index} = \frac{\alpha \times \text{Separation}}{\beta \times \text{Compactness}} $$

where α and β are weights.
Below are some of the most widely used validation coefficients:
Silhouette coefficient (Rousseeuw, 1987): for each observation i,

$$ S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

where a(i) is the average distance from i to the other observations in its own cluster and b(i) is the smallest average distance from i to the observations of any other cluster.
The maximum index value is used to determine the optimal number of groups in the data for hierarchical methods.
Observations with a large value of S(i) are well clustered, a small S(i) (close to 0) indicates that the observation lies between two clusters, and observations with a negative S(i) are probably assigned to the wrong cluster.
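A minimal sketch of the silhouette evaluation with scikit-learn on synthetic data: silhouette_score gives the average silhouette for a candidate number of clusters, and silhouette_samples gives the per-observation S(i) values discussed above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Synthetic data with four well-separated groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=1)

# Average silhouette for several candidate numbers of clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Per-observation S(i): close to 1 = well clustered, near 0 = between two
# clusters, negative = probably assigned to the wrong cluster
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
print(silhouette_samples(X, labels)[:5])
```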
Gap Statistic: published by R. Tibshirani, G. Walther, and T. Hastie (Stanford University, 2001). The basic idea is to compare the change in within-cluster dispersion, for an increasing number of clusters, with that expected under a reference null distribution.
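A simplified sketch of the idea (not the authors' reference implementation): the observed within-cluster dispersion is compared, on a log scale, with that obtained on reference datasets drawn uniformly over the bounding box of the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def within_dispersion(X, labels):
    """Pooled within-cluster sum of squared distances to the cluster centroids."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = mean(log W_k on uniform reference data) - log W_k on the data."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    log_wk = np.log(within_dispersion(X, labels))
    ref_log_wk = []
    for _ in range(n_refs):
        # Reference data drawn uniformly over the bounding box of X
        X_ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_ref)
        ref_log_wk.append(np.log(within_dispersion(X_ref, ref_labels)))
    return np.mean(ref_log_wk) - log_wk

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=2)
print([round(gap_statistic(X, k), 3) for k in range(1, 6)])
```

In the original paper the chosen number of clusters is the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1), where s(k+1) is the standard error of the reference dispersions; the sketch above only computes the gap values themselves.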
Dunn Index (Dunn, 1974): defined as the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance; larger values indicate compact, well-separated clusters.
Calinski-Harabasz (CH) Index
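The CH index is the ratio of between-cluster to within-cluster dispersion (adjusted for degrees of freedom), so higher values indicate denser, better-separated clusters. A minimal sketch with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=1)

# Higher CH values indicate denser, better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(k, round(calinski_harabasz_score(X, labels), 1))
```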
Agglomerative Coefficient: a global clustering measure associated with hierarchical agglomerative methods. In other words, it does not assess the adequacy of a chosen number of clusters; rather, it allows comparing dendrogram structures, that is, dendrograms produced by different hierarchical methods.
For each observation j, define d_j as the quotient between its dissimilarity to the first cluster it is merged with and the dissimilarity of the final merger of the algorithm; the agglomerative coefficient is then the average of (1 − d_j) over all observations. The closer it is to 1, the stronger the structure of the dendrogram obtained. It is also the average width of the banner plot (or the filled proportion of that plot). The agglomerative coefficient grows with the number of observations, so it should not be used to compare datasets with very different sample sizes.