Cluster Analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Based on information found in the data that describes the objects and their relationships.
- Also known as unsupervised classification.
Many applications
- Understanding: Group related documents for browsing, or group genes and proteins that have similar functionality.
- Summarization: Reduce the size of large data sets.
Web documents are divided into groups based on a similarity metric.
- The most common similarity metric is the dot product between two document vectors.
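The dot product between two document vectors can be sketched as follows. This is a minimal illustration, assuming each document is represented by raw word counts; the helper names `term_vector` and `dot_product` and the three sample documents are hypothetical and not taken from the source.

```python
from collections import Counter

def term_vector(text):
    """Count how often each lowercase word occurs in the text (assumed representation)."""
    return Counter(text.lower().split())

def dot_product(u, v):
    """Sum the products of counts over the terms that both vectors share."""
    return sum(count * v[term] for term, count in u.items() if term in v)

# Hypothetical sample documents, used only for illustration.
docs = {
    "d1": "gene protein function",
    "d2": "protein function similar protein",
    "d3": "web document browsing",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

sim_d1_d2 = dot_product(vectors["d1"], vectors["d2"])  # shares "protein" and "function"
sim_d1_d3 = dot_product(vectors["d1"], vectors["d3"])  # shares no terms
print(sim_d1_d2, sim_d1_d3)
```

Documents that share many terms get a larger dot product, so d1 would be grouped with d2 rather than with d3. In practice the vectors are often weighted or length-normalized first, which this sketch leaves out.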
What is not Cluster Analysis?
Supervised classification.
- Have class label information.
Simple segmentation.
- Dividing students into different registration groups alphabetically, by last name.
Results of a query.
- Groupings are a result of an external specification.
Graph partitioning.
- Graph partitioning and cluster analysis share some relevance and synergy, but the two areas are not identical.
Types of Clusterings
A clustering is a set of clusters.
One important distinction is between hierarchical and partitional sets of clusters.
Partitional Clustering
- A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
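The defining property of a partitional clustering (non-overlapping subsets, each object in exactly one) can be checked directly. The sketch below is illustrative only; the function name `is_partition` and the sample objects are assumptions, and it tests the property without implementing any particular clustering algorithm.

```python
def is_partition(objects, clusters):
    """Return True if every object lies in exactly one cluster."""
    seen = set()
    for cluster in clusters:
        members = set(cluster)
        if seen & members:      # an object already placed in an earlier cluster
            return False
        seen |= members
    return seen == set(objects)  # every object is covered, and nothing extra

# Hypothetical data objects and candidate clusterings.
objects = ["a", "b", "c", "d", "e"]
valid = [{"a", "b"}, {"c"}, {"d", "e"}]
overlapping = [{"a", "b"}, {"b", "c"}, {"d", "e"}]  # "b" appears twice
incomplete = [{"a", "b"}, {"c"}]                    # "d" and "e" are missing

print(is_partition(objects, valid))
print(is_partition(objects, overlapping))
print(is_partition(objects, incomplete))
```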
Hierarchical clustering
- A set of nested clusters organized as a hierarchical tree.
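A set of nested clusters organized as a tree can be sketched with nested tuples. This is a minimal illustration under assumed conventions: a leaf is a plain string, an internal node is a pair of subtrees, and the names `leaves`, `nested_clusters`, and the sample tree are hypothetical.

```python
def leaves(node):
    """Return the set of objects under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    left, right = node
    return leaves(left) | leaves(right)

def nested_clusters(node):
    """List every cluster in the tree, starting with the one at the root."""
    if isinstance(node, str):
        return [{node}]
    left, right = node
    return [leaves(node)] + nested_clusters(left) + nested_clusters(right)

# Hypothetical tree over five objects: each pair is one merge of two sub-clusters.
tree = (("p1", "p2"), ("p3", ("p4", "p5")))
clusters = nested_clusters(tree)

for cluster in clusters:
    print(sorted(cluster))
```

Any two clusters in the resulting list are either disjoint or one contains the other, which is what "nested" means here. Keeping only the clusters at one level of the tree gives a partitional clustering of the same objects.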
DIAW Serigne, Data Engineer, Data Scientist at Business and Decision
Paper: https://www.ieee.org.ar/downloads/Srivastava-tut-pres.pdf