Week 6: Unsupervised Machine Learning: Practical Overview and Applications
Alaaeddin Alweish
Solutions Architect & Lead Developer | Semantic AI | Graph Data Engineering & Analysis
In our previous article, we explored supervised learning in detail. This week, we will dive into another major branch of machine learning: unsupervised learning. We'll look at the definition and types of unsupervised learning, its key algorithms, and real-world applications.
This article aims to provide a clear overview and practical examples. It starts with the basics for non-technical readers and gradually moves into the logic and implementation of key algorithms, using simplified pseudo-code to explain how they work.
1. What is Unsupervised Learning?
It's a type of machine learning where the model is trained on an unlabeled dataset. Unlike supervised learning, where the goal is to predict a known output, unsupervised learning aims to discover hidden patterns or intrinsic structures in the input data.
Think of it as exploring a new city without a map or guide, discovering interesting places, and understanding the layout on your own. The model groups similar data points together and identifies patterns that help us make sense of the data.
1.1. Types of Unsupervised Learning:
1.2. Key Concepts in Unsupervised Learning:
2. Real-World Applications:
Unsupervised learning has numerous real-world applications across various domains. Here are some examples, emphasizing the use of features and pattern discovery:
2.1. Marketing:
2.2. Healthcare:
2.3. Finance:
2.4. Retail:
2.5. Telecommunications:
2.6. Manufacturing:
2.7. Environmental Science:
2.8. Urban Planning:
2.9. Social Networks:
3. Let's Get More Technical
If you're curious about the technical details, this section is for you. We'll uncover more about unsupervised learning concepts, metrics, and key algorithms:
3.1. Feature Engineering and Data Preprocessing:
In unsupervised learning, as in supervised learning, feature engineering and data preprocessing are essential. However, the focus and techniques might differ to suit the goals of unsupervised learning, which primarily involves discovering patterns and structures in unlabeled data.
3.1.1. Feature Engineering
3.1.2. Feature Scaling
3.1.3. Data Cleaning
3.1.4. Data Transformation
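To make the preprocessing steps above concrete, here is a minimal sketch using scikit-learn's StandardScaler. The feature names and values are invented for illustration; they are not from a real dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customer data: [annual_spend, visits_per_month]
# (illustrative values only)
X = np.array([[1200.0, 3], [300.0, 1], [5000.0, 12], [800.0, 2]])

# Scale each feature to zero mean and unit variance so that
# distance-based algorithms (like K-Means) treat all features equally,
# regardless of their original units.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

Without scaling, the spend feature (hundreds to thousands) would dominate the visit count (single digits) in any distance computation.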
3.2. Evaluation Metrics:
Unlike supervised learning, unsupervised learning doesn't have clear-cut labels for evaluation. However, we can use some metrics to assess the quality of the models:
3.2.1. Clustering Metrics:
3.2.2. Dimensionality Reduction Metrics:
3.3. Key Algorithms in Unsupervised Learning
Let's dive into the top five algorithms. We'll break down their concepts, share practical examples, and explain how each one works in simplified pseudo-code to clarify its logic and steps:
3.3.1. K-Means Clustering
Concept:
K-Means is a popular clustering algorithm that partitions data into K distinct clusters based on feature similarity. Each cluster is represented by its centroid, which is the mean of all data points in the cluster.
Example:
Suppose you want to segment customers based on their purchasing behavior. K-Means can group customers into clusters where each cluster represents a group of customers with similar purchase patterns.
How It Works:
- Initialize K centroids randomly
- Repeat until convergence:
  - For each customer:
    - Calculate the distance to each centroid based on features such as purchase history and demographics
    - Assign the customer to the nearest centroid
  - Update centroids by calculating the mean of all customers in each cluster
- Return the final centroids and clusters
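The steps above can be sketched with scikit-learn's KMeans. The customer data below is synthetic, generated just to mimic two spending groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic customers: [monthly_spend, purchases_per_month]
X = np.vstack([
    rng.normal([50, 2], [10, 0.5], (100, 2)),   # low spenders
    rng.normal([500, 12], [50, 2], (100, 2)),   # high spenders
])

# KMeans internally runs the initialize/assign/update loop until
# the centroids stop moving (or max_iter is reached).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # one centroid per segment
print(kmeans.labels_[:5])       # cluster assignment per customer
```

In practice you would scale the features first (see the preprocessing section) and try several values of K, comparing them with a metric such as inertia or the silhouette score.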
3.3.2. Hierarchical Clustering
Concept:
Hierarchical clustering builds a tree-like structure of nested clusters by either merging or splitting clusters recursively. There are two main types: Agglomerative (bottom-up approach) and Divisive (top-down approach).
Example:
Hierarchical clustering can group genes with similar expression patterns in bioinformatics, creating a dendrogram to visualize the hierarchy of clusters.
How It Works (Agglomerative):
- Start with each gene as a single cluster
- Repeat until only one cluster remains:
  - Find the two closest clusters based on expression patterns
  - Merge them into a single cluster
- Return the hierarchy of clusters
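The agglomerative procedure above maps directly onto SciPy's linkage function, which records every merge so the dendrogram can be drawn or cut later. The "expression profiles" here are random numbers standing in for real measurements:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Toy data: 6 "genes" x 4 conditions, forming two obvious groups
genes = np.vstack([
    rng.normal(0, 0.1, (3, 4)),
    rng.normal(3, 0.1, (3, 4)),
])

# Ward linkage repeatedly merges the two clusters whose union causes
# the smallest increase in within-cluster variance; Z records each merge.
Z = linkage(genes, method="ward")

# Cut the resulting tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Passing Z to scipy.cluster.hierarchy.dendrogram would plot the hierarchy, which is how the gene-expression dendrograms mentioned above are typically produced.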
3.3.3. Principal Component Analysis (PCA)
Concept:
PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional form while retaining most of the variance in the data. It identifies the principal components, which are linear combinations of the original features.
Example:
PCA can reduce the dimensionality of a dataset with many features, such as a dataset of images with thousands of pixel values, making it easier to analyze and visualize.
How It Works:
- Standardize the data
- Calculate the covariance matrix
- Compute eigenvectors and eigenvalues
- Select the top k eigenvectors
- Transform the data using the selected eigenvectors
- Return the transformed data
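Scikit-learn's PCA wraps all of the steps above (centering, covariance, eigendecomposition, projection). The dataset below is synthetic, built so that most of its variance lies along just two directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples in 10 dimensions, but the signal lives in a 2-D subspace
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 10)) + rng.normal(scale=0.05, size=(100, 10))

# Project onto the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 here
```

The explained_variance_ratio_ attribute is the usual way to decide how many components to keep: plot its cumulative sum and stop where the curve flattens.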
3.3.4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Concept:
t-SNE is a nonlinear dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space, typically 2 or 3 dimensions, for visualization. It minimizes the divergence between probability distributions of data points in high-dimensional and low-dimensional spaces.
Example:
t-SNE is commonly used to visualize complex datasets like handwritten digits or word embeddings, where the structure of the data is difficult to capture in high dimensions.
How It Works:
- Compute pairwise similarities in high-dimensional space for each digit
- Define probability distributions for high and low-dimensional spaces
- Minimize divergence between distributions by adjusting point positions in low-dimensional space
- Return the low-dimensional representation
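A minimal sketch of the digits example with scikit-learn's TSNE (a subset of the data is used here just to keep the run fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                 # 8x8 handwritten digit images
X, y = digits.data[:500], digits.target[:500]  # 500 samples, 64 features

# perplexity balances attention to local vs. global structure;
# values between 5 and 50 are common starting points.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (500, 2)
```

Scattering X_2d colored by y typically shows the ten digit classes as distinct islands. Note that t-SNE is for visualization: distances between far-apart islands are not meaningful, and it has no transform method for new data.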
3.3.5. Apriori Algorithm
Concept:
The Apriori algorithm is used for mining frequent itemsets and discovering association rules in transactional datasets. It identifies itemsets that appear frequently together and derives rules indicating how the presence of one item affects the presence of another.
Example:
In a retail setting, Apriori can identify products often bought together. For instance, parents buying baby products like diapers and formula also tend to buy more coffee. This helps in market basket analysis and cross-selling strategies.
How It Works:
1. Initialize candidate itemsets of length 1:
- Start with each product as a single itemset.
2. Repeat until no more frequent itemsets are found:
- Count the occurrences of each candidate itemset in the transaction dataset.
- Retain the itemsets that meet the minimum support threshold.
- Generate new candidate itemsets by joining the retained itemsets.
3. Generate association rules from the frequent itemsets:
- For each frequent itemset, find all non-empty subsets.
- For every subset, calculate the confidence of the rule: (itemset - subset) => subset.
- Retain the rules that meet the minimum confidence threshold.
4. Return the association rules:
- Rules like "If diapers, then coffee" can be derived if they meet the support and confidence thresholds.
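The steps above can be sketched in plain Python. The transactions below are invented for illustration (echoing the diapers-and-coffee example); real implementations such as mlxtend add pruning and better data structures:

```python
from itertools import combinations

# Toy transaction data (illustrative only)
transactions = [
    {"diapers", "formula", "coffee"},
    {"diapers", "coffee"},
    {"diapers", "formula"},
    {"coffee", "bread"},
    {"diapers", "coffee", "bread"},
]
min_support, min_conf = 0.4, 0.6

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1-2: grow frequent itemsets one item at a time
items = sorted({i for t in transactions for i in t})
frequent = {}
k_sets = [frozenset([i]) for i in items]
while k_sets:
    k_sets = [s for s in k_sets if support(s) >= min_support]
    frequent.update({s: support(s) for s in k_sets})
    # Join step: candidates one item larger; the support check on the
    # next pass discards any that are infrequent.
    k_sets = list({a | b for a in k_sets for b in k_sets
                   if len(a | b) == len(a) + 1})

# Step 3-4: derive rules, confidence(A => B) = support(A ∪ B) / support(A)
rules = []
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = frequent[itemset] / frequent[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent, conf))
                print(set(antecedent), "=>", set(itemset - antecedent),
                      round(conf, 2))
```

On this toy data the surviving rules include "diapers => coffee" and "formula => diapers", each passing both the support and confidence thresholds.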
3.4. Other Common Algorithms
Here's a brief explanation of other significant algorithms:
3.5. Common Challenges and Methods
Here are some common challenges in unsupervised learning and the methods used to overcome them.
Dealing with High-Dimensional Data
Model Selection and Validation in Unsupervised Learning
Handling Imbalanced Data in Clustering
Are you a developer interested in practical examples?
The practical exercises in the excellent notebook below will help you solidify key concepts in unsupervised learning. It dives into techniques like clustering with K-Means, teaching you how to apply it, visualize decision boundaries, handle variability, and determine the best number of clusters. It also covers DBSCAN, spectral clustering, agglomerative clustering, and Gaussian mixtures for both clustering and anomaly detection.
The notebook was created by Aurélien Géron, author of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" and former PM of YouTube's video classification team.
https://colab.research.google.com/github/ageron/handson-ml3/blob/main/09_unsupervised_learning.ipynb
Conclusion
Unsupervised learning is a major branch of machine learning that focuses on discovering hidden patterns and structures in data without labeled outputs. We explored its key concepts, such as clustering and dimensionality reduction, and discussed real-world applications across various domains. We also explored the technical details by reviewing essential algorithms like K-Means, hierarchical clustering, PCA, t-SNE, and Apriori, and listed other common algorithms.
Learning these core techniques will empower you to tackle a wide range of challenges in unsupervised learning and enhance your ability to extract meaningful insights from unlabeled data.
In this Zero to Hero: Learn AI Newsletter, we will publish one article weekly (or biweekly for in-depth articles). Next week, we'll dive deeper into Reinforcement Learning. Check out the plan of this series here:
Share your thoughts, questions, and suggestions in the comments section.
Help others by sharing this article, and join us in shaping this learning journey.