PCA in Machine Learning & Data Science
Dhiraj Patra
Cloud-Native Architect | AI, ML, GenAI Innovator & Mentor | Quantitative Financial Analyst
Principal Component Analysis (PCA) in Data Science
PCA is a dimensionality reduction technique that simplifies complex datasets while preserving as much of their variance as possible. It does so by transforming the data into a new coordinate system whose axes, the principal components, are orthogonal directions ordered by the amount of variance they capture.
Key Concepts:
- Variance: the spread of the data along a given direction; PCA looks for the directions of maximum variance.
- Covariance matrix: summarizes how pairs of features vary together; PCA is built on its eigendecomposition.
- Eigenvectors: the directions of the principal components in feature space.
- Eigenvalues: the amount of variance captured along each eigenvector; larger eigenvalues mark more important components.
Why PCA and Eigen Concepts are Important in Data Science:
Eigenvectors and eigenvalues turn the vague goal of "preserving variability" into a concrete computation: the eigenvectors of the covariance matrix give the directions of maximum variance, and the eigenvalues rank those directions, so dropping the low-eigenvalue components discards the least informative dimensions first.
Applications in Data Science:
- Dimensionality reduction before model training, cutting computation time and the risk of overfitting.
- Visualization of high-dimensional data in 2 or 3 components.
- Noise reduction, by discarding low-variance components.
- Feature decorrelation, since the principal components are uncorrelated with one another.
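A practical question that comes up in all of these applications is how many components to keep. A minimal sketch of the usual approach, the cumulative explained-variance ratio, using a synthetic dataset invented here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic dataset: 100 samples, 5 features, but only 2 underlying signals
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))]) + 0.01 * rng.normal(size=(100, 5))

# Center the data and get covariance eigenvalues in descending order
Xc = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

# Fraction of total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components that retains at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95) + 1)
print("explained variance ratios:", np.round(explained_ratio, 3))
print("components needed for 95% variance:", k)
```

Because the five features are built from only two underlying signals plus a little noise, the first two components account for nearly all of the variance, so the remaining three can be dropped with almost no loss of information.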
Here’s a small Python example of PCA using NumPy:
import numpy as np
# Example dataset (3 samples, 2 features)
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9]])
# Step 1: Center the data (subtract each feature's mean)
mean = np.mean(data, axis=0)
data_centered = data - mean
# Step 2: Compute the covariance matrix
cov_matrix = np.cov(data_centered.T)
# Step 3: Compute eigenvalues and eigenvectors
# (eigh is the right choice for symmetric matrices such as a covariance matrix)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Step 4: Sort eigenvalues and eigenvectors
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]
# Step 5: Transform data into the new PCA space
projected_data = np.dot(data_centered, eigenvectors)
# Output
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
print("Projected Data:\n", projected_data)
Explanation:
- Centering shifts the dataset so each feature has mean zero; without it, the first component would mostly point at the mean rather than at the direction of greatest spread.
- The covariance matrix encodes how the features vary together; its eigenvectors are the principal components, and its eigenvalues measure the variance along each of them.
- Sorting in descending order of eigenvalue puts the most informative component first.
- The final dot product re-expresses every sample in principal component coordinates. Note that eigenvector signs are arbitrary, so different implementations may return components flipped by -1, which does not change the result in any meaningful way.
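In practice, PCA is usually computed via singular value decomposition (SVD) rather than an explicit eigendecomposition of the covariance matrix, since SVD is more numerically stable. A minimal sketch checking that both routes agree on the same toy dataset, up to the sign of each component:

```python
import numpy as np

data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9]])
X = data - data.mean(axis=0)  # center the data

# Route 1: eigendecomposition of the covariance matrix
cov = np.cov(X.T)
evals, evecs = np.linalg.eigh(cov)          # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

# Route 2: SVD of the centered data, X = U @ diag(s) @ Vt
# The rows of Vt are the principal axes.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_evals = s**2 / (X.shape[0] - 1)  # singular values -> variances

print(np.allclose(evals, svd_evals))                # same variances
print(np.allclose(np.abs(evecs), np.abs(Vt.T)))     # same axes, up to sign
```

Both comparisons hold because the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by n - 1, while the eigenvectors and the right singular vectors can differ only by a sign flip.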