Week 13 of Data Science: Machine Learning and Feature Engineering, Part 1
"Algorithms cannot replace human insight, but they can amplify it." - Fei-Fei Li?
Embarking on another exhilarating week of my data science journey with Krish Sir from PWSkills! This week's focus was on two exciting topics: Machine Learning and Feature Engineering. I dived into the captivating world of AI, ML, and DL; explored the nuances of supervised, unsupervised, semi-supervised, and reinforcement learning; and unraveled the intricacies of datasets, anomaly detection, model fitting, and the delicate balance of bias and variance. Let's get started!
AI vs ML vs DL vs DS
Definition: Understanding the distinctions between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Data Science (DS).
Real-Life Usage: AI simulates human intelligence, ML enables systems to learn from data, DL deals with neural networks and complex data, and DS involves extracting insights from data.
Types of ML: Supervised, Unsupervised, Semi-Supervised, Reinforcement Learning
Definition: Exploring different ML paradigms based on data and learning approaches.
Concept: Supervised learning uses labeled data, unsupervised learning identifies patterns in unlabeled data, semi-supervised learning combines labeled and unlabeled data, and reinforcement learning trains agents through rewards and penalties.
Real-Life Usage: Supervised for prediction, unsupervised for clustering, semi-supervised when labeled data is limited, and reinforcement learning for game-playing AI.
About Dataset: Training, Validation, Test
Definition: Understanding the three key subsets of data used in ML: training, validation, and test sets.
Real-Life Usage: Training set to build the model, validation set to tune hyperparameters, and test set to evaluate the model's performance.
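A minimal sketch of creating the three splits with scikit-learn's train_test_split (X and y are assumed feature/label arrays):
from sklearn.model_selection import train_test_split
# Hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the remainder into 60% train / 20% validation (0.25 of the remaining 80%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)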
Anomaly Detection
Definition: Identifying rare instances that significantly differ from the norm.
Real-Life Usage: Fraud detection in banking, fault detection in manufacturing, and identifying outliers in healthcare data.
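One common approach, sketched here with scikit-learn's IsolationForest (X is an assumed numeric feature matrix, and the contamination rate is an illustrative choice):
from sklearn.ensemble import IsolationForest
# Flag roughly 1% of points as anomalies
detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(X)  # -1 = anomaly, 1 = normal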
Overfitting, Underfitting, and Generalized Model
Definition: Exploring the challenges of model fitting: overfitting (the model is too complex and memorizes noise in the training data), underfitting (the model is too simple to capture the underlying patterns), and the goal of a balanced, generalized model.
Check/Detect: Overfitting can be detected if the model performs well on training data but poorly on unseen data. Underfitting is identified when the model performs poorly on both training and test data.
Techniques to Reduce: For overfitting, reduce model complexity, increase data, or use regularization techniques like Lasso or Ridge. For underfitting, increase model complexity or gather more relevant features.
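A quick way to check this in practice (a minimal sketch, assuming a fitted scikit-learn estimator named model and existing train/test splits):
# Compare performance on seen vs. unseen data
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
# Large gap (high train, low test) -> likely overfitting;
# both scores low -> likely underfitting
print(f"Train: {train_score:.3f} | Test: {test_score:.3f}")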
Bias and Variance, Bias-Variance Tradeoff
Effect on Model: High bias results in systematic errors, while high variance leads to sensitivity to small fluctuations in training data.
Bias-Variance Tradeoff: Achieving a balance between bias and variance is crucial. Lowering bias can increase variance and vice versa.
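A useful way to see the tradeoff concretely is the standard decomposition of expected prediction error under squared loss: Expected Error = Bias² + Variance + Irreducible Noise. Because shrinking one term tends to inflate the other, the aim is to minimize their sum rather than either term alone.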
Missing Values and Types: MCAR, MAR, MNAR
Types: MCAR (Missing Completely at Random), MAR (Missing at Random), and MNAR (Missing Not at Random). Detection: check the patterns of missing values and their potential correlations with other variables.
Handling: Delete rows/columns with too many missing values. Impute values using mean, median, or mode for MCAR and MAR. Use advanced methods for MNAR.
Imputation Techniques in Python:
import pandas as pd
# Assuming df is your DataFrame
# For MCAR and MAR: impute missing values with the column mean
# (swap in df.median(numeric_only=True) for a median strategy)
df.fillna(df.mean(numeric_only=True), inplace=True)
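For the advanced methods suggested for MNAR, one option (a sketch, not the only choice) is model-based imputation such as scikit-learn's KNNImputer, applied here to the numeric columns:
from sklearn.impute import KNNImputer
# Fill each missing value from the 5 most similar rows
imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])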
Imbalanced Data, Handling Techniques
Techniques: Upsampling (increasing minority class instances), downsampling (reducing majority class instances), SMOTE (Synthetic Minority Over-sampling Technique), and data interpolation with linear, cubic, or polynomial methods.
SMOTE in Python:
from imblearn.over_sampling import SMOTE
# Create synthetic minority-class samples on the training set only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
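Plain upsampling (or, symmetrically, downsampling) can be sketched with sklearn.utils.resample; the 'target' column and 0/1 class labels here are illustrative:
import pandas as pd
from sklearn.utils import resample
majority = df[df['target'] == 0]
minority = df[df['target'] == 1]
# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])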
Outliers and Handling Outliers: Five-Number Summary
Definition: Outliers are data points that deviate significantly from the rest of the data.
Handling: Identify outliers using the five-number summary (min, Q1, median, Q3, max) and the interquartile range (IQR). Consider removing or transforming extreme outliers.
IQR Method for Outlier Detection and Removal:
# Assuming df is your DataFrame and 'Column' is numeric
Q1 = df['Column'].quantile(0.25)
Q3 = df['Column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Keep only rows inside the fences (inclusive, so boundary values survive)
outliers_removed = df[(df['Column'] >= lower_bound) & (df['Column'] <= upper_bound)]
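If transforming rather than removing is preferred, a minimal sketch that caps values at the IQR fences:
# Clip extreme values to the fence values instead of dropping rows
df['Column'] = df['Column'].clip(lower=lower_bound, upper=upper_bound)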
Feature Extraction and Types
Feature Scaling: Standardization (z-score normalization), Normalization (Min-Max scaling), Unit Vectors (scaling to unit length).
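A minimal sketch of the first two scalers with scikit-learn (X is an assumed numeric feature matrix):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X_standardized = StandardScaler().fit_transform(X)  # z-score: mean 0, std 1
X_normalized = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]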
Feature Selection: Filter (statistical tests), Wrapper (model performance), Embedded (combining with model training).
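As one example of the filter approach, a sketch with SelectKBest (k=10 is an arbitrary choice; X and y assumed):
from sklearn.feature_selection import SelectKBest, f_classif
# Keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)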
PCA (Principal Component Analysis): Steps involve mean centering, covariance matrix computation, eigenvector and eigenvalue calculation, selecting principal components based on the variance explained, and projecting data onto selected components.
Principal Component Analysis (PCA) with scikit-learn:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # set the desired number of principal components
X_pca = pca.fit_transform(X)  # X is your (ideally mean-centered/scaled) data matrix
print(pca.explained_variance_ratio_)  # proportion of variance each component explains
Data Encoding and Types: Nominal, OHE, Label, Ordinal, Target Guided Ordinal
Nominal and One-Hot Encoding (OHE): For categorical data with no inherent order, one-hot encoding creates a binary column for each category.
Label and Ordinal Encoding: Label encoding assigns arbitrary integer labels, while ordinal encoding preserves a meaningful order among categories.
Target Guided Ordinal Encoding: Assigns ordinal labels to categories ranked by the target variable's mean.
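A sketch of these encodings in Python (the 'city', 'size', and 'target' column names are illustrative):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# One-hot encoding for a nominal column
df_ohe = pd.get_dummies(df, columns=['city'])
# Ordinal encoding with an explicit category order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()
# Target-guided ordinal encoding: rank categories by the target mean
means = df.groupby('city')['target'].mean().sort_values()
df['city_encoded'] = df['city'].map({cat: rank for rank, cat in enumerate(means.index)})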
Covariance and Correlation
Covariance: A measure of how changes in one variable are associated with changes in another. Formula: Cov(X, Y) = Σ[(Xᵢ - X̄)(Yᵢ - Ȳ)] / (n - 1), where X̄ and Ȳ are the sample means.
Correlation (Pearson, Spearman Rank): Pearson measures linear relationships, and Spearman assesses monotonic relationships between variables.
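Both are one-liners on a numeric DataFrame in pandas:
pearson_corr = df.corr(method='pearson')    # strength of linear relationships
spearman_corr = df.corr(method='spearman')  # strength of monotonic (rank-based) relationships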
Stay curious and keep exploring the amazing world of data! #DataScience #MachineLearning #FeatureEngineering #DataMagic