Week 13 of Data Science : Machine Learning and Feature Engineering Part 1

"Algorithms cannot replace human insight, but they can amplify it." - Fei-Fei Li?
Embarking on another exhilarating week of my data science journey with Krish Sir from PWSkills! This week's focus was on two exciting topics: Machine Learning and Feature Engineering. I dived into the captivating world of AI, ML, and DL and explored the nuances of supervised, unsupervised, semi-supervised, and reinforcement learning. I also unraveled the intricacies of datasets, anomaly detection, model fitting, and the delicate balance of bias and variance. Let's get started!

AI vs ML vs DL vs DS


Definition: Understanding the distinctions between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Data Science (DS).

Real-Life Usage: AI simulates human intelligence, ML enables systems to learn from data, DL deals with neural networks and complex data, and DS involves extracting insights from data.

Types of ML: Supervised, Unsupervised, Semi-Supervised, Reinforcement Learning


Definition: Exploring different ML paradigms based on data and learning approaches.

Concept: Supervised learning uses labeled data, unsupervised learning identifies patterns in unlabeled data, semi-supervised learning combines labeled and unlabeled data, and reinforcement learning trains agents through rewards and punishments.

Real-Life Usage: Supervised learning for prediction, unsupervised for clustering, semi-supervised when labeled data is scarce, and reinforcement learning for game-playing AI.


About Datasets: Training, Validation, Test


Definition: Understanding the three key subsets of data used in ML: training, validation, and test sets.


Real-Life Usage: Training set to build the model, validation set to tune hyperparameters, and test set to evaluate the model's performance.
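
Splitting Data in Python (a minimal sketch using scikit-learn's train_test_split; X and y are hypothetical feature and label arrays, and the 60/20/20 split is just one common choice):

from sklearn.model_selection import train_test_split

# X, y are hypothetical feature and label arrays
# First carve out the test set (20%), then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Result: 60% train, 20% validation, 20% test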

Anomaly Detection


Definition: Identifying rare instances that significantly differ from the norm.

Real-Life Usage: Fraud detection in banking, fault detection in manufacturing, and identifying outliers in healthcare data.
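
Anomaly Detection in Python (a minimal sketch using scikit-learn's IsolationForest; this is one common approach, not necessarily the exact method from the lecture):

import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up data: mostly "normal" points plus a few injected outliers
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
X[:5] += 6

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print("Anomalies found:", (labels == -1).sum())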

Overfitting, Underfitting, and Generalized Model


Definition: Exploring the challenges of model fitting: overfitting (high complexity), underfitting (low complexity), and achieving a balanced, generalized model.


Check/Detect: Overfitting can be detected if the model performs well on training data but poorly on unseen data. Underfitting is identified when the model performs poorly on both training and test data.


Techniques to Reduce: For overfitting, reduce model complexity, increase data, or use regularization techniques like Lasso or Ridge. For underfitting, increase model complexity or gather more relevant features.
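
Regularization in Python (a minimal sketch of Ridge and Lasso, assuming X_train, y_train, X_test, and y_test already exist):

from sklearn.linear_model import Lasso, Ridge

ridge = Ridge(alpha=1.0)  # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1)  # L1 penalty can zero out coefficients entirely

for model in (ridge, lasso):
    model.fit(X_train, y_train)
    # A large gap between train and test scores suggests overfitting
    print(type(model).__name__, model.score(X_train, y_train), model.score(X_test, y_test))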

Bias and Variance, Bias-Variance Tradeoff


Effect on Model: High bias results in systematic errors (the model underfits), while high variance leads to sensitivity to small fluctuations in the training data (the model overfits).

Bias-Variance Tradeoff: Achieving a balance between bias and variance is crucial. Lowering bias can increase variance and vice versa.

Missing Values and Types: MCAR, MAR, MNAR


Detection: MCAR (Missing Completely at Random), MAR (Missing at Random), MNAR (Missing Not at Random). Check patterns of missing values and their potential correlations with other variables.


Handling: Delete rows/columns with too many missing values. Impute values using mean, median, or mode for MCAR and MAR. Use advanced methods for MNAR.


Imputation Techniques in Python:

import pandas as pd

# Assuming df is your DataFrame
# For MCAR and MAR: impute numeric columns with their column means
df = df.fillna(df.mean(numeric_only=True))

Imbalanced Data, Handling Techniques

Techniques: Upsampling (increasing minority class instances), downsampling (reducing majority class instances), SMOTE (Synthetic Minority Over-sampling Technique), and data interpolation with linear, cubic, or polynomial methods.
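
Upsampling in Python (a minimal sketch using sklearn.utils.resample; df and its 'target' column are hypothetical names):

import pandas as pd
from sklearn.utils import resample

# df is a hypothetical DataFrame with a binary 'target' column
majority = df[df['target'] == 0]
minority = df[df['target'] == 1]

minority_upsampled = resample(minority,
                              replace=True,             # sample with replacement
                              n_samples=len(majority),  # match majority class size
                              random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])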


SMOTE in Python:

from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class samples from the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Outliers and Handling Outliers: 5-Number Summary

Definition: Outliers are data points that deviate significantly from the rest of the data.


Handling: Identify outliers using the 5-Number Summary (min, Q1, median, Q3, max) and the interquartile range (IQR). Consider removing or transforming extreme outliers.

IQR Method for Outlier Detection and Removal:

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = df['Column'].quantile(0.25)
Q3 = df['Column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_removed = df[(df['Column'] >= lower_bound) & (df['Column'] <= upper_bound)]

Feature Extraction and Types

Feature Scaling: Standardization (z-score normalization), Normalization (Min-Max scaling), Unit Vectors (scaling to unit length).
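
Feature Scaling in Python (a minimal sketch; X is a hypothetical numeric feature matrix):

from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# X is a hypothetical numeric feature matrix
X_standardized = StandardScaler().fit_transform(X)  # z-score: mean 0, std 1
X_normalized = MinMaxScaler().fit_transform(X)      # rescale each feature to [0, 1]
X_unit = Normalizer().fit_transform(X)              # scale each row to unit length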


Feature Selection: Filter (statistical tests), Wrapper (model performance), Embedded (combining with model training).
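
Feature Selection in Python (one possible sketch of the three families; X and y are hypothetical, and the specific estimators are my picks for illustration):

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# X, y are hypothetical feature matrix and target
# Filter: keep the k features with the best univariate test scores
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: recursively eliminate features based on model performance
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded: feature importances fall out of model training itself
importances = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_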


PCA (Principal Component Analysis): The steps are mean centering, computing the covariance matrix, calculating its eigenvectors and eigenvalues, selecting principal components based on the variance they explain, and projecting the data onto the selected components.
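
The same steps can be sketched by hand in NumPy (assuming X is an n_samples x n_features array); the scikit-learn shortcut follows below:

import numpy as np

# X is a hypothetical n_samples x n_features array
X_centered = X - X.mean(axis=0)             # 1. mean centering
cov = np.cov(X_centered, rowvar=False)      # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigenvalues/eigenvectors
order = np.argsort(eigvals)[::-1]           # 4. sort by variance explained
components = eigvecs[:, order[:2]]          # keep the top 2 components
X_projected = X_centered @ components       # 5. project the data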


Principal Component Analysis (PCA) with scikit-learn:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)             # set desired number of principal components
X_pca = pca.fit_transform(X)          # X is your data matrix
print(pca.explained_variance_ratio_)  # fraction of variance each component explains

Data Encoding and Types: Nominal, OHE, Label, Ordinal, Target Guided Ordinal


Nominal and One-Hot Encoding (OHE): For nominal (unordered) categorical data, one-hot encoding creates a separate binary column for each category.
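
One-Hot Encoding in Python (a minimal sketch with pandas; the 'color' column is a made-up example):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
encoded = pd.get_dummies(df, columns=['color'])  # one binary column per category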


Label and Ordinal Encoding: Label encoding assigns an arbitrary integer to each category, while ordinal encoding maps categories to integers that preserve their natural order.
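
Label and Ordinal Encoding in Python (a minimal sketch with scikit-learn; the size categories are hypothetical):

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = [['small'], ['large'], ['medium'], ['small']]  # hypothetical categories

# Label encoding: arbitrary integer per category (order not meaningful)
labels = LabelEncoder().fit_transform([s[0] for s in sizes])

# Ordinal encoding: integers follow an explicit category order
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']]).fit_transform(sizes)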


Target Guided Ordinal Encoding: Assigning ordinal labels to categories based on the mean of the target variable within each category.
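
Target Guided Ordinal Encoding in Python (a minimal pandas sketch; df, 'city', and 'price' are hypothetical names):

import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'A', 'C', 'B', 'C'],
                   'price': [100, 300, 120, 200, 310, 190]})

# Rank categories by the target's mean, then map each category to its rank
means = df.groupby('city')['price'].mean().sort_values()
mapping = {city: rank for rank, city in enumerate(means.index)}
df['city_encoded'] = df['city'].map(mapping)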


Covariance and Correlation

Covariance: A measure of how changes in one variable are associated with changes in another. Formula: Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1).
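
For example, with NumPy (x and y are made-up samples):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])
print(np.cov(x, y)[0, 1])  # sample covariance, divides by n - 1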


Correlation (Pearson, Spearman Rank): Pearson measures linear relationships, and Spearman assesses monotonic relationships between variables.
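
Correlation in Python (a minimal sketch with pandas; the columns are hypothetical):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 6]})
print(df['x'].corr(df['y'], method='pearson'))   # linear relationship
print(df['x'].corr(df['y'], method='spearman'))  # monotonic (rank-based)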


Stay curious and keep exploring the amazing world of data! #DataScience #MachineLearning #FeatureEngineering #DataMagic
