Week 13 of Data Science: Machine Learning and Feature Engineering, Part 1
"Algorithms cannot replace human insight, but they can amplify it." - Fei-Fei Li?
Embarking on another exhilarating week of my data science journey with Krish Sir from PWSkills! This week's focus was on two exciting topics: Machine Learning and Feature Engineering. I dived into the captivating world of AI, ML, and DL; explored the nuances of supervised, unsupervised, semi-supervised, and reinforcement learning; and unraveled the intricacies of datasets, anomaly detection, model fitting, and the delicate balance of bias and variance. Let's get started!
AI vs ML vs DL vs DS
Definition: Understanding the distinctions between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Data Science (DS).
Real-Life Usage: AI simulates human intelligence, ML enables systems to learn from data, DL deals with neural networks and complex data, and DS involves extracting insights from data.
Types of ML: Supervised, Unsupervised, Semi-Supervised, Reinforcement Learning
Definition: Exploring different ML paradigms based on data and learning approaches.
Concept: Supervised learning uses labeled data, unsupervised learning identifies patterns in unlabeled data, semi-supervised learning combines labeled and unlabeled data, and reinforcement learning trains agents through rewards and penalties.
Real-Life Usage: Supervised for prediction, unsupervised for clustering, semi-supervised when labeled data is limited, and reinforcement learning for game-playing AI.
About Dataset: Training, Validation, Test
Definition: Understanding the three key subsets of data used in ML: training, validation, and test sets.
Real-Life Usage: Training set to build the model, validation set to tune hyperparameters, and test set to evaluate the model's performance.
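A minimal sketch of creating the three splits with scikit-learn's train_test_split (X and y are assumed feature/label arrays):
from sklearn.model_selection import train_test_split
# Hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the remainder into 60% train / 20% validation (0.25 of the remaining 80%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)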
Anomaly Detection
Definition: Identifying rare instances that significantly differ from the norm.
Real-Life Usage: Fraud detection in banking, fault detection in manufacturing, and identifying outliers in healthcare data.
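One common approach, sketched here with scikit-learn's IsolationForest (X is an assumed numeric feature matrix, and the contamination rate is an illustrative choice):
from sklearn.ensemble import IsolationForest
# Flag roughly 1% of points as anomalies
detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(X)  # -1 = anomaly, 1 = normal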
Overfitting, Underfitting, and Generalized Model
Definition: Exploring the challenges of model fitting: overfitting (the model is too complex and memorizes noise in the training data), underfitting (the model is too simple to capture the underlying patterns), and the goal of a balanced, generalized model.
Check/Detect: Overfitting can be detected if the model performs well on training data but poorly on unseen data. Underfitting is identified when the model performs poorly on both training and test data.
Techniques to Reduce: For overfitting, reduce model complexity, increase data, or use regularization techniques like Lasso or Ridge. For underfitting, increase model complexity or gather more relevant features.
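A quick way to check this in practice (a minimal sketch, assuming a fitted scikit-learn estimator named model and existing train/test splits):
# Compare performance on seen vs. unseen data
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
# Large gap (high train, low test) -> likely overfitting;
# both scores low -> likely underfitting
print(f"Train: {train_score:.3f} | Test: {test_score:.3f}")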
Bias and Variance, Bias-Variance Tradeoff
Effect on Model: High bias results in systematic errors, while high variance leads to sensitivity to small fluctuations in training data.
Bias-Variance Tradeoff: Achieving a balance between bias and variance is crucial. Lowering bias can increase variance and vice versa.
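A useful way to see the tradeoff concretely is the standard decomposition of expected prediction error under squared loss: Expected Error = Bias² + Variance + Irreducible Noise. Because shrinking one term tends to inflate the other, the aim is to minimize their sum rather than either term alone.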
Missing Values and Types: MCAR, MAR, MNAR
Types: MCAR (Missing Completely at Random), MAR (Missing at Random), and MNAR (Missing Not at Random). Detection: check the patterns of missing values and their potential correlations with other variables.
Handling: Delete rows/columns with too many missing values. Impute values using mean, median, or mode for MCAR and MAR. Use advanced methods for MNAR.
Imputation Techniques in Python:
import pandas as pd
# Assuming df is your DataFrame
# For MCAR and MAR: impute missing values with the column mean
# (swap in df.median(numeric_only=True) for a median strategy)
df.fillna(df.mean(numeric_only=True), inplace=True)
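For the advanced methods suggested for MNAR, one option (a sketch, not the only choice) is model-based imputation such as scikit-learn's KNNImputer, applied here to the numeric columns:
from sklearn.impute import KNNImputer
# Fill each missing value from the 5 most similar rows
imputer = KNNImputer(n_neighbors=5)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])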
Imbalanced Data, Handling Techniques
Techniques: Upsampling (increasing minority class instances), downsampling (reducing majority class instances), SMOTE (Synthetic Minority Over-sampling Technique), and data interpolation with linear, cubic, or polynomial methods.
SMOTE in Python:
from imblearn.over_sampling import SMOTE
# Create synthetic minority-class samples on the training set only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
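Plain upsampling (or, symmetrically, downsampling) can be sketched with sklearn.utils.resample; the 'target' column and 0/1 class labels here are illustrative:
import pandas as pd
from sklearn.utils import resample
majority = df[df['target'] == 0]
minority = df[df['target'] == 1]
# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])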
Outliers and Handling Outliers: Five-Number Summary
Definition: Outliers are data points that deviate significantly from the rest of the data.
Handling: Identify outliers using the five-number summary (min, Q1, median, Q3, max) and the interquartile range (IQR). Consider removing or transforming extreme outliers.
IQR Method for Outlier Detection and Removal:
# Assuming df is your DataFrame and 'Column' is numeric
Q1 = df['Column'].quantile(0.25)
Q3 = df['Column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Keep only rows inside the fences (inclusive, so boundary values survive)
outliers_removed = df[(df['Column'] >= lower_bound) & (df['Column'] <= upper_bound)]
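If transforming rather than removing is preferred, a minimal sketch that caps values at the IQR fences:
# Clip extreme values to the fence values instead of dropping rows
df['Column'] = df['Column'].clip(lower=lower_bound, upper=upper_bound)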
Feature Extraction and Types
Feature Scaling: Standardization (z-score normalization), Normalization (Min-Max scaling), Unit Vectors (scaling to unit length).
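A minimal sketch of the first two scalers with scikit-learn (X is an assumed numeric feature matrix):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
X_standardized = StandardScaler().fit_transform(X)  # z-score: mean 0, std 1
X_normalized = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]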
Feature Selection: Filter (statistical tests), Wrapper (model performance), Embedded (combining with model training).
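As one example of the filter approach, a sketch with SelectKBest (k=10 is an arbitrary choice; X and y assumed):
from sklearn.feature_selection import SelectKBest, f_classif
# Keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)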
PCA (Principal Component Analysis): Steps involve mean centering, covariance matrix computation, eigenvector and eigenvalue calculation, selecting principal components based on the variance explained, and projecting data onto selected components.
Principal Component Analysis (PCA) with scikit-learn:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # set the desired number of principal components
X_pca = pca.fit_transform(X)  # X is your (ideally mean-centered/scaled) data matrix
print(pca.explained_variance_ratio_)  # proportion of variance each component explains
Data Encoding and Types: Nominal, OHE, Label, Ordinal, Target Guided Ordinal
Nominal and One-Hot Encoding (OHE): For categorical data with no inherent order, one-hot encoding creates a binary column for each category.
Label and Ordinal Encoding: Label encoding assigns arbitrary integer labels, while ordinal encoding preserves a meaningful order among categories.
Target Guided Ordinal Encoding: Assigns ordinal labels to categories ranked by the target variable's mean.
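A sketch of these encodings in Python (the 'city', 'size', and 'target' column names are illustrative):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# One-hot encoding for a nominal column
df_ohe = pd.get_dummies(df, columns=['city'])
# Ordinal encoding with an explicit category order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = encoder.fit_transform(df[['size']]).ravel()
# Target-guided ordinal encoding: rank categories by the target mean
means = df.groupby('city')['target'].mean().sort_values()
df['city_encoded'] = df['city'].map({cat: rank for rank, cat in enumerate(means.index)})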
Covariance and Correlation
Covariance: A measure of how changes in one variable are associated with changes in another. Formula: Cov(X, Y) = Σ[(Xᵢ - X̄)(Yᵢ - Ȳ)] / (n - 1), where X̄ and Ȳ are the sample means.
Correlation (Pearson, Spearman Rank): Pearson measures linear relationships, and Spearman assesses monotonic relationships between variables.
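Both are one-liners on a numeric DataFrame in pandas:
pearson_corr = df.corr(method='pearson')    # strength of linear relationships
spearman_corr = df.corr(method='spearman')  # strength of monotonic (rank-based) relationships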
Stay curious and keep exploring the amazing world of data! #DataScience #MachineLearning #FeatureEngineering #DataMagic