Data Preprocessing in Machine Learning

Data Preprocessing Cheat Sheet in Machine Learning

1. Handling Missing Data

  • Identify Missing Data: Check for NaNs or blanks in the dataset.
  • Imputation (replacing missing values):
      • Mean Imputation: Replace missing values with the mean of the column.
      • Median Imputation: Replace missing values with the median of the column.
      • Mode Imputation: Replace missing categorical values with the mode.
  • Deletion: Remove rows or columns with missing values, but only when few are affected, since deletion discards information.
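
The options above can be sketched with pandas. This is a minimal illustration, not a recommended recipe: the DataFrame, its column names, and the choice of which strategy goes with which column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 1000000.0],
    "city": ["Pune", None, "Delhi", "Pune"],
})

# Identify: count missing values per column.
missing_counts = df.isna().sum()

# Imputation: fill each column with a statistic computed from its observed values.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())             # mean
imputed["income"] = imputed["income"].fillna(imputed["income"].median())  # median
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])  # mode

# Deletion: keep only the rows of the original frame that have no missing value.
dropped = df.dropna()
```

The median is used for `income` here because the column contains one very large value, which would pull the mean upward.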

2. Data Standardization

  • Standardization (Z-score normalization): $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
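
As a small worked example of the Z-score, the sketch below standardizes one feature with NumPy. The sample values are made up, and the population standard deviation (NumPy's default, `ddof=0`) is an assumption; some tools use the sample standard deviation instead.

```python
import numpy as np

# Made-up feature values chosen so the mean and standard deviation are round numbers.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()          # mean of the feature
sigma = x.std()        # population standard deviation (ddof=0)
z = (x - mu) / sigma   # standardized values: mean 0, standard deviation 1
```

When a model is evaluated on held-out data, the mean and standard deviation would normally be computed on the training portion only and then reused for the test portion.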

3. Data Normalization

  • Normalization (Min-Max scaling): $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, which rescales each feature to the range [0, 1].
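
A minimal sketch of Min-Max scaling with NumPy follows. The input values are invented for illustration, and the code assumes the feature is not constant (otherwise the denominator would be zero).

```python
import numpy as np

# Made-up feature values.
x = np.array([10.0, 20.0, 25.0, 50.0])

x_min, x_max = x.min(), x.max()
# Map the smallest value to 0 and the largest to 1; assumes x_max > x_min.
x_scaled = (x - x_min) / (x_max - x_min)
```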

4. Categorical Data Encoding

  • One-Hot Encoding: Convert categorical variables into binary vectors.
  • Label Encoding: Convert categorical variables into numeric labels.
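
Both encodings can be sketched with pandas. The `color` column is a hypothetical example, and using `pd.factorize` for label encoding is one option among several, chosen here only because it needs no extra library.

```python
import pandas as pd

# Hypothetical categorical feature.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(colors, prefix="color")

# Label encoding: map each category to an integer code.
# sort=True assigns codes in alphabetical order of the categories.
codes, categories = pd.factorize(colors, sort=True)
```

Integer labels imply an ordering (here blue < green < red), which can mislead models that treat the feature as numeric, so one-hot encoding is often the safer choice for unordered categories.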

5. Feature Selection

  • Filter Methods: Select features based on statistical measures like correlation.
  • Wrapper Methods: Use machine learning models to evaluate subsets of features.
  • Embedded Methods: Feature selection as part of model training (e.g., Lasso regression).
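
The sketch below illustrates a filter method (correlation with the target) and an embedded method (Lasso) on synthetic data. The feature names, the correlation threshold of 0.5, and `alpha=0.1` are arbitrary assumptions for the example, not recommended defaults.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n)

# Filter method: rank features by absolute Pearson correlation with the target.
corr = X.corrwith(y).abs().sort_values(ascending=False)
selected_by_filter = corr[corr > 0.5].index.tolist()

# Embedded method: the L1 penalty in Lasso drives weak coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected_by_lasso = [name for name, coef in zip(X.columns, lasso.coef_) if coef != 0.0]
```

A wrapper method would instead retrain a model on candidate feature subsets and keep the subset with the best validation score, which is more expensive than either approach shown here.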

6. Data Transformation

  • Log Transformation: $x' = \log(x)$, defined for $x > 0$.
  • Box-Cox Transformation: $x' = \frac{x^\lambda - 1}{\lambda}$ for $\lambda \neq 0$ (and $x' = \log(x)$ for $\lambda = 0$), where $\lambda$ is chosen to maximize normality.
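
Both transformations can be sketched as follows. The data are made up and strictly positive, as both transforms require. The helper `box_cox` is a hypothetical name written for this example; `scipy.stats.boxcox` is the library routine, which picks the lambda that maximizes the log-likelihood when none is given.

```python
import numpy as np
from scipy import stats

# Made-up, right-skewed, strictly positive values.
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])

# Log transformation.
x_log = np.log(x)

def box_cox(values, lam):
    """Box-Cox transform for a fixed lambda (hypothetical helper for illustration)."""
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

# Box-Cox with a hand-picked lambda.
x_bc_half = box_cox(x, 0.5)

# Box-Cox with lambda estimated from the data by SciPy.
x_bc_fit, lam_hat = stats.boxcox(x)
```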

7. Handling Outliers

  • Identification: Use statistical methods (e.g., Z-score, IQR) to identify outliers.
  • Handling: Replace outliers, cap them, or remove them based on domain knowledge.
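
The identification and handling steps can be sketched with NumPy. The data, the 1.5 × IQR fences, and the Z-score threshold of 2 are assumptions for the example; a threshold of 3 is more common on larger samples.

```python
import numpy as np

# Made-up values with one obvious outlier.
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier_iqr = (x < lower) | (x > upper)

# Z-score rule: flag points far from the mean in standard-deviation units.
# A threshold of 2 is used because with only 7 points |z| cannot reach 3.
z = (x - x.mean()) / x.std()
is_outlier_z = np.abs(z) > 2.0

# Handling: cap values at the IQR fences, or remove the flagged points.
x_capped = np.clip(x, lower, upper)
x_removed = x[~is_outlier_iqr]
```

Note that the outlier itself inflates the mean and standard deviation used by the Z-score, which is one reason the IQR rule is often preferred on small or heavily skewed samples.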

8. Dimensionality Reduction

  • PCA (Principal Component Analysis): Project data onto a lower-dimensional space while retaining as much variance as possible.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding): Embed high-dimensional data in two or three dimensions for visualization.
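
A minimal PCA sketch with scikit-learn is shown below. The data are synthetic and constructed so that almost all variance lies along one direction; keeping two components is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 3-D points that mostly vary along a single direction, plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2.0 * t, -t]) + rng.normal(scale=0.05, size=(100, 3))

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance captured by each retained component.
explained = pca.explained_variance_ratio_
```

Inspecting `explained` is a common way to decide how many components to keep.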

Steps in Data Preprocessing

  1. Data Cleaning: Handle missing data, remove duplicates.
  2. Data Integration: Merge data from multiple sources.
  3. Data Transformation: Normalize, standardize, encode categorical variables.
  4. Data Reduction: Reduce dimensions, select relevant features.
  5. Data Discretization: Bin numerical variables into discrete intervals.
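
Three of these steps (cleaning, integration, discretization) can be sketched with pandas. The two tables, their column names, and the bin edges are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical source tables; customer 2 appears twice.
customers = pd.DataFrame({"id": [1, 2, 2, 3], "age": [23, 37, 37, 61]})
orders = pd.DataFrame({"id": [1, 2, 3], "total": [20.0, 35.5, 12.0]})

# Data cleaning: remove exact duplicate rows.
cleaned = customers.drop_duplicates()

# Data integration: merge the two sources on their shared key.
merged = cleaned.merge(orders, on="id", how="left")

# Data discretization: bin ages into the intervals (0, 30], (30, 50], (50, 120].
merged["age_group"] = pd.cut(
    merged["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "senior"],
)
```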

Example Workflow

  • Step 1: Load dataset and inspect for missing values.
  • Step 2: Impute missing values using mean or median.
  • Step 3: Standardize numerical features using Z-score.
  • Step 4: Encode categorical features using one-hot encoding.
  • Step 5: Select relevant features using correlation matrix or feature importance.
  • Step 6: Apply PCA for dimensionality reduction if needed.
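  • Step 7: Split data into training and test sets for model building. To avoid data leakage, fit the imputation, scaling, encoding, and PCA steps on the training set only and then apply them to the test set.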
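
The workflow can be sketched as a scikit-learn pipeline. Everything specific here is an assumption for the example: the synthetic dataset, the column names, the median imputer, three PCA components, and logistic regression as the downstream model. Step 5 (feature selection) is omitted to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic dataset with two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50000, 15000, n),
    "city": rng.choice(["Pune", "Delhi", "Mumbai"], n),
})
df.loc[::10, "age"] = np.nan                    # make every 10th age missing
y = (df["income"] > 50000).astype(int)          # synthetic binary target

# Step 1: inspect missing values per column.
missing = df.isna().sum()

# Steps 2-3: impute then standardize the numeric columns.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Step 4: one-hot encode the categorical column.
preprocess = ColumnTransformer(
    [("num", numeric, ["age", "income"]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"])],
    sparse_threshold=0.0,                       # always return a dense array
)
# Step 6: PCA, followed by a simple classifier.
model = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression()),
])

# Step 7: split, then fit every preprocessing step on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Wrapping the preprocessing in a pipeline means the imputer, scaler, encoder, and PCA learn their parameters from the training rows alone, and the same fitted transformations are reused when scoring the test rows.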

