- Identify Missing Data: Check for NaNs or blanks in the dataset.
- Imputation (replacing missing values):
- Mean Imputation: Replace missing values with the mean of the column.
- Median Imputation: Replace missing values with the median of the column.
- Mode Imputation: Replace missing categorical values with the mode.
- Deletion: Remove rows or columns with missing values, but only where little data is missing, since deletion also discards the valid entries in those rows or columns.
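The missing-data options above can be sketched in plain Python. This is a minimal, dependency-free illustration, not a definitive implementation: the helper names (`is_missing`, `impute`, `drop_missing_rows`) and the sample columns are hypothetical, and it assumes that `None`, empty strings, and NaN all count as missing.

```python
import math
from statistics import mean, median, mode


def is_missing(value):
    """Assumption: None, blank strings, and float NaN all count as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    return isinstance(value, float) and math.isnan(value)


def impute(column, strategy):
    """Return a copy of `column` with missing entries replaced by a summary
    statistic ('mean', 'median', or 'mode') of the observed entries."""
    observed = [v for v in column if not is_missing(v)]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if is_missing(v) else v for v in column]


def drop_missing_rows(rows):
    """Deletion: keep only rows in which no field is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row)]


# Hypothetical sample data: one numerical and one categorical column.
ages = [25, None, 31, float("nan"), 40]
colors = ["red", "", "blue", "red", None]

missing_count = sum(is_missing(v) for v in ages)      # identify missing data
ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
colors_mode = impute(colors, "mode")
complete_rows = drop_missing_rows(list(zip(ages, colors)))
```

Here deletion keeps only two of the five rows, which illustrates why it should be applied selectively.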
- Standardization (Z-score normalization): $x' = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
- Normalization (Min-Max scaling): $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, which rescales each value to the range $[0, 1]$.
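Both scaling methods can be written in a few lines. The sketch below assumes the population standard deviation (`statistics.pstdev`) and a column that is not constant, since a constant column would cause a division by zero; the function names and the `heights` data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical column
z_scores = standardize(heights)   # mean 0, standard deviation 1
scaled = min_max_scale(heights)   # values between 0 and 1
```

Standardized values are unbounded, whereas min-max scaled values always fall in a fixed interval. That makes min-max scaling more sensitive to extreme values.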
- Categorical Data Encoding:
- One-Hot Encoding: Convert categorical variables into binary vectors.
- Label Encoding: Convert categorical variables into numeric labels.
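The two encodings can be illustrated as follows. This is a minimal sketch that assumes categories are ordered alphabetically; `label_encode`, `one_hot_encode`, and the `colors` column are hypothetical names chosen for the example.

```python
def label_encode(values):
    """Map each distinct category to an integer (alphabetical order assumed)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each value to a binary vector with a single 1 at its category's position."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
labels, mapping = label_encode(colors)
vectors, categories = one_hot_encode(colors)
```

Label encoding implies an order among categories (here blue < green < red), which is only meaningful for ordinal data. One-hot encoding avoids that at the cost of one column per category.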
- Filter Methods: Select features based on statistical measures like correlation.
- Wrapper Methods: Use machine learning models to evaluate subsets of features.
- Embedded Methods: Feature selection as part of model training (e.g., Lasso regression).
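As an illustration of a filter method only (wrapper and embedded methods need a model and are not shown), the sketch below ranks features by the absolute Pearson correlation with the target and keeps those above a threshold. The threshold of 0.5, the helper names, and the data are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def filter_select(features, target, threshold):
    """Keep the features whose |correlation| with the target meets the threshold."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    return selected, scores


# Hypothetical data: 'size' tracks the target, 'noise' does not.
features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 2.0, 1.0, 3.0],
}
target = [10.0, 20.0, 30.0, 40.0, 50.0]

selected, scores = filter_select(features, target, threshold=0.5)
```

Correlation only captures linear relationships, so a feature with a strong nonlinear effect could be dropped by this particular filter.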
- Log Transformation: $x' = \log(x)$
- Box-Cox Transformation: $x' = \frac{x^\lambda - 1}{\lambda}$, where $\lambda$ is chosen to maximize normality.
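A small sketch of both transformations follows. It assumes strictly positive inputs and a value of $\lambda$ supplied by hand; estimating the $\lambda$ that maximizes normality is not shown. For $\lambda = 0$ the Box-Cox formula is replaced by its limit, the natural logarithm. The function names and sample values are hypothetical.

```python
import math


def log_transform(values):
    """Natural-log transform; inputs must be strictly positive."""
    return [math.log(x) for x in values]


def box_cox(values, lam):
    """Box-Cox transform for a given lambda; inputs must be strictly positive.

    lambda = 0 is handled as the limiting case log(x).
    """
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]


skewed = [1.0, 4.0, 9.0]                  # hypothetical right-skewed values
logged = log_transform(skewed)
root_like = box_cox(skewed, 0.5)          # lambda = 0.5 behaves like a square root
same_as_log = box_cox(skewed, 0)          # identical to log_transform
```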
- Identification: Use statistical methods (e.g., Z-score, IQR) to identify outliers.
- Handling: Replace outliers, cap them, or remove them based on domain knowledge.
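The identification and handling steps might look like the sketch below. The Z-score threshold of 2.0 and the IQR multiplier of 1.5 are conventional but arbitrary choices, and the helper names and `readings` data are hypothetical. Note that in a sample of seven values no point can reach a Z-score of 3, which is why a lower threshold is used here.

```python
from statistics import mean, pstdev, quantiles


def zscore_outliers(values, threshold):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values, lower, upper):
    """Capping: clip every value into the interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in values]


readings = [10, 12, 11, 13, 12, 11, 100]   # hypothetical data with one outlier

z_flagged = zscore_outliers(readings, threshold=2.0)
lower, upper = iqr_bounds(readings)
iqr_flagged = [x for x in readings if x < lower or x > upper]
capped = cap(readings, lower, upper)                       # cap the outlier
removed = [x for x in readings if lower <= x <= upper]     # or remove it
```

Whether to cap, replace, or remove a flagged value remains a domain decision; the code only shows the mechanics.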
- PCA (Principal Component Analysis): Transform data into a lower-dimensional space while retaining as much variance as possible.
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Embed high-dimensional data in two or three dimensions for visualization.
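To make the PCA idea concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix, and projects the data onto it. This is an illustrative simplification rather than a full PCA: it assumes the starting vector of ones is not orthogonal to the leading component, and all function names and the sample points are hypothetical. t-SNE is considerably more involved and is not shown.

```python
import math


def center(rows):
    """Subtract the column means so every feature has mean zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    d = len(cov)
    v = [1.0] * d  # assumption: not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical 2-D points lying on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(points)
component = first_component(covariance_matrix(centered))
reduced = project(centered, component)   # 2-D data expressed in 1-D
```

Because these points lie exactly on a line, a single component retains all of the variance. Real data would typically need more components.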
- Data Cleaning: Handle missing data, remove duplicates.
- Data Integration: Merge data from multiple sources.
- Data Transformation: Normalize, standardize, encode categorical variables.
- Data Reduction: Reduce dimensions, select relevant features.
- Data Discretization: Binning numerical variables.
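Three of the stages above that have no formula attached (duplicate removal, integration, and discretization) can be sketched as follows. The record layout, the join key `id`, the bin edges, and the helper names are all assumptions made for illustration.

```python
from bisect import bisect_right


def remove_duplicates(rows):
    """Data cleaning: drop rows that repeat an earlier row exactly."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def merge_on(left, right, key):
    """Data integration: inner join of two record lists on a shared key."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def bin_value(x, edges, labels):
    """Discretization: return the label of the bin that x falls into.

    `edges` are the sorted lower bounds of every bin after the first.
    """
    return labels[bisect_right(edges, x)]


# Hypothetical sources: customer records (with one duplicate) and order totals.
customers = [
    {"id": 1, "age": 17},
    {"id": 2, "age": 34},
    {"id": 2, "age": 34},
    {"id": 3, "age": 70},
]
orders = [{"id": 1, "total": 20.0}, {"id": 2, "total": 35.5}]

unique_customers = remove_duplicates(customers)
merged = merge_on(unique_customers, orders, key="id")
age_groups = [bin_value(r["age"], [18, 65], ["minor", "adult", "senior"])
              for r in unique_customers]
```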
- Step 1: Load dataset and inspect for missing values.
- Step 2: Impute missing values using mean or median.
- Step 3: Standardize numerical features using Z-score.
- Step 4: Encode categorical features using one-hot encoding.
- Step 5: Select relevant features using correlation matrix or feature importance.
- Step 6: Apply PCA for dimensionality reduction if needed.
- Step 7: Split data into training and test sets for model building.
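The workflow above can be traced end to end on a toy dataset. The sketch covers Steps 1 to 4 and Step 7 in the order listed; Steps 5 and 6 are omitted for brevity. The dataset, the 20% test fraction, the seed, and the `train_test_split` helper are hypothetical. In practice a library such as pandas or scikit-learn would typically handle these steps.

```python
import random
from statistics import mean, pstdev

# Step 1: load a (hypothetical, in-memory) dataset and count missing values.
rows = [
    {"age": 25.0, "city": "Paris", "label": 0},
    {"age": None, "city": "Rome", "label": 1},
    {"age": 35.0, "city": "Paris", "label": 0},
    {"age": 45.0, "city": "Rome", "label": 1},
    {"age": 55.0, "city": "Oslo", "label": 1},
]
missing = sum(1 for r in rows if r["age"] is None)

# Step 2: impute missing ages with the mean of the observed ages.
fill = mean(r["age"] for r in rows if r["age"] is not None)
ages = [fill if r["age"] is None else r["age"] for r in rows]

# Step 3: standardize the numerical feature with the Z-score.
mu, sigma = mean(ages), pstdev(ages)
ages_z = [(a - mu) / sigma for a in ages]

# Step 4: one-hot encode the categorical feature.
cities = sorted({r["city"] for r in rows})
X = [[z] + [1.0 if r["city"] == c else 0.0 for c in cities]
     for z, r in zip(ages_z, rows)]
y = [r["label"] for r in rows]


# Step 7: shuffle the row indices and hold out a fraction as the test set.
def train_test_split(X, y, test_fraction, seed):
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_fraction))
    test_idx = set(indices[:n_test])
    X_train = [X[i] for i in indices if i not in test_idx]
    y_train = [y[i] for i in indices if i not in test_idx]
    X_test = [X[i] for i in indices if i in test_idx]
    y_test = [y[i] for i in indices if i in test_idx]
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_fraction=0.2, seed=0)
```

Each row of `X` holds the standardized age followed by one indicator per city, in alphabetical order.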