A Data Sapient Guide to Feature Engineering: Handling Missing Data


Feature engineering is the process of transforming and selecting relevant features from raw data to enhance machine learning model performance. By creating high-quality variables, it helps uncover patterns that improve predictions and overall interpretability.

Importance of Feature Engineering

  • Feature engineering is essential as it enhances model performance by utilizing high-quality data for training.
  • By highlighting relevant patterns and filtering out noise, it improves generalization to unseen data.
  • Moreover, simpler models with well-chosen features tend to perform better and are easier to interpret than complex models with irrelevant features.



Feature Transformation

Feature transformation involves modifying existing features to improve their effectiveness for modeling, enabling better handling of diverse data. This includes addressing missing values, scaling, encoding categorical variables, and applying mathematical functions to align features with model assumptions and enhance performance.


Handling Missing Values


Error Message During Model Fitting


While attempting to fit my data into the Logistic Regression model from scikit-learn, I encountered an error message indicating that the input data (X) contains missing values (NaN). Logistic Regression does not support NaN values, and many machine learning algorithms—including decision trees and support vector machines—require complete datasets without missing entries. If the input contains NaNs, these algorithms cannot perform calculations, leading to errors during training or predictions. Therefore, addressing missing values is a crucial preprocessing step in machine learning that significantly impacts model performance, accuracy, and reliability. This article focuses specifically on handling missing values, a common challenge in real-world datasets.

Understanding the nature of the missing values is crucial. Are they random, or do they follow a specific pattern? This insight can guide the choice of strategy for handling missing data and ensure that the approach aligns with the context of the dataset. Missing values can be categorized into three types:

1. Missing Completely at Random (MCAR): Data is considered MCAR when the missingness of data is unrelated to both observed and unobserved features. For instance, if participants skip survey questions due to distraction or confusion, the missing responses are MCAR.

2. Missing at Random (MAR): This occurs when the missingness is related to observed data but not to the missing values themselves. For example, if female participants in a survey are less likely to report their age than male participants, the missingness is related to the observed variable of gender.

3. Missing Not at Random (MNAR): The missingness is related to the unobserved data itself. For instance, in a mental health survey, individuals with severe anxiety may be less likely to answer questions about their mental health, making the missing responses directly influenced by the severity of their condition.


Identifying and Visualizing Missing Data in Pandas

To identify missing values in a Pandas DataFrame, call isnull().sum() for a count of null entries per column.

import pandas as pd

train = pd.read_csv('train.csv')
train.isnull().sum()
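Counts alone can be hard to compare across columns; a quick follow-up on the same DataFrame (a small sketch, not from the original article) shows the share of missing entries per column instead:

# Percentage of missing values per column, sorted from most to least missing
missing_pct = train.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])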

The Missingno library provides visual tools for analyzing missing data patterns, including bar charts and matrix visualizations. The bar chart illustrates the extent of missing values for each feature, while the matrix visualization helps pinpoint the distribution of missing data. Additionally, the heatmap reveals correlations between missing values across different features.

import missingno as msno

# Bar chart: extent of missing values per feature
msno.bar(train)
# Matrix: where in the dataset values are missing
msno.matrix(train)
# Heatmap: correlations between missingness across features
msno.heatmap(train)


Strategy for Handling Missing Values

Addressing missing values involves using domain knowledge to understand the underlying reasons for their absence. To handle them, we can employ strategies such as removal, imputation, or creating flags that indicate their absence. Handling missing values is vital, as machine learning models typically require complete datasets for optimal training and prediction.

In this article, I will utilize subsets of the Titanic and California Housing datasets from Kaggle to demonstrate various techniques for effectively managing missing values.



Missingno Bar Visualization of Titanic Train Data


Missingno Matrix Visualization of Titanic Test Data


Summary of Missing Values in Training and Test Data


Deletion Strategies:

i) Listwise Deletion, also known as Complete Case Analysis (CCA), removes every row that contains any missing value, so the analysis runs only on complete records.


Listwise Deletion

While it simplifies analysis, this method can result in significant data loss and potential bias if missingness isn’t completely random, ultimately leading to inadequate training and poorer model performance.

This approach is suitable only when missing data is very small and assumed to be Missing Completely at Random (MCAR). However, in real-world datasets, CCA is often impractical due to larger amounts of missing data. Analyzing individual variables can help determine the best strategies for addressing missing values.

# Remove rows with missing values in the 'Embarked' column

train.dropna(subset=['Embarked'],how='any',inplace=True)
test.dropna(subset=['Embarked'],how='any',inplace=True)        
Summary of Missing Values in Training and Test Data After Dropping Nulls from the Embarked Column


ii) Dropping Columns/Features with Missing Values: This involves removing features with over 80% missing data instead of removing entire rows, which is particularly useful in larger datasets where the missingness may not be informative.


This method preserves the number of available samples essential for effective model training and maintains a robust dataset that can improve analysis and enhance model performance.

# Drop 'Cabin' column due to excessive missing values
train.drop(columns=['Cabin'], inplace=True)
test.drop(columns=['Cabin'], inplace=True)        

iii) Pairwise Deletion: It is a method for handling missing data that retains all available information by analyzing data points specific to each analysis, rather than removing entire rows or columns with missing values.
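Pairwise deletion is what pandas does by default when computing correlations: each pairwise statistic uses only the rows where both columns are present, so different cells of a correlation matrix may be based on different numbers of rows. A minimal sketch on the Titanic numeric columns (an illustration, not from the original article):

# Pairwise deletion: each correlation uses only the rows where both columns are non-null
numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch']
corr_pairwise = train[numeric_cols].corr()

# Contrast with listwise deletion, which first drops any row with a missing value
corr_listwise = train[numeric_cols].dropna().corr()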


Imputation Strategies:

Imputation involves filling in missing values with informed estimates, preserving valuable information and minimizing data loss.

a) Univariate Imputation:

It replaces missing values in a variable using non-missing values from that same variable. Common techniques for univariate imputation include:

i) Arbitrary Value Imputation is a specific method where missing values are filled with a predetermined constant value, rather than calculated estimates, which can simplify the process but may introduce bias.

Using Pandas:

Two Missing Values in Embarked Column in Train Data
# Fill missing values in 'Embarked' with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
test['Embarked'] = test['Embarked'].fillna('S')        


After Imputing Constant in the Embarked Column


Using Scikit-Learn:

We can use SimpleImputer and ColumnTransformer from scikit-learn to efficiently impute both categorical and numerical data simultaneously. This method allows for customized imputation strategies for different data types, ensuring effective handling of missing values.


Some Missing Values in Age and Cabin Columns
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Define the imputers for each column
embarked_imputer = SimpleImputer(strategy='constant', fill_value='S')
cabin_imputer = SimpleImputer(strategy='constant', fill_value='G6')
age_imputer = SimpleImputer(strategy='constant', fill_value=35)
fare_imputer = SimpleImputer(strategy='constant', fill_value=8)

# Create a ColumnTransformer to apply different imputers
imputer = ColumnTransformer(
    transformers=[
        ('embarked', embarked_imputer, ['Embarked']),
        ('cabin', cabin_imputer, ['Cabin']),
        ('fare', fare_imputer, ['Fare']),
        ('age', age_imputer, ['Age'])
    ],
    remainder='drop'  # Drop columns not included in transformers
)        
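The ColumnTransformer above is only defined at this point; it still has to be fitted and applied. A minimal sketch of that step (fitting on the training data and reusing the fitted transformer on the test data) might look like this:

# Apply the constant-value imputers; the output column order follows the transformer order
columns_to_impute = ['Embarked', 'Cabin', 'Fare', 'Age']
train_imputed = pd.DataFrame(imputer.fit_transform(train[columns_to_impute]),
                             columns=columns_to_impute, index=train.index)
test_imputed = pd.DataFrame(imputer.transform(test[columns_to_impute]),
                            columns=columns_to_impute, index=test.index)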


Constant Value Imputation for Missing Data


ii) Imputation with Mean, Median, or Mode replaces missing values with the mean, median, or mode of the available data. The mode is used for categorical data, while the mean is suitable for normally distributed numerical data, and the median is preferred in the presence of outliers. These methods assume that missing data is Missing Completely at Random (MCAR). Although effective for small amounts of missing data, they can reduce variability and overlook important relationships between features.

We can use SimpleImputer and ColumnTransformer from scikit-learn to impute both categorical and numerical data simultaneously. It allows us to specify different strategies for each data type, such as using the mean or median for numerical data and the most frequent value (mode) for categorical data. Check my Medium article for more details.

Remember, imputation should use the training set's statistics to prevent data leakage and ensure that model training relies solely on available information, avoiding overfitting and inaccurate evaluation.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

columns_to_impute = ['Embarked', 'Cabin', 'Fare', 'Age']
train_mixed = train[columns_to_impute]
test_mixed = test[columns_to_impute]

# Identify categorical and numerical columns
categorical_cols = train_mixed.select_dtypes(include=['category', 'object']).columns
numerical_cols = train_mixed.select_dtypes(include=['number']).columns

# Imputation strategies
categorical_transformer = SimpleImputer(strategy='most_frequent')
numerical_transformer = SimpleImputer(strategy='mean') # for median imputer, use strategy='median'

# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numerical_transformer, numerical_cols)
    ])

# Impute missing values
train_mixed = pd.DataFrame(preprocessor.fit_transform(train_mixed), columns=columns_to_impute)
test_mixed = pd.DataFrame(preprocessor.transform(test_mixed), columns=columns_to_impute)        


After Mean and Most Frequent Imputation in Numerical and Categorical Features



Median Imputation:

from sklearn.impute import SimpleImputer

columns_to_impute = ['Fare','Age']
train_median = train[columns_to_impute]
test_median = test[columns_to_impute]

# Imputation strategies
imputer = SimpleImputer(strategy='median') 

# Impute missing values
train_median = pd.DataFrame(imputer.fit_transform(train_median), columns=columns_to_impute)
test_median = pd.DataFrame(imputer.transform(test_median), columns=columns_to_impute)
        


After Median Imputation


KDE Plot of Original, Mean Imputed and Median Imputed Data
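A comparison plot like the one above can be reproduced with a few lines of seaborn; the sketch below assumes the train, train_mixed, and train_median frames created earlier:

import matplotlib.pyplot as plt
import seaborn as sns

# Compare the 'Age' distribution before and after imputation
fig, ax = plt.subplots(figsize=(8, 5))
sns.kdeplot(train['Age'].dropna(), label='Original', ax=ax)
sns.kdeplot(train_mixed['Age'].astype(float), label='Mean imputed', ax=ax)
sns.kdeplot(train_median['Age'], label='Median imputed', ax=ax)
ax.legend()
plt.show()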


iii) Random Sample Imputation replaces missing data by randomly selecting values from existing data within the same variable, preserving its distribution. Suitable for data that is Missing Completely at Random (MCAR), it requires setting a random seed for consistency.

While easy to implement and maintaining variance, it can affect covariance and is memory-intensive, as the original dataset must be stored. This method works for both numerical and categorical data, preserving the frequency of existing categories.


train_random = train.copy()
test_random = test.copy()

# Impute missing values in train_random
missing_train = train_random['Age'].isnull().sum()
if missing_train > 0:
    samples_train = train_random['Age'].dropna().sample(missing_train, random_state=42).values
    train_random.loc[train_random['Age'].isnull(), 'Age'] = samples_train
    

# Impute missing values in test_random
missing_test = test_random['Age'].isnull().sum()
if missing_test > 0:
    samples_test = train_random['Age'].dropna().sample(missing_test, random_state=42).values
    test_random.loc[test_random['Age'].isnull(), 'Age'] = samples_test
    
    
train_random[(train_random.index == 5) | (train_random.index == 17) | (train_random.index == 19)]        


Random Value Imputation for Missing Data


iv) Imputation with Time Series Data involves techniques like Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), mean/median filling, and interpolation to replace missing values.

  1. Last Observation Carried Forward (LOCF) uses the last known value to fill in missing data (forward fill, or 'ffill').
  2. Next Observation Carried Backward (NOCB) fills missing values with the next known value (backward fill, or 'bfill').
  3. Linear Interpolation estimates missing values by connecting existing data points with straight lines, providing a smooth representation based on surrounding trends.

For a more detailed exploration of imputation techniques in time series data, please refer to my Medium article.
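As a rough pandas sketch of these three fills (the toy series below is illustrative only):

import numpy as np
import pandas as pd

# Toy daily series with gaps
ts = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0],
               index=pd.date_range('2024-01-01', periods=6, freq='D'))

ts_locf = ts.ffill()                        # Last Observation Carried Forward
ts_nocb = ts.bfill()                        # Next Observation Carried Backward
ts_interp = ts.interpolate(method='time')   # linear interpolation in time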


b) Multivariate Imputation:

This method uses information from multiple features to estimate and fill in missing data, leading to more accurate and unbiased results. It employs machine learning models, such as k-nearest neighbors (KNN), random forest, or linear regression, to predict missing values.

For example, a regression model can use features like age and education to predict missing values in income feature in a dataset. This method is particularly effective for addressing Missing At Random (MAR) data, leveraging relationships between features.
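To make the idea concrete, here is a small regression-imputation sketch on toy data (the column names and values are illustrative, not from the datasets used in this article):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: 'income' is partially missing
df = pd.DataFrame({
    'age':             [25, 32, 47, 51, 38, 29],
    'education_years': [12, 16, 18, 16, 14, 12],
    'income':          [30000, 52000, np.nan, 78000, np.nan, 35000],
})

known = df[df['income'].notnull()]
missing = df[df['income'].isnull()]

# Fit a regression on the complete rows, then predict only the missing incomes
reg = LinearRegression().fit(known[['age', 'education_years']], known['income'])
df.loc[df['income'].isnull(), 'income'] = reg.predict(missing[['age', 'education_years']])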

For multivariate imputation, I will also use the California Housing dataset from Kaggle, addressing missing values while considering the relationships between multiple features.

Common techniques for multivariate imputation include:

i) K-Nearest Neighbors (KNN): The KNNImputer in Scikit-learn estimates missing values based on the values of the k nearest neighbors with available data. Imputation can use a simple or weighted average, where closer neighbors have a greater influence.

from sklearn.impute import KNNImputer

# Create a copy of the dataset without the target column
housing_knn = housing.drop(columns=['median_house_value']).copy(deep=True)

# Initialize the KNN imputer
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Impute using all numeric features so that neighbors are found from
# related columns rather than from 'total_bedrooms' alone
numeric_cols = housing_knn.select_dtypes(include='number').columns
housing_knn[numeric_cols] = knn_imputer.fit_transform(housing_knn[numeric_cols])        


ii) Multivariate Imputation by Chained Equations (MICE) creates multiple imputed datasets to handle missing data iteratively. It treats each feature with missing values as a dependent variable and uses other features to predict those missing values. In each iteration, the algorithm updates the imputed values based on predictions from regression models, cycling through all variables to refine estimates. This method preserves relationships among variables and offers more robust estimates compared to simpler approaches.

# Enable the experimental feature
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer

# Create copies of the datasets
train_mice = train.drop(columns=['Survived']).copy(deep=True)
test_mice = test.copy(deep=True)

# Initialize the Iterative Imputer
imputer = IterativeImputer()

# Use several numeric features so the imputer can model 'Age' from related columns;
# fitting on 'Age' alone would leave it nothing to regress on
num_cols = ['Age', 'Pclass', 'SibSp', 'Parch', 'Fare']

# Fit the imputer on the training data and transform the numeric columns
train_mice[num_cols] = imputer.fit_transform(train_mice[num_cols])

# Transform the test data using the fitted imputer
test_mice[num_cols] = imputer.transform(test_mice[num_cols])        


KDE Plot of Original, KNN Imputed and MICE Imputed Data


iii) Miss Forest is an advanced imputation method that uses Random Forests to fill in missing data. It starts by imputing missing values with the mean for continuous variables and the most frequent category for categorical variables. The dataset is then split into observed and missing parts, with the Random Forest model trained on the observed data to predict the missing values. This iterative process continues until changes in imputation are minimal or a set limit is reached, typically achieving reliable data after about 5 to 6 iterations. Overall, Miss Forest provides a precise and iterative approach to handling missing values.

# MissForest is provided by the third-party 'MissForest' package (pip install MissForest);
# the exact import path can vary between versions
from missforest import MissForest

train = pd.read_csv('./Datasets/titanic/train.csv')
train.drop(columns=['PassengerId', 'Survived', 'Name', 'Ticket'], inplace=True)

# Initialize the imputer
imputer = MissForest()

# Specify categorical columns
categorical_columns = train.select_dtypes('O').columns

# Fit and transform the data
df_imputed = imputer.fit_transform(train, categorical=categorical_columns)        


Adding a Missing Indicator:

It is a technique for handling missing data by creating binary columns that indicate whether a value is missing or not. For each feature with missing values, a new column is added where 1 signifies a missing value and 0 indicates its presence. This helps track missing information and can enhance model predictions. For example, an indicator for a missing "Age" value may improve prediction accuracy.

import numpy as np

X_train = train.drop(columns=['Survived'])
y_train = train['Survived']
X_test = test.copy()

# Binary flags: 1 where 'Age' is missing, 0 otherwise
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)        

Original variables can still be imputed with the mean or median, allowing the model to leverage both the predictive power of the variable and the missing indicator. This method is particularly effective with linear models. However, adding these indicators increases the number of features, and if many variables have missing values for the same data points, they may become highly correlated.
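scikit-learn can also produce the indicator alongside the imputation in one step via SimpleImputer's add_indicator parameter; a minimal sketch (continuing with the X_train and X_test frames above):

from sklearn.impute import SimpleImputer
import pandas as pd

# Median-impute 'Age' and append a binary missing-indicator column in one pass
imputer = SimpleImputer(strategy='median', add_indicator=True)

train_age = pd.DataFrame(imputer.fit_transform(X_train[['Age']]),
                         columns=['Age', 'Age_missing'], index=X_train.index)
test_age = pd.DataFrame(imputer.transform(X_test[['Age']]),
                        columns=['Age', 'Age_missing'], index=X_test.index)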


Imputation Methods Based on Missing Data Types: A Guide

Choosing the right method for handling missing values depends on the context. While removing data is the fastest option, imputation is often a better choice.

Univariate mean/median imputation can be effective for large datasets with few missing entries. However, for datasets with significant missing values or complexity, advanced techniques like KNN, MissForest, or MICE are typically more effective and provide better predictive accuracy. Nevertheless, mean/median imputation tends to be faster than KNN or MissForest.

a) Missing Completely At Random (MCAR): Use mean, median, mode, or other imputation methods.

b) Missing At Random (MAR): Effective methods include multivariate imputation like regression imputation, KNN, MICE, and MissForest.

c) Missing Not At Random (MNAR):

i) Modeling Missingness: Requires explicit models to address the relationship.

ii) Pattern Substitution: Fills in missing data based on identified patterns.

iii) Maximum Likelihood Estimation (MLE): Estimates missing values by maximizing the likelihood of observed data.


Algorithms That Handle Missing Data

Some machine learning algorithms effectively manage missing data, each employing unique strategies for robust performance.

  • Naive Bayes Classifier: Ignores missing values by calculating likelihoods based only on observed features and conditional probabilities from non-missing rows for that feature.
  • Decision Tree: Accommodates missing values by making splits based on available data, employing instance weights for impurity calculations and surrogate splits for accuracy.
  • XGBoost: Manages missing values during training by learning a branch direction for them at each split, treating designated missing values as NaN by default (see the sketch after this list).
  • LightGBM: Handles missing values using NaN by default, but can treat zeros as missing if "zero_as_missing=true" is set; unrecorded values in sparse matrices are treated as zeros unless specified otherwise.
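As a quick illustration of this native handling, a gradient-boosting model such as XGBoost can be trained directly on a feature matrix that still contains NaNs. A rough sketch (assumes the xgboost package and the numeric X_train/y_train from earlier):

import xgboost as xgb

# X_train may still contain NaNs; XGBoost learns a default branch
# direction for missing values at every split
model = xgb.XGBClassifier(n_estimators=100, eval_metric='logloss')
model.fit(X_train.select_dtypes(include='number'), y_train)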


In conclusion, effectively managing missing values is essential for ensuring the accuracy and reliability of our analyses. By understanding the nature of missing data—whether MCAR, MAR, or MNAR—we can choose appropriate strategies to enhance model performance and derive deeper insights in predictive analytics.


"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." - Pedro Domingos


Curious about the details of handling missing values? Check out my latest Medium article for in-depth insights and strategies: The Art of Feature Engineering: Handling Missing Values

Thank you for reading! I’d love to hear your thoughts or any questions you might have, so feel free to drop a comment below.

Let’s share ideas and learn from each other!


** All images included in this newsletter are created by the author.
