A Data Sapient Guide to Feature Engineering: Handling Missing Data
Suparna Chowdhury
IBM Certified Data Scientist || Data Analyst || Machine Learning Enthusiast || Python || SQL || Tableau || Power BI
Feature engineering is the process of transforming and selecting relevant features from raw data to enhance machine learning model performance. By creating high-quality variables, it helps uncover patterns that improve predictions and overall interpretability.
Importance of Feature Engineering
Feature Transformation
Feature transformation involves modifying existing features to improve their effectiveness for modeling, enabling better handling of diverse data. This includes addressing missing values, scaling, encoding categorical variables, and applying mathematical functions to align features with model assumptions and enhance performance.
Handling Missing Values
While attempting to fit my data into the Logistic Regression model from scikit-learn, I encountered an error message indicating that the input data (X) contains missing values (NaN). Logistic Regression does not support NaN values, and many machine learning algorithms—including decision trees and support vector machines—require complete datasets without missing entries. If the input contains NaNs, these algorithms cannot perform calculations, leading to errors during training or predictions. Therefore, addressing missing values is a crucial preprocessing step in machine learning that significantly impacts model performance, accuracy, and reliability. This article focuses specifically on handling missing values, a common challenge in real-world datasets.
Understanding the nature of the missing values is crucial. Are they random, or do they follow a specific pattern? This insight can guide the choice of strategy for handling missing data and ensure that the approach aligns with the context of the dataset. Missing values can be categorized into three types:
1. Missing Completely at Random (MCAR): Data is considered MCAR when the missingness of data is unrelated to both observed and unobserved features. For instance, if participants skip survey questions due to distraction or confusion, the missing responses are MCAR.
2. Missing at Random (MAR): It occurs when the missingness is related to observed data but not to the missing values themselves. For example, if female participants in a survey are less likely to report their age than male participants, the missingness is related to the observed variable of gender.
3. Missing Not at Random (MNAR): The missingness is related to the unobserved data itself. For instance, in a mental health survey, individuals with severe anxiety may be less likely to answer questions about their mental health, making the missing responses directly influenced by the severity of their condition.
Identifying and Visualizing Missing Data in Pandas
To identify missing values in a Pandas DataFrame, use isnull().sum() for a count of null entries per column.
import pandas as pd
train = pd.read_csv('train.csv')
train.isnull().sum()
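The share of missing values per column is often more useful than raw counts; a quick follow-up on the same DataFrame:
# Percentage of missing values per column, sorted from most to least affected
(train.isnull().mean() * 100).sort_values(ascending=False)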
The Missingno library provides visual tools for analyzing missing data patterns, including bar charts and matrix visualizations. The bar chart illustrates the extent of missing values for each feature, while the matrix visualization helps pinpoint the distribution of missing data. Additionally, the heatmap reveals correlations between missing values across different features.
import missingno as msno
#Bar Chart
msno.bar(train)
#Matrix Chart
msno.matrix(train)
#Heatmap
msno.heatmap(train)
Strategy for Handling Missing Values
Addressing missing values starts with domain knowledge about why the data is absent. We can then employ strategies such as removal, imputation, or adding flags that indicate missingness. Handling missing values is vital, as machine learning models typically require complete datasets for optimal training and prediction.
In this article, I will utilize subsets of the Titanic and California Housing datasets from Kaggle to demonstrate various techniques for effectively managing missing values.
Deletion Strategies:
i) Listwise Deletion, also known as Complete Case Analysis (CCA), involves removing entire rows that contain any missing values so that analyses run on complete cases only.
While it simplifies analysis, this method can result in significant data loss and potential bias if missingness isn’t completely random, ultimately leading to inadequate training and poorer model performance.
This approach is suitable only when missing data is very small and assumed to be Missing Completely at Random (MCAR). However, in real-world datasets, CCA is often impractical due to larger amounts of missing data. Analyzing individual variables can help determine the best strategies for addressing missing values.
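Before committing to CCA, it is worth checking how much of the dataset would actually survive it; a quick check on the Titanic training data loaded above:
# Share of rows with no missing values at all; if this is close to 100%,
# complete case analysis may be acceptable
complete_fraction = len(train.dropna()) / len(train)
print(f"{complete_fraction:.1%} of rows are complete cases")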
# Remove rows with missing values in the 'Embarked' column
train.dropna(subset=['Embarked'],how='any',inplace=True)
test.dropna(subset=['Embarked'],how='any',inplace=True)
ii) Dropping Columns/Features with Missing Values: This involves removing features with a very high proportion of missing data (commonly more than 80%) instead of removing entire rows, particularly in larger datasets where the missingness may not be informative.
This method preserves the number of available samples essential for effective model training and maintains a robust dataset that can improve analysis and enhance model performance.
# Drop 'Cabin' column due to excessive missing values
train.drop(columns=['Cabin'], axis=1, inplace=True)
test.drop(columns=['Cabin'], axis=1, inplace=True)
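If you prefer to select such columns programmatically rather than by name, a small sketch based on the 80% rule of thumb mentioned above:
# Identify features where more than 80% of the values are missing
threshold = 0.8
cols_to_drop = train.columns[train.isnull().mean() > threshold]

# Drop them from both splits (keeping only columns that exist in the test set)
train.drop(columns=list(cols_to_drop), inplace=True)
test.drop(columns=[c for c in cols_to_drop if c in test.columns], inplace=True)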
iii) Pairwise Deletion: This method handles missing data by using, for each individual analysis, all rows that are complete for the variables involved in that analysis, rather than removing entire rows or columns with missing values.
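A familiar example is pandas' corr(), which by default computes each pairwise correlation from the rows where both columns are present; a toy illustration (hypothetical values, not the Titanic data):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':  [22, np.nan, 35, 58, 41],
    'fare': [7.25, 71.3, np.nan, 51.9, 8.05],
    'sibsp': [1, 1, 0, 0, 2],
})

# Each correlation uses only the rows complete for that particular pair,
# so no row is discarded from the DataFrame itself
print(df.corr())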
Imputation Strategies:
Imputation involves filling in missing values with informed estimates, preserving valuable information and minimizing data loss.
a) Univariate Imputation:
It replaces missing values in a variable using non-missing values from that same variable. Common techniques for univariate imputation include:
i) Arbitrary Value Imputation is a specific method where missing values are filled with a predetermined constant value, rather than calculated estimates, which can simplify the process but may introduce bias.
Using Pandas:
# Fill missing values in 'Embarked' with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
test['Embarked'] = test['Embarked'].fillna('S')
Using Scikit-Learn:
We can use SimpleImputer and ColumnTransformer from scikit-learn to efficiently impute both categorical and numerical data simultaneously. This method allows for customized imputation strategies for different data types, ensuring effective handling of missing values.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# Define the imputers for each column
embarked_imputer = SimpleImputer(strategy='constant', fill_value='S')
cabin_imputer = SimpleImputer(strategy='constant', fill_value='G6')
age_imputer = SimpleImputer(strategy='constant', fill_value=35)
fare_imputer = SimpleImputer(strategy='constant', fill_value=8)
# Create a ColumnTransformer to apply different imputers
imputer = ColumnTransformer(
    transformers=[
        ('embarked', embarked_imputer, ['Embarked']),
        ('cabin', cabin_imputer, ['Cabin']),
        ('fare', fare_imputer, ['Fare']),
        ('age', age_imputer, ['Age'])
    ],
    remainder='drop'  # Drop columns not included in transformers
)
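The ColumnTransformer above only defines the imputers; a minimal sketch of applying it (the imputed columns come back in the order the transformers are listed, so they can be assigned straight back):
cols = ['Embarked', 'Cabin', 'Fare', 'Age']

# Fit the imputers on the training data, then reuse them on the test data
train[cols] = imputer.fit_transform(train)
test[cols] = imputer.transform(test)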
ii) Imputation with Mean, Median, or Mode replaces missing values with the mean, median, or mode of the available data. The mode is used for categorical data, while the mean is suitable for normally distributed numerical data, and the median is preferred in the presence of outliers. These methods assume that missing data is Missing Completely at Random (MCAR). Although effective for small amounts of missing data, they can reduce variability and overlook important relationships between features.
We can use SimpleImputer and ColumnTransformer from scikit-learn to impute both categorical and numerical data simultaneously. It allows us to specify different strategies for each data type, such as using the mean or median for numerical data and the most frequent value (mode) for categorical data. Check my Medium article for more details.
Remember, imputation should use the training set's statistics to prevent data leakage and ensure that model training relies solely on available information, avoiding overfitting and inaccurate evaluation.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
columns_to_impute = ['Embarked', 'Cabin', 'Fare', 'Age']
train_mixed = train[columns_to_impute]
test_mixed = test[columns_to_impute]
# Identify categorical and numerical columns
categorical_cols = train_mixed.select_dtypes(include=['category', 'object']).columns
numerical_cols = train_mixed.select_dtypes(include=['number']).columns
# Imputation strategies
categorical_transformer = SimpleImputer(strategy='most_frequent')
numerical_transformer = SimpleImputer(strategy='mean') # for median imputer, use strategy='median'
# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numerical_transformer, numerical_cols)
    ])
# Impute missing values
train_mixed = pd.DataFrame(preprocessor.fit_transform(train_mixed), columns=columns_to_impute)
test_mixed = pd.DataFrame(preprocessor.transform(test_mixed), columns=columns_to_impute)
Median Imputation:
from sklearn.impute import SimpleImputer
columns_to_impute = ['Fare','Age']
train_median = train[columns_to_impute]
test_median = test[columns_to_impute]
# Imputation strategies
imputer = SimpleImputer(strategy='median')
# Impute missing values
train_median = pd.DataFrame(imputer.fit_transform(train_median), columns=columns_to_impute)
test_median = pd.DataFrame(imputer.transform(test_median), columns=columns_to_impute)
iii) Random Sample Imputation replaces missing data by randomly selecting values from existing data within the same variable, preserving its distribution. Suitable for data that is Missing Completely at Random (MCAR), it requires setting a random seed for consistency.
While easy to implement and maintaining variance, it can affect covariance and is memory-intensive, as the original dataset must be stored. This method works for both numerical and categorical data, preserving the frequency of existing categories.
train_random = train.copy()
test_random = test.copy()
# Impute missing values in train_random
missing_train = train_random['Age'].isnull().sum()
if missing_train > 0:
    samples_train = train_random['Age'].dropna().sample(missing_train, random_state=42).values
    train_random.loc[train_random['Age'].isnull(), 'Age'] = samples_train

# Impute missing values in test_random, sampling from the training data to avoid leakage
missing_test = test_random['Age'].isnull().sum()
if missing_test > 0:
    samples_test = train_random['Age'].dropna().sample(missing_test, random_state=42).values
    test_random.loc[test_random['Age'].isnull(), 'Age'] = samples_test
# Inspect a few rows where 'Age' was originally missing
train_random[(train_random.index == 5) | (train_random.index == 17) | (train_random.index == 19)]
iv) Imputation with Time Series Data involves techniques like Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), mean/median filling, and interpolation to replace missing values.
For a more detailed exploration of imputation techniques in time series data, please refer to my Medium article.
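As a quick sketch of what these options look like in pandas (on a toy series, not one of the datasets used here):
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 12.5, np.nan, np.nan, 15.0])

s.ffill()           # LOCF: carry the last observed value forward
s.bfill()           # NOCB: carry the next observed value backward
s.fillna(s.mean())  # fill gaps with the series mean
s.interpolate()     # linear interpolation between known points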
b) Multivariate Imputation:
This method uses information from multiple features to estimate and fill in missing data, leading to more accurate and unbiased results. It employs machine learning models, such as k-nearest neighbors (KNN), random forest, or linear regression, to predict missing values.
For example, a regression model can use features like age and education to predict missing values in an income feature. This approach is particularly effective for Missing At Random (MAR) data, since it leverages relationships between features.
For multivariate imputation, I will use the California Housing dataset from Kaggle, addressing missing values while taking the relationships between multiple features into account.
Common techniques for multivariate imputation include:
i) K-Nearest Neighbors (KNN): The KNNImputer in Scikit-learn estimates missing values based on the values of the k nearest neighbors with available data. Imputation can use a simple or weighted average, where closer neighbors have a greater influence.
from sklearn.impute import KNNImputer

# Work on a copy without the target column
housing_knn = housing.drop(columns=['median_house_value']).copy(deep=True)

# KNN imputation needs the other features to find the nearest neighbours,
# so impute using all numeric columns rather than 'total_bedrooms' alone
numeric_cols = housing_knn.select_dtypes(include=['number']).columns

# Initialize the KNN imputer
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Fit and transform; 'total_bedrooms' is filled from its 2 nearest neighbours
housing_knn[numeric_cols] = knn_imputer.fit_transform(housing_knn[numeric_cols])
ii) Multivariate Imputation by Chained Equations (MICE) creates multiple imputed datasets to handle missing data iteratively. It treats each feature with missing values as a dependent variable and uses other features to predict those missing values. In each iteration, the algorithm updates the imputed values based on predictions from regression models, cycling through all variables to refine estimates. This method preserves relationships among variables and offers more robust estimates compared to simpler approaches.
# Enable the experimental feature (the import is needed for its side effect)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Create copies of the datasets
train_mice = train.drop(columns=['Survived']).copy(deep=True)
test_mice = test.copy(deep=True)

# Use all numeric features so the imputer can model the relationships between them
numeric_cols = train_mice.select_dtypes(include=['number']).columns

# Initialize the Iterative Imputer
imputer = IterativeImputer(random_state=42)

# Fit on the training data and impute the numeric columns, including 'Age'
train_mice[numeric_cols] = imputer.fit_transform(train_mice[numeric_cols])

# Transform the test data using the imputer fitted on the training data
test_mice[numeric_cols] = imputer.transform(test_mice[numeric_cols])
iii) Miss Forest is an advanced imputation method that uses Random Forests to fill in missing data. It starts by imputing missing values with the mean for continuous variables and the most frequent category for categorical variables. The dataset is then split into observed and missing parts, with the Random Forest model trained on the observed data to predict the missing values. This iterative process continues until changes in imputation are minimal or a set limit is reached, typically achieving reliable data after about 5 to 6 iterations. Overall, Miss Forest provides a precise and iterative approach to handling missing values.
# Assuming the third-party 'MissForest' package (pip install MissForest)
from missforest import MissForest

train = pd.read_csv('./Datasets/titanic/train.csv')
train.drop(columns=['PassengerId', 'Survived', 'Name', 'Ticket'], inplace=True)
# Initialize the imputer
imputer = MissForest()
# Specify categorical columns
categorical_columns = train.select_dtypes('O').columns
# Fit and transform the data
df_imputed = imputer.fit_transform(train, categorical=categorical_columns)
Adding Missing Indicator:
It is a technique for handling missing data by creating binary columns that indicate whether a value is missing or not. For each feature with missing values, a new column is added where 1 signifies a missing value and 0 indicates its presence. This helps track missing information and can enhance model predictions. For example, an indicator for a missing "Age" value may improve prediction accuracy.
import numpy as np

X_train = train.drop(columns=['Survived'])
y_train = train['Survived']
X_test = test.copy()

# Binary flag: 1 if 'Age' is missing, 0 otherwise
X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)
Original variables can still be imputed with the mean or median, allowing the model to leverage both the predictive power of the variable and the missing indicator. This method is particularly effective with linear models. However, adding these indicators increases the number of features, and if many variables have missing values for the same data points, they may become highly correlated.
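scikit-learn can produce the same flags as part of imputation: with add_indicator=True, SimpleImputer appends a binary missing-indicator column next to the imputed values. A minimal sketch for 'Age', assuming the X_train and X_test frames defined above:
from sklearn.impute import SimpleImputer

# Median-impute 'Age' and append its binary missing indicator in one step
age_imp = SimpleImputer(strategy='median', add_indicator=True)

X_train[['Age', 'Age_NA']] = age_imp.fit_transform(X_train[['Age']])
X_test[['Age', 'Age_NA']] = age_imp.transform(X_test[['Age']])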
Imputation Methods Based on Missing Data Types: A Guide
Choosing the right method for handling missing values depends on the context. While removing data is the fastest option, imputation is often a better choice.
Univariate mean/median imputation can be effective for large datasets with few missing entries. However, for datasets with significant missing values or complexity, advanced techniques like KNN, MissForest, or MICE are typically more effective and provide better predictive accuracy. Nevertheless, mean/median imputation tends to be faster than KNN or MissForest.
a) Missing Completely At Random (MCAR): Use mean, median, mode, or other imputation methods.
b) Missing At Random (MAR): Effective methods include multivariate imputation like regression imputation, KNN, MICE, and MissForest.
c) Missing Not At Random (MNAR):
i) Modeling Missingness: Requires explicit models to address the relationship.
ii) Pattern Substitution: Fills in missing data based on identified patterns.
iii) Maximum Likelihood Estimation (MLE): Estimates missing values by maximizing the likelihood of observed data.
Algorithms That Handle Missing Data
Some machine learning algorithms can work with missing data directly. Gradient boosting implementations such as XGBoost, LightGBM, and CatBoost handle NaN values natively, and scikit-learn's histogram-based gradient boosting estimators learn at each split which branch samples with missing values should follow.
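For instance, scikit-learn's HistGradientBoostingClassifier accepts NaN in the input directly; a minimal sketch on toy data (hypothetical values):
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

# Toy feature matrix containing missing entries; no imputation step is needed
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

clf = HistGradientBoostingClassifier().fit(X, y)
print(clf.predict(X))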
In conclusion, effectively managing missing values is essential for ensuring the accuracy and reliability of our analyses. By understanding the nature of missing data—whether MCAR, MAR, or MNAR—we can choose appropriate strategies to enhance model performance and derive deeper insights in predictive analytics.
"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." - Pedro Domingos
Curious about the details of handling missing values? Check out my latest Medium article for in-depth insights and strategies: The Art of Feature Engineering: Handling Missing Values
Thank you for reading! I’d love to hear your thoughts or any questions you might have, so feel free to drop a comment below.
Let’s share ideas and learn from each other!
** All images included in this newsletter are created by the author.