Data Cleaning
The following problems are commonly encountered in raw data before it is cleaned:
- Duplicate records [R Williams = Rob Williams]
- Misfielded values [Country = "New India"]
- Missing values [Age = 00]
- Illegal values [Gender = Q]
- Violated attribute dependencies [Zip = 600 041, City = "India"]
- Multiple values in a single column [Name = "Kumar 56382 -980"]
Typical remedies and preprocessing steps:
- Standardize numerical data (e.g. to a mean of 0 and a standard deviation of 1) using the scale and center options (StandardScaler in Python's scikit-learn).
- Normalize numerical data (e.g. to a range of 0-1) using the range option.
- Replace missing values in a numerical column such as Age with the column mean, and missing values in a categorical column such as Cabin with the most common value.
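As a rough illustration of the two rescaling steps above, here is a minimal sketch using only the Python standard library. The function names `standardize` and `normalize` and the sample ages are made up for this example; in practice a library scaler such as the StandardScaler mentioned above would normally be used instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; must be non-zero
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Rescale values linearly to the range 0-1 (min-max scaling)."""
    lo, hi = min(values), max(values)  # hi must differ from lo
    return [(v - lo) / (hi - lo) for v in values]

ages = [22, 38, 26, 35, 54]  # hypothetical sample values
z_scores = standardize(ages)
scaled = normalize(ages)
```

Both helpers assume the column is not constant, since a constant column would cause a division by zero.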
Categorical NaNs
- Categorical values can be trickier, so compare your model performance metrics before and after imputing them. The standard approach is to replace each missing entry with the most frequent value in the column.
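A minimal sketch of most-frequent-value imputation for a categorical column, written with the standard library only. The helper name `impute_most_frequent`, the use of `None` as the missing marker, and the sample cabin codes are assumptions made for illustration.

```python
from collections import Counter

def impute_most_frequent(column, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in column if v is not missing]
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [most_common_value if v is missing else v for v in column]

cabin = ["C85", None, "B28", "C85", None]  # hypothetical values
filled = impute_most_frequent(cabin)
```

Comparing model metrics on `cabin` and `filled` versions of the data is one way to check whether the imputation helped or hurt.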
Mean, Median and Mode
- Computing the overall mean, median or mode is a very basic imputation method. Of the functions tested, it is the only one that takes no advantage of time series characteristics or of relationships between the variables. It is very fast but has clear disadvantages; for example, mean imputation reduces the variance of the dataset.
In R
library(imputeTS)
# imputeTS 3.0 and later name this function na_mean(); older releases used na.mean()
na_mean(mydata, option = "mean")   # Mean imputation
na_mean(mydata, option = "median") # Median imputation
na_mean(mydata, option = "mode")   # Mode imputation
In Python
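import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed from scikit-learn
from sklearn.impute import SimpleImputer

values = mydata.values  # mydata is assumed to be a pandas DataFrame
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
transformed_values = imputer.fit_transform(values)
# strategy can also be set to "median" or "most_frequent"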
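The variance-reducing effect of mean imputation noted above can be checked with a small standard-library sketch. The age values are invented, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, pvariance

ages = [22.0, None, 26.0, 35.0, None, 54.0]  # hypothetical values

# Mean of the observed entries, used as the fill value.
observed = [a for a in ages if a is not None]
fill_value = mean(observed)
imputed = [fill_value if a is None else a for a in ages]

# The imputed entries sit exactly at the mean, so they add nothing to the
# sum of squared deviations while increasing the number of data points.
variance_observed = pvariance(observed)
variance_imputed = pvariance(imputed)
```

In this sketch the mean is unchanged by the imputation, while the variance shrinks by the ratio of observed entries to total entries (4 of 6 here).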
Imputation of Categorical Variables
- Mode imputation is one method, but it can introduce bias because it inflates the share of the most frequent category.
- Missing values can be treated as a separate category by itself. We can create another category for the missing values and use them as a different level. This is the simplest method.
- Prediction models: Here, we create a predictive model to estimate values that will substitute the missing data. In this case, we divide our dataset into two sets: one set with no missing values for the variable (training) and another one with missing values (test). We can use methods like logistic regression and ANOVA for prediction.
- Multiple imputation: several plausible completed datasets are generated, each one is analyzed separately, and the results are pooled.
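The separate-category and prediction-model options above can be sketched as follows, using only the standard library. The records, the `"Unknown"` level name, and the per-class most-frequent-value "model" are all assumptions for illustration; the lookup model is a deliberately simple stand-in for a real classifier such as the logistic regression mentioned above.

```python
from collections import Counter, defaultdict

MISSING = None

# Hypothetical records: (passenger_class, embarked_port); the port may be missing.
rows = [
    (1, "C"), (1, "C"), (1, "S"),
    (3, "S"), (3, "S"), (3, "Q"),
    (1, MISSING), (3, MISSING),
]

# Option 1: treat missing values as their own category level.
as_own_level = [(cls, "Unknown" if port is MISSING else port) for cls, port in rows]

# Option 2: predict the missing values from another variable.
train = [r for r in rows if r[1] is not MISSING]  # target observed
test = [r for r in rows if r[1] is MISSING]       # target missing

# Stand-in model: the most frequent port within each passenger class.
counts = defaultdict(Counter)
for cls, port in train:
    counts[cls][port] += 1
model = {cls: c.most_common(1)[0][0] for cls, c in counts.items()}

predicted = [(cls, model[cls]) for cls, _ in test]
```

The train/test split here follows the description in the list: rows with an observed target train the model, and rows with a missing target receive its predictions.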
Some algorithms can be made robust to missing data. k-Nearest Neighbors, for example, can ignore a column in the distance measure when a value is missing.
There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees.
Sadly, the scikit-learn implementation of k-Nearest Neighbors does not accept missing values, and its decision trees accepted them only from version 1.3 onward.
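The idea of a k-Nearest Neighbors distance that skips missing columns can be sketched in plain Python. The function names, the use of `None` as the missing marker, and the sample points are assumptions for this example; it is not how any particular library implements the technique, and it does not rescale for the number of columns actually compared.

```python
import math

def distance_ignoring_missing(a, b):
    """Euclidean distance over the coordinates present in both points."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared coordinates, so nothing to compare
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

def nearest_neighbor_label(query, points, labels):
    """Return the label of the single closest point (k = 1)."""
    distances = [distance_ignoring_missing(query, p) for p in points]
    return labels[distances.index(min(distances))]

# Hypothetical training points with missing coordinates marked as None.
points = [(1.0, 2.0, None), (8.0, None, 9.0), (1.5, 1.8, 3.0)]
labels = ["a", "b", "a"]
query = (7.5, 8.0, None)
label = nearest_neighbor_label(query, points, labels)
```

Because points that share fewer coordinates with the query accumulate fewer squared differences, a production version would usually normalize the distance by the number of coordinates compared.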