Data Cleaning

The following are common problems encountered before cleaning the data:

  1. Duplicate records [R Williams = Rob Williams]
  2. Misfielded values [Country = "New India"]
  3. Missing values disguised as defaults [Age = 00]
  4. Illegal values [Gender = Q]
  5. Violated attribute dependencies [Zip = 600 041, City = "India"]
  6. Multiple values in a single column [Name = "Kumar 56382 -980"]
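Several of these problems can be detected programmatically before any cleaning is done. A minimal sketch in pandas (the column names and sample rows are made up to mirror the examples above):

```python
import pandas as pd

# Hypothetical sample data illustrating the problems above
df = pd.DataFrame({
    "name":   ["R Williams", "Rob Williams", "Kumar 56382 -980"],
    "gender": ["M", "Q", "F"],   # "Q" is an illegal value
    "age":    [34, 0, 29],       # 0 encodes a missing age
})

# Illegal values: anything outside the allowed domain
illegal = df[~df["gender"].isin(["M", "F"])]

# Missing values disguised as sentinels (Age = 0)
disguised_missing = df[df["age"] == 0]

# Multiple values packed into one column: digits inside a name field
mixed = df[df["name"].str.contains(r"\d")]

print(len(illegal), len(disguised_missing), len(mixed))  # → 1 1 1
```

Duplicate records like "R Williams" vs "Rob Williams" are harder: exact-match checks (`df.duplicated()`) miss them, so fuzzy matching is usually needed.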
  • Standardize numerical data (e.g. to a mean of 0 and a standard deviation of 1) using the scale and center options (StandardScaler in Python scikit-learn).
  • Normalize numerical data (e.g. to a range of 0-1) using the range option.
  • For missing values, we can replace Age with the mean of the Age column, and replace missing Cabin entries with the most common value.
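The two scaling approaches above can be sketched with scikit-learn (StandardScaler for standardizing, MinMaxScaler for the 0-1 range; the input array is a made-up example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardize: shift to mean 0, rescale to standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalize: rescale linearly into the range 0-1
normalized = MinMaxScaler().fit_transform(X)

print(standardized.mean(), standardized.std())  # mean ≈ 0, std ≈ 1
print(normalized.min(), normalized.max())       # 0.0 and 1.0
```

Both scalers learn their statistics in `fit` and apply them in `transform`, so the same fitted scaler can be reused on test data without leaking information.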

Categorical NaNs

  • Categorical values can be a bit trickier, so you should definitely pay attention to your model's performance metrics after editing (compare before and after). The standard approach is to replace each missing entry with the most frequent one:
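In pandas, replacing categorical NaNs with the most frequent value is a two-liner. The column name `Embarked` and the sample values here are a hypothetical Titanic-style example, not from the original text:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with missing entries
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

# Replace each NaN with the most frequent category (the mode)
most_frequent = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_frequent)

print(df["Embarked"].tolist())  # → ['S', 'C', 'S', 'S', 'S']
```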

Mean, Median and Mode

  • Computing the overall mean, median, or mode is a very basic imputation method. It takes no advantage of time-series characteristics or of relationships between variables, so it is very fast but has clear disadvantages. One disadvantage is that mean imputation reduces the variance of the dataset.
In R
library(imputeTS)
na.mean(mydata, option = "mean")   # Mean Imputation
na.mean(mydata, option = "median") # Median Imputation
na.mean(mydata, option = "mode")   # Mode Imputation
In Python
# Note: sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# its replacement is SimpleImputer in sklearn.impute
import numpy as np
from sklearn.impute import SimpleImputer
values = mydata.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"

Imputation of Categorical Variables

  1. Mode imputation is one method, but it will definitely introduce bias.
  2. Missing values can be treated as a separate category in itself: we create another level for the missing values and use it as its own category. This is the simplest method.
  3. Prediction models: here, we create a predictive model to estimate the values that will substitute the missing data. We divide the data set into two sets: one with no missing values for the variable (training) and one with missing values (test). Methods such as logistic regression and ANOVA can be used for prediction.
  4. Multiple imputation.
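The prediction-model approach (method 3) can be sketched with scikit-learn. The data, column names, and model choice below are made-up illustrations, not the author's implementation:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data: predict the missing 'gender' from 'age' and 'income'
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29, 44, 60],
    "income": [30, 45, 80, 90, 55, 35, 75, 95],
    "gender": ["F", "F", "M", "M", None, "F", "M", None],
})

known   = df[df["gender"].notna()]  # rows with the value present ("training")
missing = df[df["gender"].isna()]   # rows with the value missing ("test")

# Fit a classifier on the complete rows, then fill the gaps with its predictions
model = LogisticRegression()
model.fit(known[["age", "income"]], known["gender"])
df.loc[df["gender"].isna(), "gender"] = model.predict(missing[["age", "income"]])

print(df["gender"].tolist())
```

For a continuous variable, the same split works with a regressor (e.g. LinearRegression) in place of the classifier.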

There are algorithms that can be made robust to missing data, such as k-Nearest Neighbors, which can ignore a column in the distance measure when a value is missing.

There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees.

Sadly, the scikit-learn implementations of decision trees and k-Nearest Neighbors are not robust to missing values.
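A common workaround is to impute before fitting. Since version 0.22, scikit-learn also ships a KNNImputer that fills each gap from the nearest complete rows, in the spirit of the k-Nearest Neighbors idea above (the array here is a toy example):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

# Each NaN is replaced by the mean of that feature over the k nearest
# neighbours, with distances computed on the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the NaN row's two nearest neighbours (by the second feature) are [3, 4] and [8, 8], so the gap is filled with (3 + 8) / 2 = 5.5.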
