Handling Missing Data (Brief review of Kaggle Data Cleaning Challenge)

The most challenging thing about data science is dirty data. We always wonder why the data collection process is so bad, and why every field can't simply be made mandatory. But that is not always possible, and hence we are responsible for handling the dirty data ourselves. Dirty data comes in many forms: missing values, duplicate records, incorrect data, inconsistent data, and so on.

This post is a review of the "Kaggle Data Cleaning Challenge". We will use the following techniques to clean dirty data:

  1. Handling Missing Values
  2. Scaling and Normalizing Data
  3. Handling Date values
  4. Handling Character Encoding
  5. Handling Inconsistent Data

Missing Values: Dealing with missing values is part and parcel of a data scientist's job. The most erroneous approach to handling missing values is to simply drop them. This removes important information from our data and can introduce bias. For example, in the Titanic dataset, if you delete every row with a missing value and are left mostly with rows where the gender is Female, your subsequent model will most likely predict that none of the "Male" passengers survived.

Missing data arises mainly for two reasons: either the data does not exist, or the data exists but was not recorded.

If the data does not exist (like the height of the oldest child of someone who does not have any children), then we cannot impute it. The best approach is to keep those values as missing. I personally replace them with placeholder text such as "Unknown", or with a numerical value that does not affect my data.

If the data exists but was not recorded, then we can use existing techniques to impute those values. Before applying any technique, carefully study your data: refer to the data dictionary of the dataset, or approach the person who created the data, and try to understand every field.

Best practices for handling missing values (a short code sketch follows the list):

  1. Take a count of missing values and also find what proportion of the data is missing. This gives you a baseline to measure your cleaning work against.
  2. If you are sure that a column containing missing values is not required in your analysis, drop the column instead of dropping rows.
  3. Use logic to impute values. E.g. if you have the city and state available, you can derive the corresponding ZIP code for that combination.
  4. Replacing missing values with the mean of the available values is not always the best approach, but it is a quick one.
  5. Scikit-learn's Imputer class (SimpleImputer in recent versions) comes in handy when dealing with numerical missing data.
  6. Pandas' fillna() is your best friend when replacing missing values.
  7. If nothing works, you can simply replace missing values with custom text and drop them later as required.
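To make the list concrete, here is a minimal sketch on a small made-up DataFrame, covering points 1, 5 and 6 with pandas and scikit-learn (note that scikit-learn's older Imputer class has since been replaced by SimpleImputer):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [29.0, np.nan, 44.0, 35.0, np.nan],
    "city": ["Pune", None, "Mumbai", "Pune", "Delhi"],
})

# 1. Count missing values and the proportion of data that is missing
print(df.isnull().sum())
print(df.isnull().mean())  # fraction missing per column

# 6. fillna(): replace missing text values with a placeholder
df["city"] = df["city"].fillna("Unknown")

# 5. SimpleImputer for numerical columns (here, mean imputation)
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])

print(df)
```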

Scaling and Normalizing Values: Once you have filled in the missing pieces of the data, the next step is to change the scale of the data. So why do we need to scale or normalize our data? Say you are dealing with currencies, and suppose 1 US Dollar is equal to 100 Yen. Without scaling, 1 US Dollar has the same importance as 100 Yen for our machine learning algorithm. To overcome this, we have to scale our data proportionally.

Scikit-learn's preprocessing library provides various methods for scaling data. In this Kaggle exercise we have used min-max scaling, as in the sketch below. We can see that the scaled data retains its shape, while only the boundary (min and max) values change.
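Here is a minimal sketch using scikit-learn's MinMaxScaler on made-up currency data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices: column 0 in US Dollars, column 1 in Yen (1 USD = 100 JPY)
prices = np.array([[1.0, 100.0],
                   [2.5, 250.0],
                   [4.0, 400.0]])

# MinMaxScaler rescales each column independently to the [0, 1] range
scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)
print(scaled)
# Both columns now hold identical values: the shape of the data is
# preserved, only the min/max boundaries have changed
```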

Normalization is a more radical transformation, which changes our data so that it follows a normal distribution, also called a Gaussian distribution. After this transformation, the shape of our data forms a bell curve.
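A common way to do this on positive-valued data is the Box-Cox transformation from scipy; a minimal sketch on made-up skewed data:

```python
import numpy as np
from scipy import stats

# Made-up, heavily skewed (exponential) data; Box-Cox needs positive values
original = np.random.exponential(size=1000)

# boxcox returns the transformed data and the fitted lambda parameter
normalized, fitted_lambda = stats.boxcox(original)
print(f"fitted lambda = {fitted_lambda:.3f}")
# A histogram of `normalized` will look much closer to a bell curve
```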

Handling Date Values: Most of the time, our data contains at least some form of date. It can be a year alone, a date in month/day/year format, or a date with a timestamp. Since there is still no single standard format for expressing dates, date formats are often inconsistent across datasets. Hence it is advisable to convert the dates in your data into a standard format.

  1. Pandas uses a specific date type called "datetime64". You can read about the same here.
  2. To convert our data into a standard date type, we can use pandas' "to_datetime()" function, which takes the data and the format in which the current data is expressed. So if our data is "29/10/1992" then the format string will be format="%d/%m/%Y", and if our data is "05-20-92" then the format string will be format="%m-%d-%y" (see the sketch below).
  3. This link can be referred to for more Python date-format directives.

Note: If we have multiple date formats within the same column, we can pass "infer_datetime_format=True" to the to_datetime function. This will not always work, and it also slows down parsing.
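A minimal sketch of both approaches, on a made-up column of date strings:

```python
import pandas as pd

# Hypothetical date column stored as plain strings
dates = pd.Series(["29/10/1992", "05/01/1993", "17/06/1994"])

# Parse with an explicit format string (day/month/four-digit year)
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dtype)  # datetime64[ns]

# Let pandas infer the format instead; slower, and not guaranteed to work
# (recent pandas versions deprecate this flag in favour of format="mixed")
guessed = pd.to_datetime(dates, infer_datetime_format=True)
```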

Handling Character Encoding: Ever run into data that contains gibberish symbols you do not know how to deal with? Blame the wrong character encoding. By default, Python uses 'UTF-8' encoding, which is the standard text encoding.

  1. If the file is not 'UTF-8' encoded and you do not specify the correct encoding while importing it, you will run into trouble. Instead of trying to guess the encoding, we can use the chardet module, which can identify the correct encoding (along with a confidence score).
  2. A thing to remember is that chardet's guess is based on the sample of data you give it. Hence the more lines you read, the better the result will be.
  3. A good example is covered in the Kaggle notebook; a minimal sketch follows this list.
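A minimal sketch of the chardet workflow, assuming a hypothetical file named dirty_data.csv:

```python
import chardet
import pandas as pd

# Read a chunk of raw bytes and let chardet guess the encoding;
# a larger sample generally yields a more confident guess
with open("dirty_data.csv", "rb") as f:  # hypothetical file name
    guess = chardet.detect(f.read(100000))
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

# Pass the detected encoding to pandas when loading the file
df = pd.read_csv("dirty_data.csv", encoding=guess["encoding"])
```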

Handling Inconsistent Data: Adam, AADAM, aDAM! We know they all mean the same thing, but our algorithms consider them different. This arises from inconsistent data entry, and we have to address these issues before we start building any models. To identify similar words, we can start with the fuzzywuzzy package. It calculates the distance between two strings and returns a ratio: the closer the ratio is to 100, the smaller the distance between the two strings.
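A minimal sketch with fuzzywuzzy, using a made-up list of inconsistent names:

```python
from fuzzywuzzy import fuzz, process

# Hypothetical column of inconsistently entered names
names = ["Adam", "AADAM", "aDAM", "Eve", "Adam Smith"]

# Ratio between two strings: the closer to 100, the more similar
print(fuzz.token_sort_ratio("Adam", "aDAM"))  # 100: case is ignored

# process.extract ranks every candidate against a query string
matches = process.extract("adam", names, scorer=fuzz.token_sort_ratio, limit=5)
print(matches)

# Simple cleanup rule: treat anything scoring above a threshold as the
# same (canonical) spelling
close_matches = [name for name, score in matches if score >= 90]
```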

Reference: Kaggle Data Cleaning Challenge

Nikhil Bhatewara

Data Science, Experimentation, Product Analytics


Kyle McKiou, Kate Strachnyi, Nic Ryan, Randy Lao, Vivek Kulkarni, Favio Vázquez, Tarry Singh, Beau Walker, Andreas Kretz: thank you all for inspiring me to share my knowledge and learnings. I would really appreciate it if you could share your best practices for handling missing data.
