Data cleaning with help of python with pandas.

In this post, we will go through Number of data cleaning task using python library. We focus on data cleaning task for missing value.After reading of this article you surely understand data cleaning process.

According to Team of IBM Data analytic Team “ 80% time spend on data cleaning and organizing data & 20% of time on data analysis & modeling”.

Reason for missing values in Data set.

1.      User forget till up data.

2.      User just ignore filled data

3.      Non- mandatory field submitting form.

4.      Data loss during transfer due to slow internet  

5.      Due to programming error.

As you can see above-mentioned reason are simple due to which we spend 80% of time on data cleaning. It might another deeper reason also which due to which data is missing.

Today I will tell you how to detect missing value & do the same Basis Imputation.

As everyone always try to “ Impute missing value with the help of mean & median”

1.Before you go for data cleaning you surely or can be good idea just get feed of your data set. After you can plan for data cleaning.

To explain you, am considering Below data set example 

No alt text provided for this image

as you can see that above is vary small data set. I will firstly explain you standard missing values.

Standard missing values -

Standard missing values refer to those values which detected by pandas library.If you look at original data set at "ID " column in data-set,3rd row of "ID" column has empty cell and 7th row there is "NA" values in data set which are clearly missing values.

Let see how the pandas deal with such values.

No alt text provided for this image
No alt text provided for this image

You can find that pandas filled missing values with "Nan" values.Using isnull() take look at output we can say that pandas filled missed values with help of Boolean expression missing values are treated as "True"

In next section,We considered same example where pandas will not recognized the missing values.

To understand this,consider "Bedroom_num" column.If you clearly check this column you will have the 3 missing values.

  1. N/A
  2. NA
  3. --
No alt text provided for this image

We already seen that NA and blank values are replace with NAN values by pandas but what about N/A and --. As you can see that these are small typing mistake by user.

Let see how pandas deal with such missing values.

No alt text provided for this image

If you check out of isnull() particularly for bedroom column we can see that -- this values considered as correct data by pandas data frame.

To deal with such problem,we simply create list of missing values and when you import data then pandas will recognized was missing values.

No alt text provided for this image

Unexpected Standard missing values

Unexpected meas we are not expecting missing values in particular column.To make you more explanatory consider "Present" column has all data to be string but there is one numeric values.

No alt text provided for this image

You can see that in 4th row of present column you will see the 13 as numeric values.Let see how our pandas deal with such data.as per data we gone through the we understand that own occupied should hence only Y or N.So technically it should be missing value.So to deal with situation, where are no of approaches for detecting such type of missing value.You can see that 13 consider as correct data and blank values considered as "NAN" values

No alt text provided for this image

To deal with such situation i found 4 line of code and run for loop through such column and check that if value is changed to Integer then its missing value OR else its not.

Count = 0
for row in df [“ count,” own occupied”:
        try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED']=np.nan
        except ValueError:
        pass
    cnt+=1

In above code

1.      We, running for loop through selected column.

2.      If value can be change to Integer the we mark this value as missing with help of numpy

3.      If value does not change we will keep as is and pass

Here I used try and expectation Block to Handle value error.If no is not integer, value error will be returned and code will stop there.

After handing missing values,we can replace this missing values with single value.such as

data["Bedroom_num"].fillna("Y",inplace  = True)











Ramkrishna Vyankatesh Odla

Crowdstrike | Ex-Accenturite | Software Test Engineer | AEM Content Author | ISTQB?

4 年

Great.

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了