Data cleaning with help of python with pandas.
Akshay Vilayatkar
Data Scientist |Machine Learning |Python|Power BI l Data Analysis | SQL | DAX | Tableau | DataIKU l Domains— Finance l Media l Banking
In this post, we will go through Number of data cleaning task using python library. We focus on data cleaning task for missing value.After reading of this article you surely understand data cleaning process.
According to Team of IBM Data analytic Team “ 80% time spend on data cleaning and organizing data & 20% of time on data analysis & modeling”.
Reason for missing values in Data set.
1. User forget till up data.
2. User just ignore filled data
3. Non- mandatory field submitting form.
4. Data loss during transfer due to slow internet
5. Due to programming error.
As you can see above-mentioned reason are simple due to which we spend 80% of time on data cleaning. It might another deeper reason also which due to which data is missing.
Today I will tell you how to detect missing value & do the same Basis Imputation.
As everyone always try to “ Impute missing value with the help of mean & median”
1.Before you go for data cleaning you surely or can be good idea just get feed of your data set. After you can plan for data cleaning.
To explain you, am considering Below data set example
as you can see that above is vary small data set. I will firstly explain you standard missing values.
Standard missing values -
Standard missing values refer to those values which detected by pandas library.If you look at original data set at "ID " column in data-set,3rd row of "ID" column has empty cell and 7th row there is "NA" values in data set which are clearly missing values.
Let see how the pandas deal with such values.
You can find that pandas filled missing values with "Nan" values.Using isnull() take look at output we can say that pandas filled missed values with help of Boolean expression missing values are treated as "True"
In next section,We considered same example where pandas will not recognized the missing values.
To understand this,consider "Bedroom_num" column.If you clearly check this column you will have the 3 missing values.
- N/A
- NA
- --
We already seen that NA and blank values are replace with NAN values by pandas but what about N/A and --. As you can see that these are small typing mistake by user.
Let see how pandas deal with such missing values.
If you check out of isnull() particularly for bedroom column we can see that -- this values considered as correct data by pandas data frame.
To deal with such problem,we simply create list of missing values and when you import data then pandas will recognized was missing values.
Unexpected Standard missing values
Unexpected meas we are not expecting missing values in particular column.To make you more explanatory consider "Present" column has all data to be string but there is one numeric values.
You can see that in 4th row of present column you will see the 13 as numeric values.Let see how our pandas deal with such data.as per data we gone through the we understand that own occupied should hence only Y or N.So technically it should be missing value.So to deal with situation, where are no of approaches for detecting such type of missing value.You can see that 13 consider as correct data and blank values considered as "NAN" values
To deal with such situation i found 4 line of code and run for loop through such column and check that if value is changed to Integer then its missing value OR else its not.
Count = 0 for row in df [“ count,” own occupied”: try: int(row) df.loc[cnt, 'OWN_OCCUPIED']=np.nan except ValueError: pass cnt+=1
In above code
1. We, running for loop through selected column.
2. If value can be change to Integer the we mark this value as missing with help of numpy
3. If value does not change we will keep as is and pass
Here I used try and expectation Block to Handle value error.If no is not integer, value error will be returned and code will stop there.
After handing missing values,we can replace this missing values with single value.such as
data["Bedroom_num"].fillna("Y",inplace = True)
Crowdstrike | Ex-Accenturite | Software Test Engineer | AEM Content Author | ISTQB?
4 年Great.