Data Wrangling For Data Analysis
With Example Code and Explanation



Data wrangling is the process of converting data from one format into another format that supports better and more accurate analysis.

Is Data Wrangling Really Important for Analysis?

Yes, it is essential for better analysis.

Consider a scenario in which you have a dataset with many empty fields, feature (attribute) names that are hard to understand, and columns stored as object (string) instead of int or float.

Is this data format suitable for your analysis? And if you use it anyway, do you think the model will give accurate output? (Maybe, maybe not.)

Data wrangling helps analysts resolve exactly this kind of confusion and avoid these failure scenarios.

Before applying data wrangling, you must understand what the dataset describes, what type of data it holds, and from which sources the data was collected.

  • Understand the attributes (columns) of the dataset.
  • What type of data does each column contain?
  • Are the attribute (column) names understandable?
  • How many empty values does the dataset contain?
  • Can we remove the empty fields, or should we replace them with something?
  • With these answers you have a basic idea about the dataset (a quick inspection sketch follows below).
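
As a quick sketch of how these checks look in pandas (this assumes the DataFrame df has already been loaded; loading is shown in the next section):

print(df.head())          # first few rows: what the data looks like
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # number of empty (NaN) values per column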


Let's Discover With An Example:

I use a car dataset for this analysis [Download Dataset].

First, import the necessary packages (which packages you need depends on the operations you want to apply to the dataset):

pandas: To read data from URLs or different file formats, and to manipulate the data in the dataset.

NumPy: To perform mathematical operations.

matplotlib: For data visualization.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
        

In this program, I load the dataset from a URL:

file_path="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
df = pd.read_csv(file_path,names=headers)        

I want to see the row at index 2, so that I get a basic idea about the dataset.

print(df.iloc[2])        
Output: the row at index 2

In the dataset, some fields contain the value "?".

These "?" values are placeholders for missing data, so we need to change them to NaN.

For now, I want to find out which columns contain the "?" symbol, so I use the .isin() method to target it and the .any() method to check whether it is present:
print(df.isin(["?"]).any())
df.replace('?',np.nan,inplace=True)        

"inplace=True" The changes directly applied in the dataset.

Now I want to find out how many NaN values are present in the dataset, and then replace them with appropriate values.

checknull() is a helper function that tells us, for each individual column, whether any empty values are present, and how many entries are missing versus not:

def checknull():
    # for each column, print how many entries are missing (True) vs present (False)
    missing = df.isnull()
    for cm in missing.columns.tolist():
        print(cm)
        print(missing[cm].value_counts())
        print("")

checknull()
Sample output for one column; True represents a NaN value
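
A shorter alternative to checknull() is to sum the Boolean mask directly; a minimal sketch that gives the same per-column counts:

print(df.isnull().sum())  # number of missing values in each column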


We can see that some columns have missing values. One of them is "normalized-losses", which has 4 missing values in total, so we replace them with the mean:
avgmean = df["normalized-losses"].astype("float").mean(axis=0)
df["normalized-losses"].replace(np.nan, avgmean, inplace=True)
# we replaced the values successfully

The column "num-of-doors" has 2 NaN values, and its type is object (string). If it were numeric we could replace with the mean, but for a string column we instead use the most frequent value. .value_counts() counts the occurrences of each value, and .idxmax() returns the value with the highest count:

numofdoors=df["num-of-doors"].value_counts().idxmax()
df["num-of-doors"].replace(np.nan,numofdoors,inplace=True)        

The column "bore" has totally 4 empty Values. so replace with the mean.

"axis=0" means apply on the operation at row wise.

df["bore"].replace(np.nan,df["bore"].astype("float").mean(axis=0),inplace=True)        

The column "stroke" also has 4 NaN values and its values are of type float, so we replace the empty values with the mean. axis=0 refers to the row axis, axis=1 to the column axis, and inplace=True applies the edits directly to the dataset.

df["stroke"].replace(np.nan,df["stroke"].astype("float").mean(axis=0),inplace=True)        

".astype() " is used for type conversion like int->float,float->int e.t.c.

where mean operation is preferred for float.
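
To illustrate why the conversion is needed: columns such as "bore" were read in as object (strings) because of the "?" placeholders, so .mean() only works after converting. A small sketch:

print(df["bore"].dtype)                        # object, not float, in this dataset
bore_mean = df["bore"].astype("float").mean()  # convert first, then take the mean
print(bore_mean)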

The column "horsepower" has 2 empty fields; we replace them with the mean.

df["horsepower"].replace(np.nan,df["horsepower"].astype("float").mean(axis=0),inplace=True)        

The column "peak-rpm" has 2 empty fields replaced by mean.

df["peak-rpm"].replace(np.nan,df["peak-rpm"].astype("float").mean(axis=0),inplace=True)        

The column "price" has 4 empty values in total. Here we drop the rows where the price is missing, because price is the main factor among all the features in the dataset.

df.dropna(subset=["price"],inplace=True)
df.reset_index(drop=True,inplace=True)        

Change the data types:

We can change the data types of multiple columns at once by passing a list of column names:
df[["bore", "stroke","price","peak-rpm"]] = df[["bore", "stroke","price","peak-rpm"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")        

Data Standardization:

You usually collect data from different agencies in different formats. (Data standardization is also the term for a particular type of normalization, where you subtract the mean and divide by the standard deviation.)
Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.
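
For completeness, a minimal sketch of the z-score standardization mentioned in parentheses above (computed into a separate variable; it is not part of the unit-conversion example that follows):

# z-score standardization: subtract the mean, divide by the standard deviation
length_z = (df["length"] - df["length"].mean()) / df["length"].std()
print(length_z.head())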

Example:

Transform mpg to L/100km

df["city-L/100KM"]=235/df["city-mpg"]
df.drop("city-mpg",axis=1,inplace=True)
df["highway-mpgmpg to L/100km"]=235/df["highway-mpg"]
df.rename(columns={"highway-mpg":"highway-L/100KM"},inplace=True)        

We converted the data from one unit to another, renamed the column accordingly, and deleted the old column.


Data Normalization:

Normalization is the process of transforming values of several variables into a similar range.

We normalize values based on their scale. Suppose a column contains values between 10,000 and 70,000: it is hard to plot it or compare it with other variables directly. In such situations, normalization helps us analyze the data effectively.

Some Normalization Techniques

We implement simple feature scaling on the attributes "length", "width" and "height": each value is divided by the maximum of its column, so the results fall in the range (0, 1].

df["length"]=df["length"]/df["length"].max()
df["width"]=df["width"]/df["width"].max()
df["height"]=df["height"]/df["height"].max()        

Binning:

Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.

Example:

In this data set, "horsepower" is a real-valued variable ranging from 48 to 288, and it has 59 unique values. What if you only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? You can rearrange them into three 'bins' to simplify the analysis.

Use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins.

df["horsepower"] = df["horsepower"].astype("float")  # still object dtype after replacing "?"
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
groups_names = ["LOW", "MEDIUM", "HIGH"]
df["horsepower_binned"] = pd.cut(df["horsepower"], bins, labels=groups_names, include_lowest=True)

We create a new column called "horsepower_binned" for better analysis.

np.linspace(<start_value>, <end_value>, <num>) generates <num> evenly spaced numbers over the range; with 4 edges we get 3 equal-width bins.
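
A tiny self-contained example of what np.linspace returns:

print(np.linspace(0, 9, 4))  # [0. 3. 6. 9.] -> 4 edges define 3 equal-width bins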

Visualization before and after binning:

plt.hist(df["horsepower"])
plt.show()
# reindex keeps the bar heights aligned with the label order,
# because value_counts() sorts by count by default
plt.bar(groups_names, df["horsepower_binned"].value_counts().reindex(groups_names))
plt.show()
Before and After Binning



Now there is a column called "fuel-type" that contains the values gas and diesel. A model does not understand categorical values, only numerical ones, so we convert them to numbers using the pd.get_dummies() method. This method creates one indicator column per category and assigns 1 if that value is present in the row, else 0:
getdummi=pd.get_dummies(df["fuel-type"])
df=pd.concat([df,getdummi],axis=1)
indicator=pd.get_dummies(df["aspiration"])
df=pd.concat([df,indicator],axis=1)
df.drop("fuel-type",axis=1,inplace=True)
df.drop("aspiration",axis=1,inplace=True)        

The pd.concat() method is used to add the extra columns to the existing dataset; creating new indicator features like this is a form of feature engineering.

df.drop() is used to drop the existing columns; axis=1 means the entire column is dropped.

Finally, we can save the new CSV file for further processes such as evaluating models.

The .to_csv() function saves the modified dataset under the new name "cleanCar.csv" (pass index=False if you do not want the row index written as an extra column).

df.to_csv("cleanCar.csv")        
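
As an optional sanity check, a small sketch that reads the saved file back and confirms its shape:

clean = pd.read_csv("cleanCar.csv")  # re-load the cleaned dataset
print(clean.shape)                   # rows x columns after wrangling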

That is why data wrangling is useful for better analysis: after the transformation, the dataset is ready for model evaluation.

Download full code at: Download


