EXPLORATORY DATA ANALYSIS
Priyanshi Agarwal
SE at Lowe's India ||Ex-Intern at Hewlett Packard Enterprise||Mody University Student
Data Set used: supermarket sales
What is EDA?
EDA is an approach to analyzing a dataset in order to summarize its main characteristics, such as the relationships between different features.
We can do EDA using statistical graphics and other data visualization methods (for example, with the pandas library or with seaborn/matplotlib).
Why do we need EDA?
Exploratory Data Analysis is valuable to data science projects because it brings us closer to certainty that future results will be valid, correctly interpreted, and applicable to the desired business contexts.
In this article I use the supermarket sales dataset to explain the EDA process.
Let's first get to know our dataset (what it contains).
The dataset contains the historical sales of a supermarket company, recorded at 3 different branches over 3 months.
Attribute information
Importing dataset
import pandas as pd
df = pd.read_csv("C:/Users/DELL/Desktop/PROJECTS/supermarket sales analysis/market.csv")
What does the dataset look like?
We can see it using the following code:
df.head()
Output:
So here is our data, which looks like the output above and has many features.
Let's learn something about the features in the dataset:
df.info()
This tells us the data type of every column.
Let's now get some statistical knowledge about our dataset:
df.describe()
Here we get the statistics of our dataset.
We can now see the min, max, mean, etc. of every numerical column.
Now let's check whether our dataset contains any null values, so that we can treat them by dropping the rows that contain nulls, by replacing them with a particular value, or by replacing them with the mean of that column.
What happens if we don't treat them properly?
It may lead to wrong predictions or to a poorly built model.
df.isnull().sum()
From the above result we can conclude that none of our columns contains null values.
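Our dataset has no nulls, but the three treatments mentioned above can still be sketched on a small made-up frame. The frame `toy` and its values are hypothetical and are not taken from the supermarket dataset; this is only one possible way to apply each treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Branch": ["A", "B", None, "C"],
    "Rating": [7.0, np.nan, 9.0, 5.0],
})

# How many nulls does each column contain?
null_counts = toy.isnull().sum()

# Option 1: drop every row that has at least one null value.
dropped = toy.dropna()

# Option 2: replace nulls in a column with a particular value.
filled_constant = toy.fillna({"Branch": "Unknown"})

# Option 3: replace nulls in a numeric column with that column's mean.
filled_mean = toy.assign(Rating=toy["Rating"].fillna(toy["Rating"].mean()))

print(null_counts)
print(filled_mean)
```

Dropping rows is the simplest choice but loses data, so filling is often preferred when only a few values are missing.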
Let's check the various product lines (a product line in business is a group of related products under the same brand name manufactured by a company) present in our dataset:
df.Productline.unique()
The above are the various product lines present.
Let's check how many sales were recorded for every product line:
df.Productline.value_counts()
Here's the count for every product line. From the above results we can say that fashion accessories are the most frequently sold product line.
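One thing to keep in mind is that value_counts() counts rows (sales records), not the number of items in each sale. If we want the number of items sold, we can sum the Quantity column per product line instead. The sketch below uses made-up rows (the frame `toy` is hypothetical) to show that the two views can disagree.

```python
import pandas as pd

# Hypothetical rows; column names follow the article, values are made up.
toy = pd.DataFrame({
    "Productline": ["Fashion accessories", "Fashion accessories",
                    "Food and beverages", "Food and beverages",
                    "Fashion accessories"],
    "Quantity": [1, 2, 10, 5, 1],
})

# Number of rows (sales records) per product line.
transactions = toy["Productline"].value_counts()

# Number of items sold per product line.
items_sold = toy.groupby("Productline")["Quantity"].sum()

print(transactions)
print(items_sold)
```

In this toy data fashion accessories have the most sales records, while food and beverages have the most items sold.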
Let's now check the number of sales recorded at every branch (i.e. branches A, B, and C):
df.Branch.value_counts()
Now we can say that branch A has recorded the most sales.
Let's now check how many customers have used each payment method:
df.Payment.value_counts()
We can see that the largest number of customers paid using Ewallet.
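If we care about shares rather than raw counts, value_counts() also accepts normalize=True. A small sketch, assuming a made-up `payments` series (the values are hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical payment records; the values are made up for illustration.
payments = pd.Series(
    ["Ewallet", "Cash", "Ewallet", "Credit card", "Ewallet"],
    name="Payment",
)

# Absolute counts per payment method.
counts = payments.value_counts()

# Share of each payment method, as a fraction of all rows.
shares = payments.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```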
Let's now check the correlation (a statistical term describing the degree to which two variables move in coordination with one another) between every pair of numerical fields:
df.select_dtypes(include='number').corr()
From the correlation table we can conclude many things. Let's understand this by looking at the total amount column.
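To make the numbers in a correlation table less mysterious, here is a minimal sketch of how a single Pearson correlation value can be computed by hand and compared with what pandas returns. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; the values are made up for illustration.
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4, 5],
    "Total": [12.0, 19.0, 33.0, 41.0, 48.0],
})

x = toy["Quantity"].to_numpy(dtype=float)
y = toy["Total"].to_numpy(dtype=float)

# Pearson correlation: co-variation of the two columns divided by
# the product of their individual spreads.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same value as computed by pandas.
r_pandas = toy["Quantity"].corr(toy["Total"])

print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relation.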
Let's now dig deeper into this analysis using data visualization.
Let's check the relation between total amount and gross income:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 7))
sns.set(rc={'axes.facecolor': 'pink', 'axes.grid': True, 'xtick.labelsize': 16})
sns.lineplot(x='Total', y='grossincome', data=df, linewidth=10, color='green')
These two columns are linearly related to each other. We saw the same result in the correlation table (their correlation value was 1).
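A correlation of exactly 1 is what we would expect when one column is just a fixed multiple of another. The sketch below assumes, purely for illustration, that gross income is 5% of the total; both the rate and the values are hypothetical and are not read from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical totals; the 5% rate below is an assumption for illustration.
total = pd.Series([100.0, 250.0, 40.0, 615.0], name="Total")
gross_income = (0.05 * total).rename("grossincome")

# Correlation between a column and a scaled copy of itself.
r = total.corr(gross_income)

# A straight-line fit recovers the slope and a near-zero intercept.
slope, intercept = np.polyfit(total.to_numpy(), gross_income.to_numpy(), deg=1)

print(round(r, 6), round(slope, 4), round(intercept, 4))
```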
Let's create a histogram (a graph that shows a frequency distribution) for the column named 'Branch':
plt.figure(figsize=(10,7))
sns.histplot(df['Branch'],color='green')
We can say that branch A has recorded the most sales (the same result we got when we analyzed using pandas).
Let's create a histogram for the column named 'Gender':
plt.figure(figsize=(10,9))
sns.histplot(df['Gender'],color='green')
We can say that there are slightly more female customers than male customers.
Let's create a histogram for the column named 'Payment', which denotes the payment type:
plt.figure(figsize=(10,9))
sns.histplot(df['Payment'],color='green')
We can say that customers have mostly used the Ewallet payment method (the same result we got when we analyzed using pandas).
Let's create line plots to examine the relation between two different columns.
1. Here the columns are quantity and gross income:
plt.figure(figsize=(10,9))
sns.lineplot(x=df['Quantity'],y=df['grossincome'],hue=df['Branch'],linewidth=10)
We can see a roughly linear relationship between the two attributes.
2. Here the columns are unit price and tax:
plt.figure(figsize=(10,9))
sns.lineplot(x=df['Unitprice'],y=df['Tax 5%'])
The relationship fluctuates, but overall it is roughly linear between the two attributes.
Let's create a histogram for the column named 'Productline':
plt.figure(figsize=(10,7))
sns.histplot(df['Productline'],color='green')
plt.xticks(rotation=90)
We can say that fashion accessories have been sold the most often (the same result we got when we analyzed using pandas).
Let's create a bar plot between the fields Productline and Quantity (the quantity purchased by a customer). A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle, and uses error bars to give some indication of the uncertainty around that estimate.
plt.figure(figsize=(10,7))
sns.barplot(x='Productline',y='Quantity',data=df)
plt.xticks(rotation=90)
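The numbers behind such a bar plot can also be looked at directly. The sketch below computes the mean quantity per product line (the bar height) and the standard error of the mean as a simple stand-in for the error bars. Note that seaborn's barplot by default draws a bootstrapped 95% confidence interval, so the standard error here is a swapped-in, simpler measure and not the same calculation. The frame `toy` and its values are made up.

```python
import pandas as pd

# Hypothetical rows; column names follow the article, values are made up.
toy = pd.DataFrame({
    "Productline": ["Health and beauty"] * 4 + ["Sports and travel"] * 4,
    "Quantity": [2, 4, 6, 8, 5, 5, 6, 4],
})

# Bar height: mean quantity per product line.
# Uncertainty: standard error of the mean (a simple stand-in for
# seaborn's default bootstrap confidence interval).
summary = toy.groupby("Productline")["Quantity"].agg(["mean", "sem", "count"])

print(summary)
```

Both toy product lines have the same mean, but the first one is more spread out, so its error bar would be longer.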
Let's create a bar plot between the fields Branch and Rating (the rating given by a customer):
plt.figure(figsize=(10,7))
sns.barplot(x='Branch',y='Rating',data=df)
We can say that branches A and C are the most highly rated by customers.
Let's create a line plot to examine the relation between two different columns. Here the columns are unit price and rating:
plt.figure(figsize=(10,7))
sns.lineplot(x=df['Unitprice'],y=df['Rating'])
Here the result fluctuates too much, so we can't say much about the relation between these two fields.
Let's create a line plot to examine the relation between two different columns. Here the columns are unit price and quantity:
plt.figure(figsize = (10,7))
sns.lineplot(x = 'Unitprice', y = 'Quantity',data = df)
Here too the result fluctuates too much, so we can't say much about the relation between these two fields.
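When a line plot fluctuates this much, one option is to summarize the relation with numbers instead. The sketch below, on made-up data (the frame `toy` and the bin edges are hypothetical choices), computes a single correlation value and the average quantity inside a few price bins.

```python
import pandas as pd

# Hypothetical rows; column names follow the article, values are made up.
toy = pd.DataFrame({
    "Unitprice": [12.0, 18.0, 35.0, 44.0, 61.0, 70.0, 88.0, 95.0],
    "Quantity": [3, 9, 1, 7, 10, 2, 6, 4],
})

# One number for the strength of the linear relation (between -1 and 1).
r = toy["Unitprice"].corr(toy["Quantity"])

# Average quantity inside a few equal-width price bins.
price_bin = pd.cut(toy["Unitprice"], bins=[0, 25, 50, 75, 100])
binned_mean = toy.groupby(price_bin, observed=True)["Quantity"].mean()

print(round(r, 3))
print(binned_mean)
```

A correlation near 0 together with bin averages that show no steady trend supports the reading that the two fields have little linear relation.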
Let's create a line plot to examine the relation between two different columns. Here the columns are gross income and quantity:
plt.figure(figsize = (10,7))
sns.lineplot(x = 'grossincome', y = 'Quantity',data = df)
Here we can say that there is a high correlation between these two fields (the same result we got in the correlation table).
This is how we do the EDA part of our projects.
So this was a small EDA that I performed. We can do a lot more!