EXPLORATORY DATA ANALYSIS

Data Set used: supermarket sales

What is EDA?

EDA is an approach to analyzing a dataset in order to summarize its main characteristics, such as the relationships between different features.

We can do EDA using statistical graphics and other data visualization methods (for example with the pandas library, or with seaborn/matplotlib for plotting).

Why do we need EDA?

Exploratory Data Analysis is valuable to data science projects since it brings us closer to the certainty that future results will be valid, correctly interpreted, and applicable to the desired business contexts.

In this article I have taken the supermarket sales dataset to explain the EDA process.

Let's first get to know our dataset (what the dataset contains).

The dataset holds the historical sales of a supermarket company, recorded in 3 different branches over 3 months.

Attribute information

  1. Invoice id: Computer generated sales slip invoice identification number
  2. Branch: Branch of supercenter (3 branches are available identified by A, B and C).
  3. City: Location of supercenters
  4. Customer type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
  5. Gender: Gender type of customer
  6. Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
  7. Unit price: Price of each product in $
  8. Quantity: Number of products purchased by customer
  9. Tax: 5% tax fee charged on the customer's purchase
  10. Total: Total price including tax
  11. Date: Date of purchase (Record available from January 2019 to March 2019)
  12. Time: Purchase time (10am to 9pm)
  13. Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
  14. COGS: Cost of goods sold
  15. Gross margin percentage: Gross margin percentage
  16. Gross income: Gross income
  17. Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
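To see how the money columns in this list fit together, here is a minimal sketch. The function name invoice_amounts is my own, and I am assuming that COGS is unit price times quantity and that gross income is the total minus COGS; the attribute list does not spell either of these out.

```python
def invoice_amounts(unit_price: float, quantity: int, tax_rate: float = 0.05) -> dict:
    """Derive the money columns of one invoice (hypothetical helper)."""
    cogs = unit_price * quantity        # assumed: cost of goods sold, before tax
    tax = tax_rate * cogs               # the 5% tax column
    total = cogs + tax                  # total price including tax
    gross_income = total - cogs         # assumed definition; equals the tax here
    gross_margin_pct = 100 * gross_income / total
    return {
        "cogs": cogs,
        "tax": tax,
        "total": total,
        "gross_income": gross_income,
        "gross_margin_pct": gross_margin_pct,
    }


row = invoice_amounts(unit_price=50.0, quantity=4)
print(row)
```

Under these assumptions the gross margin percentage is the same for every invoice (0.05 / 1.05 of the total), because tax and total are both fixed multiples of COGS.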

Importing dataset

import pandas as pd

df = pd.read_csv("C:/Users/DELL/Desktop/PROJECTS/supermarket sales analysis/market.csv")

What does the dataset look like?

We can see it using the following code:

df.head()

The output shows the first five rows of our data, which has many features (one per column).

Let's learn more about the features in the dataset:

df.info()

This tells us the datatype of every column, along with its non-null count.
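One thing the datatypes usually reveal is that a date column read from a CSV arrives as text. A small sketch of converting it, using a made-up three-row frame (the values and the month/day/year format are assumptions for illustration, not taken from the real file):

```python
import pandas as pd

# Hypothetical sample; the real data comes from market.csv.
sample = pd.DataFrame({
    "Date": ["1/5/2019", "3/8/2019", "2/20/2019"],
    "Total": [548.97, 80.22, 340.53],
})
print(sample.dtypes)  # Date is a text column at this point

# Parse the text into real timestamps (assumed month/day/year format).
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# With a datetime column we can, for example, total the sales per month.
monthly_total = sample.groupby(sample["Date"].dt.month)["Total"].sum()
print(monthly_total)
```

Once the column is a datetime, month, weekday and hour become available for grouping, which plain text does not allow.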

Let's now get some summary statistics for our dataset:

df.describe()

Here we get the statistics of our dataset: the count, mean, standard deviation, min, max and quartiles of every numerical column.

Now let's check whether our dataset contains any null values. If it does, we can treat them by dropping the rows or columns that contain nulls, by replacing them with a particular value, or by replacing them with the mean of that column.

What happens if we don't treat them properly?

It may lead to wrong predictions or to a poor model.

df.isnull().sum()

From the above result we can conclude that none of our columns contains null values.
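For a dataset that does contain nulls, the three treatments mentioned above could look roughly like this. The toy frame and the "Unknown" placeholder are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "Quantity": [5.0, np.nan, 7.0, 8.0],
    "Payment": ["Cash", "Ewallet", None, "Cash"],
})

missing_per_column = toy.isnull().sum()

# Option 1: drop every row that has at least one null.
dropped = toy.dropna()

# Option 2: replace nulls with a particular value.
filled_constant = toy.fillna({"Payment": "Unknown"})

# Option 3: replace nulls in a numeric column with that column's mean.
filled_mean = toy.fillna({"Quantity": toy["Quantity"].mean()})

print(missing_per_column)
print(dropped)
print(filled_constant)
print(filled_mean)
```

Which option is right depends on the column: dropping is simplest but loses data, while mean filling only makes sense for numeric fields.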

Let's check the various product lines present in our dataset (a product line in business is a group of related products under the same brand name manufactured by a company):

df.Productline.unique()

The above are the various product lines present.

Let's check how many transactions were recorded for every product line:

df.Productline.value_counts()

Here is the count for every product line. Each row of the dataset is one invoice, so value_counts() counts transactions rather than units sold; by that measure, fashion accessories is the most frequently purchased product line.
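The number of transactions and the number of units sold can rank product lines differently. A minimal sketch on made-up rows (the numbers are purely illustrative) shows how to get both:

```python
import pandas as pd

# Hypothetical invoices: many small fashion purchases, few large food ones.
toy = pd.DataFrame({
    "Productline": [
        "Fashion accessories", "Fashion accessories", "Fashion accessories",
        "Food and beverages", "Food and beverages",
    ],
    "Quantity": [1, 2, 1, 6, 5],
})

# Number of invoices per product line.
transactions = toy["Productline"].value_counts()

# Number of units sold per product line.
units_sold = (
    toy.groupby("Productline")["Quantity"].sum().sort_values(ascending=False)
)

print(transactions)
print(units_sold)
```

In this toy example fashion accessories leads on transactions while food and beverages leads on units, so it is worth checking both on the real data before calling a product line the "most sold".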

Let's now check the number of transactions recorded at every branch (i.e. branches A, B and C):

df.Branch.value_counts()

Now we can say that branch A recorded the most transactions.

Let's now check how many customers used each payment method:

df.Payment.value_counts()

We can see that the largest number of customers paid using Ewallet.
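Raw counts do not show how big the lead is. As a small sketch, value_counts(normalize=True) turns the counts into shares; the payments below are made up for illustration:

```python
import pandas as pd

# Hypothetical payments; the real column comes from market.csv.
payments = pd.Series(
    ["Ewallet", "Cash", "Ewallet", "Credit card", "Cash", "Ewallet"],
    name="Payment",
)

# Fraction of transactions per payment method (the fractions sum to 1).
shares = payments.value_counts(normalize=True)
print((shares * 100).round(1))
```

If the shares turn out to be close to each other, "most used" is a much weaker statement than it sounds.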

Let's now check the correlation (a statistical term describing the degree to which two variables move in coordination with one another) between every pair of numerical fields:

# numeric_only=True skips text columns such as Branch and Payment
df.corr(numeric_only=True)

We can conclude many things from the correlation table. Let's understand this by taking the total amount column:

  1. Total amount is highly correlated with the tax on it. This is easy to understand: the tax is a fixed 5% of the purchase, so when the tax amount increases the total amount increases too.
  2. Total amount is also positively correlated with the quantity purchased (positive correlation means that when one variable increases, the other tends to increase as well). If the quantity increases, the total amount increases too.
  3. Total amount is also positively correlated with unit price. If the unit price increases, the total amount increases too.
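These three observations can be reproduced on synthetic data. The sketch below generates random prices and quantities (not the real dataset) and assumes the total is price times quantity plus 5% tax:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic invoices; price and quantity are drawn independently.
toy = pd.DataFrame({
    "Unitprice": rng.uniform(10, 100, n).round(2),
    "Quantity": rng.integers(1, 11, n),
})
cogs = toy["Unitprice"] * toy["Quantity"]   # assumed pre-tax amount
toy["Tax 5%"] = 0.05 * cogs
toy["Total"] = cogs + toy["Tax 5%"]

# Pearson correlation of every column with Total.
corr_with_total = toy.corr()["Total"]
print(corr_with_total)
```

Tax is an exact multiple of the total, so its correlation comes out as 1, while unit price and quantity each explain only part of the total and land somewhere between 0 and 1.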

Let's now dig deeper into this analysis using data visualization.

Let's check the relation between total amount and gross income:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,7))
sns.set(rc={'axes.facecolor':'pink','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='Total',y='grossincome',data=df,linewidth=10,color='green')

These two columns are linearly related to each other. We saw the same result in the correlation table (their correlation value was 1).

Let's create a histogram (a graph that shows a frequency distribution) for the column 'Branch':

plt.figure(figsize=(10,7))
sns.histplot(df['Branch'],color='green')

We can say that branch A has recorded the most transactions (the same result we got when we analyzed using pandas).

Let's create a histogram for the column 'Gender':

plt.figure(figsize=(10,9))
sns.histplot(df['Gender'],color='green')

We can say that female customers are slightly more numerous than male customers.

Let's create a histogram for the column 'Payment', which denotes the payment type:

plt.figure(figsize=(10,9))
sns.histplot(df['Payment'],color='green')

We can say that customers have mostly used the Ewallet payment method (the same result we got when we analyzed using pandas).

Let's create line plots to see the relation between two different columns.

  1. Here the columns are quantity bought and gross income, and we view the result branch-wise:

plt.figure(figsize=(10,9))
sns.lineplot(x=df['Quantity'],y=df['grossincome'],hue=df['Branch'],linewidth=10)

We can see a roughly linear relationship between the two attributes.

  2. Here the columns are unit price and tax:

plt.figure(figsize=(10,9))
sns.lineplot(x=df['Unitprice'],y=df['Tax 5%'])

The relationship fluctuates, but overall it is roughly linear.

Let's create a histogram for the column 'Productline':

plt.figure(figsize=(10,7))
sns.histplot(df['Productline'],color='green')
plt.xticks(rotation=90)

We can say that fashion accessories appear in the most transactions (the same result we got when we analyzed using pandas).

Let's create a bar plot between the fields Productline and Quantity purchased by a customer. (A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle, and gives some indication of the uncertainty around that estimate using error bars.)

plt.figure(figsize=(10,7))
sns.barplot(x='Productline',y='Quantity',data=df)        

plt.xticks(rotation=90)
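The numbers behind such a bar plot can also be computed directly. As far as I know, seaborn's barplot shows the mean by default with a bootstrapped 95% confidence interval as the error bar; the sketch below approximates that interval with the standard error instead, on made-up data:

```python
import pandas as pd

# Hypothetical quantities for two product lines.
toy = pd.DataFrame({
    "Productline": ["Fashion accessories"] * 3 + ["Food and beverages"] * 3,
    "Quantity": [2, 4, 6, 5, 5, 8],
})

# Mean (bar height), standard error and group size per product line.
summary = toy.groupby("Productline")["Quantity"].agg(["mean", "sem", "count"])

# Normal-approximation 95% interval; not identical to seaborn's bootstrap.
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```

When the intervals of two bars overlap heavily, the difference between their heights should not be read as a real difference between the groups.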
        



Let's create a bar plot between the fields Branch and Rating given by a customer:

plt.figure(figsize=(10,7))
sns.barplot(x='Branch',y='Rating',data=df)

We can say that branches A and C have the highest average customer ratings.

Let's create a line plot to see the relation between two different columns; here the columns are unit price and rating:

plt.figure(figsize=(10,7))
sns.lineplot(x=df['Unitprice'],y=df['Rating'])

Here the result fluctuates too much, so we can't say much about the relation between these two fields.

Let's create a line plot to see the relation between two different columns; here the columns are unit price and quantity:

plt.figure(figsize = (10,7))
sns.lineplot(x = 'Unitprice', y = 'Quantity',data = df)

Here too the result fluctuates too much, so we can't say much about the relation between these two fields.
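When a line plot is too noisy to read, a single correlation number can say whether there is any linear trend at all. A minimal sketch on synthetic data, where the columns are drawn independently on purpose so it only demonstrates the method and says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic, independent columns standing in for the real ones.
toy = pd.DataFrame({
    "Unitprice": rng.uniform(10, 100, n),
    "Rating": rng.uniform(4, 10, n),
    "Quantity": rng.integers(1, 11, n),
})

# Pearson correlation between pairs of columns.
r_rating = toy["Unitprice"].corr(toy["Rating"])
r_quantity = toy["Unitprice"].corr(toy["Quantity"])
print(f"Unitprice vs Rating:   {r_rating:.3f}")
print(f"Unitprice vs Quantity: {r_quantity:.3f}")
```

A value near 0 matches the "too much fluctuation" picture, while a value near 1 or -1 would mean the noise is hiding a real trend.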

Let's create a line plot to see the relation between two different columns; here the columns are gross income and quantity:

plt.figure(figsize = (10,7))
sns.lineplot(x = 'grossincome', y = 'Quantity',data = df)

Here we can say that there is a high correlation between these two (the same result we got in the correlation table).

This is how we do the EDA part of our projects.

So this was a small EDA that I have performed. We can do a lot more!





