
EXPLORATORY DATA ANALYSIS

Priyanshi Agarwal

SE at Lowe's India ||Ex-Intern at Hewlett Packard Enterprise||Mody University Student

Published: June 23, 2021

Data Set used: supermarket sales

What is EDA?

EDA is an approach to analyzing a dataset in order to summarize its main characteristics, such as the relationships between different features.

We can do EDA using statistical graphics and other data visualization methods (for example with the pandas library, or with seaborn/matplotlib for plotting).

Why do we need EDA?

Exploratory Data Analysis is valuable to data science projects because it brings us closer to the certainty that future results will be valid, correctly interpreted, and applicable to the desired business contexts.

In this article I use the supermarket sales dataset to explain the EDA process.

Let's first get to know our dataset (what it contains).

The dataset holds the historical sales of a supermarket company, recorded at 3 different branches over 3 months.

Attribute information

Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
City: Location of supercenters
Customer type: Type of customer, recorded as Members for customers using a member card and Normal for those without one.
Gender: Gender type of customer
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in $
Quantity: Number of products purchased by customer
Tax: 5% tax fee for customer buying
Total: Total price including tax
Date: Date of purchase (Record available from January 2019 to March 2019)
Time: Purchase time (10am to 9pm)
Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
COGS: Cost of goods sold
Gross margin percentage: Gross margin percentage
Gross income: Gross income
Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)
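To make the Tax and Total attributes concrete, here is a small worked example for a single invoice line. The numbers are made up, and the relation COGS = unit price x quantity is an assumption about how the columns fit together, not something stated in the attribute list.

```python
# Hypothetical invoice line; the values are invented for illustration.
unit_price = 20.00   # price of one item in $
quantity = 3         # number of items bought

TAX_RATE = 0.05      # the 5% tax from the attribute list

cogs = unit_price * quantity       # assumed: COGS = unit price x quantity
tax = round(cogs * TAX_RATE, 2)    # 5% of the pre-tax amount
total = round(cogs + tax, 2)       # total price including tax

print(cogs, tax, total)
```

If the assumption holds, Tax and Total are both fixed multiples of COGS, which is worth keeping in mind when reading a correlation table for these columns.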

Importing dataset

import pandas as pd  # provides read_csv and the DataFrame used below
df = pd.read_csv("C:/Users/DELL/Desktop/PROJECTS/supermarket sales analysis/market.csv")

What does the dataset look like?

We can see it using the following code:

df.head()

Output:

So here is our data; the output shows its first five rows and its many features.

Let's learn a little about the features in the dataset.

df.info()

This tells us the data type of every column, along with its non-null count.

Let's now get some statistical knowledge about our dataset.

df.describe()

Here we get the summary statistics of our dataset.

We can now see the min, max, mean, etc. of every numerical column.

Now let's check whether our dataset contains any null values, so that we can treat them by dropping the rows that have null values, by replacing them with a particular value, or by replacing them with the mean of that column.

What happens if we don't treat them properly?

It may lead to wrong predictions or to a poor model.

df.isnull().sum()

From the above result we can conclude that none of our columns contains null values.
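Our dataset has no nulls, but here is a minimal sketch of the three treatments mentioned earlier (dropping, filling with a fixed value, filling with the column mean). The `toy` frame and its values are made up for illustration and are not part of the supermarket data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing rating; invented values, not the real dataset.
toy = pd.DataFrame({
    "Branch": ["A", "B", "C", "A"],
    "Rating": [8.0, np.nan, 6.0, 7.0],
})

print(toy.isnull().sum())  # number of missing values per column

# Option 1: drop every row that contains a null
dropped = toy.dropna()

# Option 2: replace nulls in Rating with a fixed value
filled_const = toy.fillna({"Rating": 0.0})

# Option 3: replace nulls in Rating with the mean of that column
filled_mean = toy.fillna({"Rating": toy["Rating"].mean()})

print(filled_mean)
```

Which option is right depends on the column and on how many values are missing; the mean only makes sense for numeric columns.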

Let's check the various product lines present in our dataset (a product line in business is a group of related products under the same brand name manufactured by a company).

df.Productline.unique()

The above are the various product lines present.

Let's check how many sales records fall under each product line.

df.Productline.value_counts()

Here's the count for every product line. Since value_counts() counts rows, from the above results we can say that fashion accessories appear in the most sales records.
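One thing to keep in mind: value_counts() counts rows (sales records), not units sold. If we want the number of units, we can sum the Quantity column per product line instead. Here is a small sketch on a made-up `toy` frame (the values are invented) showing that the two answers can differ.

```python
import pandas as pd

# Invented rows: two small fashion sales and one large food sale.
toy = pd.DataFrame({
    "Productline": ["Fashion accessories", "Fashion accessories", "Food and beverages"],
    "Quantity": [1, 2, 10],
})

# Number of sales records per product line
rows_per_line = toy["Productline"].value_counts()

# Number of units sold per product line
units_per_line = toy.groupby("Productline")["Quantity"].sum()

print(rows_per_line)
print(units_per_line)
```

In this toy data fashion accessories have the most records, while food and beverages have the most units.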

Let's now check the number of sales records from every branch (i.e. branches A, B and C).

df.Branch.value_counts()

Now we can say that branch A has recorded the most sales.

Let's now check how many customers used each payment method.

df.Payment.value_counts()

We can see that the largest number of customers paid using Ewallet.

Let's now check the correlation (a statistical term describing the degree to which two variables move in coordination with one another) between every pair of numeric fields.

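# numeric_only (pandas 1.5+) skips text columns, which pandas 2.x would otherwise reject
df.corr(numeric_only=True)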

From the correlation table we can conclude many things. Let's understand this by taking the Total column:

Total is highly correlated with the tax on it. This is easy to understand: if the tax amount increases, the total amount also increases.
Total is also positively correlated with the quantity purchased (positive correlation means that when the value of one variable increases, the other's value tends to increase too). If the quantity increases, the total amount also increases.
Total is also positively correlated with unit price. If the unit price increases, the total amount also increases.
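To see why these correlations come out the way they do, here is a small sketch on made-up numbers. It assumes the pre-tax amount is unit price x quantity and that tax is 5% of it; the `toy` frame is invented and only borrows the column names used in this article.

```python
import pandas as pd

# Invented unit prices and quantities, not taken from the real dataset.
toy = pd.DataFrame({
    "Unitprice": [10.0, 20.0, 30.0, 40.0],
    "Quantity": [2, 1, 4, 3],
})

# Assumed relations between the columns
toy["cogs"] = toy["Unitprice"] * toy["Quantity"]
toy["Tax 5%"] = toy["cogs"] * 0.05
toy["Total"] = toy["cogs"] + toy["Tax 5%"]

corr = toy.corr()  # every column here is numeric
print(corr["Total"].round(3))
```

Because Tax is an exact multiple of Total under these assumptions, their correlation is 1, while Quantity and Unitprice are positively but not perfectly correlated with Total.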

Let's now dig deeper into this analysis using data visualization:

Let's check the relation between total amount and gross income.

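import matplotlib.pyplot as plt
import seaborn as sns
# Line plot of gross income against the total amount
plt.figure(figsize=(10,7))
sns.set(rc={'axes.facecolor':'pink','axes.grid': True,'xtick.labelsize':16})
sns.lineplot(x='Total',y='grossincome',data=df,linewidth=10,color='green')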

These two columns are linearly related to each other. We saw the same result in the correlation table (their correlation value was 1).


Let's create a histogram (a graph that shows the frequency distribution) for the column named 'Branch'.

plt.figure(figsize=(10,7))
sns.histplot(df['Branch'],color='green')

We can say that branch A has the most sales records (the same result we got when we analyzed using pandas).

Let's create a histogram for the column named 'Gender'.

plt.figure(figsize=(10,9))
sns.histplot(df['Gender'],color='green')

We can say that there are slightly more female customers than male customers.

Let's create a histogram for the column named 'Payment', which denotes the payment type.

plt.figure(figsize=(10,9))
sns.histplot(df['Payment'],color='green')

We can say that customers have mostly used the Ewallet payment method (the same result we got when we analyzed using pandas).

Let's create a line plot to see the relation between two different columns.

Here the columns are quantity bought and gross income, and we view the result branch-wise.

plt.figure(figsize=(10,9))
sns.lineplot(x=df['Quantity'],y=df['grossincome'],hue=df['Branch'],linewidth=10)

We can see a roughly linear relationship between the two attributes.

Next, the columns are unit price and tax.

plt.figure(figsize=(10,9))
sns.lineplot(x=df['Unitprice'],y=df['Tax 5%'])

The relationship fluctuates, but overall it is roughly linear between the two attributes.

Let's create a histogram for the column named 'Productline'.

plt.figure(figsize=(10,7))
sns.histplot(df['Productline'],color='green')
plt.xticks(rotation=90)

We can say that fashion accessories have the most sales records (the same result we got when we analyzed using pandas).

Let's create a bar plot between the fields Productline and Quantity purchased by a customer (a bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle, and gives some indication of the uncertainty around that estimate using error bars).

plt.figure(figsize=(10,7))
sns.barplot(x='Productline',y='Quantity',data=df)
plt.xticks(rotation=90)
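By default, seaborn's barplot uses the mean as its estimate of central tendency, so the bar heights should match a plain pandas group-by mean. Here is a rough sketch of that number check on a made-up `toy` frame (the values are invented, not taken from the dataset).

```python
import pandas as pd

# Invented quantities for two product lines.
toy = pd.DataFrame({
    "Productline": ["Health and beauty", "Health and beauty",
                    "Sports and travel", "Sports and travel"],
    "Quantity": [2, 4, 5, 9],
})

# Mean quantity per product line: the value each bar's height represents
mean_quantity = toy.groupby("Productline")["Quantity"].mean()
print(mean_quantity)
```

Printing these means next to the plot is a quick way to read exact bar heights instead of estimating them from the axis.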

Let's create a bar plot between the fields Branch and Rating given by a customer.

plt.figure(figsize=(10,7))
sns.barplot(x='Branch',y='Rating',data=df)

We can say that branches A and C are rated highest by customers.

Let's create a line plot to see the relation between two different columns; here the columns are unit price and rating.

plt.figure(figsize=(10,7))
sns.lineplot(x=df['Unitprice'],y=df['Rating'])

Here the result fluctuates too much, so we can't say much about the relation between these two fields.

Let's create a line plot to see the relation between two different columns; here the columns are unit price and quantity.

plt.figure(figsize = (10,7))

sns.lineplot(x = 'Unitprice', y = 'Quantity',data = df)

Here too the result fluctuates too much, so we can't say much about the relation between these two fields.
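When a line plot is too noisy to read, a single correlation coefficient can back up the visual impression. Below is a minimal sketch using pandas' Series.corr on a made-up `toy` frame; the values are invented and chosen so that the two columns have no linear relationship.

```python
import pandas as pd

# Invented values with no linear trend between the two columns.
toy = pd.DataFrame({
    "Unitprice": [1, 2, 3, 4, 5],
    "Quantity": [2, 1, 3, 1, 2],
})

# Pearson correlation coefficient between the two columns
r = toy["Unitprice"].corr(toy["Quantity"])
print(round(r, 3))
```

A value near 0 suggests there is little linear relationship, while values near 1 or -1 suggest a strong one. On the real data the same call would be `df['Unitprice'].corr(df['Quantity'])`.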

Let's create a line plot to see the relation between two different columns; here the columns are grossincome and quantity.

plt.figure(figsize = (10,7))

sns.lineplot(x = 'grossincome', y = 'Quantity',data = df)

Here we can say that there is a high correlation between these two (the same result we got in the correlation table).

This is how we do the EDA part of our projects.

This was a small EDA that I have performed. We can do a lot more!
