Exploratory Data Analysis

Exploratory data analysis (EDA) is a crucial process used to summarize, visualize, and understand the data being analyzed. It's typically the first step in data analysis and helps data scientists build a thorough understanding of the data, including its properties and potential applications. In this article, we'll provide an overview of EDA and walk through a practical example of its application.

The fundamental objective of EDA is to understand the nature of the data. To achieve this, data scientists employ visualization tools, statistical models, and other techniques to explore it. This process facilitates the identification of patterns, relationships, and trends in the data, which can be leveraged to generate insights and make informed decisions.

To illustrate, let's consider a dataset containing customer demographics, purchase history, and other pertinent attributes for an online retailer. The aim of the analysis is to identify transaction patterns among customers from various countries.

Step 1: Examine the data structure and basic statistics

One can utilize basic summary statistics such as mean, median, mode, standard deviation, and range to accomplish this task. Additionally, the data can be represented graphically through histograms, scatter plots, and other methods to obtain a more comprehensive understanding of its distribution and variability.

To begin analyzing the online retailer dataset, one might examine fundamental statistics such as the average purchase amount, purchase frequency, and the age distribution of customers. Relationships between variables like purchase amount and customer age can also be investigated through the creation of scatter plots or heat maps.

As we are aware, real-world data can be messy, so it must be cleaned to suit our requirements. Here is a snapshot of the original dataset in dataframe format after loading it.

[Image: snapshot of the raw dataset as a dataframe]
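For readers following along in pandas, a minimal sketch of this loading step might look like the following. The file name and encoding are assumptions, and the snake_case column names match the variable list below:

    import pandas as pd

    # Load the raw transactions; the file name and encoding are assumptions.
    df = pd.read_csv('ecommerce_data.csv', encoding='ISO-8859-1')

    # Rename the columns to the snake_case names used in this article.
    df.columns = ['invoice_num', 'stock_code', 'description', 'quantity',
                  'invoice_date', 'unit_price', 'cust_id', 'country']
    df['invoice_date'] = pd.to_datetime(df['invoice_date'])

    print(df.head())      # first few rows, as in the snapshot above
    print(df.describe())  # basic summary statistics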

While the variable names (column names) may seem self-explanatory, let's delve deeper into each variable's meaning:

InvoiceNo (invoice_num): A unique number assigned to each transaction

StockCode (stock_code): Code for the product being sold

Description (description): Name of the product being sold

Quantity (quantity): The number of products purchased in each transaction

InvoiceDate (invoice_date): The date and time of each transaction

UnitPrice (unit_price): The price of the product per unit

CustomerID (cust_id): A unique identifier assigned to each customer

Country (country): The name of the country where the transaction took place


It's evident that certain Customer IDs and Descriptions are missing from the dataset. Consequently, any rows with a missing value in either of these columns will be eliminated.

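A sketch of this cleanup step, assuming the dataframe loaded above:

    # Count missing values per column, then drop rows where
    # cust_id or description is missing.
    print(df.isnull().sum())
    df = df.dropna(subset=['cust_id', 'description'])

    # Customer IDs load as floats when NaNs are present; with the
    # missing rows gone they can be cast back to integers.
    df['cust_id'] = df['cust_id'].astype('int64')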


After comprehensively analyzing the data, we've made the following observations:

  • Quantity has negative values.
  • Unit Price has zero values, indicating free items.

Thus, we will remove any entries with a negative Quantity, since a quantity can't be negative (such rows typically represent cancelled orders or returns). Regarding Unit Price, we'll keep the zero values, treating them as free items. To determine the total amount spent on each purchase, we multiply Quantity by Unit Price: amount_spent = quantity * unit_price.
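In pandas, this filtering and the derived column might look like:

    # Keep only rows with a positive quantity; zero unit prices
    # (free items) are retained, as discussed above.
    df = df[df['quantity'] > 0]

    # Total amount spent on each line item.
    df['amount_spent'] = df['quantity'] * df['unit_price']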

Finally, we'll add a few new columns containing the Year_Month, Month, Day, and Hour for each transaction, to enable future analysis. The resulting dataframe will appear as follows.

[Image: the dataframe with the new time-based columns]
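A sketch of deriving those time-based columns with the pandas datetime accessors (the exact column names here are assumptions matching the article's naming):

    # Break the invoice timestamp into coarser time units.
    df['year_month'] = df['invoice_date'].dt.to_period('M')  # e.g. 2011-11
    df['month'] = df['invoice_date'].dt.month
    df['day'] = df['invoice_date'].dt.dayofweek + 1          # 1 = Monday ... 7 = Sunday
    df['hour'] = df['invoice_date'].dt.hour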

Step 2: Data Cleanup and Visualization

After gaining a fundamental comprehension of the data's structure, our subsequent task is to detect any outliers, missing values, or anomalies that could impact our analysis. To achieve this, we can employ several statistical techniques like box plots, scatter plots, and z-score analysis.
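As one illustration, a simple z-score screen on the amount spent could look like this sketch (the three-standard-deviation cutoff is a common rule of thumb, not a threshold from the original analysis):

    # Flag line items whose amount_spent is more than three standard
    # deviations from the mean.
    z = (df['amount_spent'] - df['amount_spent'].mean()) / df['amount_spent'].std()
    outliers = df[z.abs() > 3]
    print(f'{len(outliers)} potential outliers out of {len(df)} rows')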

In the e-commerce domain, we frequently want to identify where the customers who place the most orders and spend the most money come from, as they drive a company's sales. Upon analyzing the results, we discovered that most orders were placed in the UK, and customers from the Netherlands spent the most money on their purchases.

[Image: top 5 customers with the highest money spent]
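A sketch of the aggregations behind these findings, assuming the dataframe from the earlier steps:

    # Number of distinct orders placed from each country.
    orders_by_country = (df.groupby('country')['invoice_num']
                           .nunique()
                           .sort_values(ascending=False))
    print(orders_by_country.head())   # the UK leads on order volume

    # Top 5 customers by total amount spent, with their country.
    top_customers = (df.groupby(['cust_id', 'country'])['amount_spent']
                       .sum()
                       .sort_values(ascending=False)
                       .head(5))
    print(top_customers)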

Step 3: Statistical Analysis

After identifying outliers and anomalies, the next step is to develop hypotheses and test them against the data. This can be done with formal hypothesis tests or regression analysis, or with machine learning models that surface patterns and relationships in the data.

For example, in the online retailer dataset, we might develop hypotheses about the factors driving customer purchases, such as the time of day or the month of the year. We might then use regression analysis to test these hypotheses and identify the most important factors.
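As an illustration, a minimal regression sketch along these lines using statsmodels might look like the following; framing the question as "does the hour of day help explain how much is spent per order?" is an assumption made here for illustration:

    import statsmodels.formula.api as smf

    # Aggregate line items into one row per order, keeping the hour.
    per_order = (df.groupby(['invoice_num', 'hour'], as_index=False)['amount_spent']
                   .sum())

    # Regress order value on hour of day, treated as a categorical factor.
    model = smf.ols('amount_spent ~ C(hour)', data=per_order).fit()
    print(model.summary())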

[Image: number of orders for different hours]
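The hourly order counts behind this figure can be computed and plotted with a sketch like:

    import matplotlib.pyplot as plt

    # Number of distinct orders placed in each hour of the day.
    orders_by_hour = df.groupby('hour')['invoice_num'].nunique()
    orders_by_hour.plot(kind='bar')
    plt.xlabel('Hour of day')
    plt.ylabel('Number of orders')
    plt.show()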

Regarding the timing of transactions, there are no records of any transactions between 8:00 pm and 6:00 am the following day. Additionally, we observed that the firm receives the most orders at 12:00 pm, which could be because many customers make purchases during lunch hours, typically between 12:00 pm and 2:00 pm.

Step 4: Interpretation and Conclusion

Once we have thoroughly analyzed the data and obtained valuable insights, we can leverage these findings to make informed decisions and recommendations. This could involve devising targeted marketing campaigns that cater to specific customer segments or tweaking pricing strategies based on customer behavior patterns.

For instance, in the current example:

  • The customer from the Netherlands spends the highest amount on purchases
  • Sales were at their peak in November 2011
  • The company receives more orders from Monday to Thursday and fewer later in the week
  • The firm receives the most orders at 12:00 pm, likely because most customers make purchases during lunch hours, typically between 12:00 pm and 2:00 pm

In conclusion, EDA is a critical process used to explore and understand the nature of the data being analyzed. By combining visualization tools, statistical models, and other techniques, data scientists can identify patterns, trends, and relationships in the data that can be used to develop insights and make informed decisions.

Example taken from - https://towardsdatascience.com/exploratory-data-analysis-on-e-commerce-data-be24c72b32b2
