Python Practice Project : eCommerce Insights from Flipkart's Sales Data | Descriptive Analytics | Unlocking Revenue Growth Opportunities

Abhishek Mishra

Data Analyst @ Genpact | Ex TCS | Power BI | Python | SQL | Advance Excel

发布日期: 2023年10月21日

Objective :

To uncover business insights from the Flipkart sales data that could inform marketing strategies, inventory planning, and product development for increasing sales and enhancing customer experience.

Key business questions we aim to address:

What are the top selling product categories and brands? This can guide marketing efforts and inventory.
How do prices and discounts vary across categories? This can identify opportunities to adjust pricing.
How do product ratings vary across categories? Low rated categories can be improved.
What are the daily and hourly peak sales times? This can optimize operations and staffing.
What product features or keywords drive more 5-star reviews? This insight can guide product development.
How effective are discounts at driving sales? This can help plan promotion strategies.

Ultimately, the aim is data-driven decisions for better marketing, products, operations, and the customer experience. The analysis reveals patterns for these goals.

Resources:

Dataset: Flipkart Dataset on Kaggle
Jupyter Analysis File: On GitHub
Programming Language: Python
Analytic Tool: Jupyter Notebook

Libraries :

pandas: Optimizes data structure.
plotly and matplotlib: Data visualization.
More libraries are used such as for product description analysis.

Pre-processing the data

Import the necessary libraries.

import pandas as pd
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt

Load the 20,000 purchase dataset.

df = pd.read_csv("flipkart_com-ecommerce_sample.csv")

df.shape

Output :

(20000, 15)

Interpretation : dataset contains information for 20k purchases record with 15 columns attributes

df.columns

Output :

Index(['uniq_id', 'crawl_timestamp', 'product_url', 'product_name', 'product_category_tree', 'pid', 'retail_price', 'discounted_price', 'image', 'is_FK_Advantage_product', 'description', 'product_rating', 'overall_rating', 'brand', 'product_specifications'], dtype='object')

Key columns for analysis:

crawl_timestamp: Sales timestamp.
product_name: The name or title of the product being sold.
product_category_tree: Category hierarchy
retail_price: The original price of product.
discounted_price: The price of the product after applying discounts.
description: A description of the product, including its key features.
product_rating: 1 to 5 Star ratings.
brand: Manufacturer of the product.

Checking data types of columns

df.dtypes

Output :

uniq_id                     object
crawl_timestamp             object
product_url                 object
product_name                object
product_category_tree       object
pid                         object
retail_price               float64
discounted_price           float64
image                       object
is_FK_Advantage_product       bool
description                 object
product_rating              object
overall_rating              object
brand                       object
product_specifications      object
dtype: object

Observation: Most columns are of object type, but some need conversion, like crawl_timestamp, product_rating, and overall_rating.

df['crawl_timestamp'].head(3)

Output :

0    2016-03-25 22:59:23 +0000
1    2016-03-25 22:59:23 +0000
2    2016-03-25 22:59:23 +0000

Name: crawl_timestamp, dtype: object

converting crawl_timestamp into datetime to extract date and time easily.

df['timestamp']=pd.to_datetime(df['crawl_timestamp'])

Sample Output :

0   2016-03-25   22:59:23+00:00
1   2016-03-25   22:59:23+00:00
2   2016-03-25   22:59:23+00:00
Name: timestamp, dtype: datetime64[ns, UTC]

extracting 'date' and 'time' from timestamp

df['date'] = pd.to_datetime(df['timestamp'].dt.date)

df['time'] = df['timestamp'].dt.time

Sample Output :

dropping the original Date column

df.drop(['crawl_timestamp'], axis = 1,inplace=True)

Check for missing values.

df.isnull().sum()

Output :

uniq_id                       0
product_url                   0
product_name                  0
product_category_tree         0
pid                           0
retail_price                 78
discounted_price             78
image                         3
is_FK_Advantage_product       0
description                   2
product_rating                0
overall_rating                0
brand                      5864
product_specifications       14
timestamp                     0
date                          0
time                          0
dtype: int64

Observation :

retail_price and retail_price having null values. It will affect the analysis operation.
Missing values in other attributes have minimal impact for now.

We will address the missing values of retail_price and discounted_price.

Let's identify products with missing prices.

df[['product_name','date']][pd.isna(df['retail_price'])].head(10)

Output :

Observation : We have a diverse range of products with missing prices. Let's investigate using the median price to fill these null values.

Check the median prices.

print(df["retail_price"].median())

print(df["discounted_price"].median())

Output :

1040.0
550.0

Observation : Assigning median prices to all missing price products doesn't appear suitable due to the wide range of product categories.

Let's examine the dates to which these records with missing prices belong.

df['date'][pd.isna(df['retail_price'])].value_counts()

Output :

date
2015-12-30    16
2015-12-01    14
2015-12-31    12
2016-01-07    10
2016-03-25     4
2016-01-01     4
2015-12-29     4
2016-06-10     2
2016-03-07     1
2015-12-20     1
2016-04-29     1
2016-02-28     1
2016-04-16     1
2016-04-30     1
2016-04-28     1
2016-05-05     1
2016-05-07     1
2016-06-01     1
2016-01-06     1
2016-05-22     1
Name: count, dtype: int64

Observation : Dates are not concentraded to to specic months, indicating a wide range of dates. This means it is safe to delete those missing prices record.

df = df.dropna(subset=['retail_price', 'discounted_price'])

df.shape

Output :

(19922, 17)

Discount Percentage

Adding "discount_percentage" column for valuable insights.

Formula :

Discount = Retail Price - Discounted Price

Discount % = Discount / Retail Price * 100

df = df.assign(discount_percentage=((df['retail_price'] - df['discounted_price']) / df['retail_price']) * 100)

df[['discount_percentage','retail_price','discounted_price']].head()

Output :

Main Category and Subcategory

We have a product category column, but it's hierarchically structured. To make our analysis easier, we'll extract and separate the main category and subcategory.

# Sample record of column 'product_category_tree'

print(df.at[0, 'product_category_tree'])

Output :

["Clothing >> Women's Clothing >> Lingerie, Sleep & Swimwear >> Shorts >> Alisha Shorts >> Alisha Solid Women's Cycling Shorts"]

Generate 'main_category' from 'product_category_tree'

df['main_category'] = df['product_category_tree'].str.extract(r'^\["(.*?)(?: >> |"]\])')

Extract the subcategory

df['subcategory'] = df['product_category_tree'].str.extract(r' >> (.*?)(?: >> |"]\])')

df[['main_category','subcategory']].head()

Output :

    main_category  subcategory
0   Clothing       Women's Clothing
1   Furniture      Furniture
2   Footwear       Women's Footwear
3   Clothing       Women's Clothing
4   Pet Supplies   Grooming

Data preprocessing is done; now we can start the analysis.

Summary Statistics

Calculate numerical data statistics.

df.describe()

Output :

Observation:

The dataset holds 20k purchases.
Retail prices average ?3k with a standard deviation of about ?9k, and discounted prices avg ?2k, std dev ?7k, both ranging from ?350 to ?5,71,230.
Average discounts are 40.5% with a standard deviation of about 23.5%, range 0% to 96.5%, with key percentiles at 21.33% (25th), 45% (median), and 60% (75th).

Categorical Column Analysis

Analyze categorical columns, such as product categories, ratings, and brand.
Compute the count of unique values in each categorical column.

Count of unique values in categorical columns

df['main_category'].nunique()

Output:

Interpretation: There are 32 unique main categories in the dataset.

df['subcategory'].nunique()

Output:

Interpretation: There are 178 unique Subcategories.

df['product_rating'].nunique()

Output :

Interpretation: There are 36 unique product ratings between 1 and 5.

df['brand'].nunique()

Output :

Interpretation: There are 3,485 unique brands in the dataset

Product Categories and Subcategories:

Top-Selling Categories and Subcategories

Identify the top-selling main categories.

# Top 10 Sold Items

TopSels = df[['main_category']].value_counts()[:10]

Output :

main_category             
Clothing                      6171
Jewellery                     3522
Footwear                      1225
Mobiles & Accessories         1097
Automotive                    1010
Home Decor & Festive Needs     927
Beauty and Personal Care       709
Home Furnishing                700
Kitchen & Dining               645
Computers                      573
Name: count, dtype: int64

Visualize top category products.

fig = go.Figure(data=[go.Pie(labels=TopSels.index, values=TopSels)])

fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0.02, 0, 0], hole=.3)

fig.update_layout(annotations=[dict(text='Product', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Bestselling Products")

fig.show()

Output :

Observation:

Clothing dominates sales: Clothing is the top-selling category, accounting for 37% of sales.
Jewellery holds a significant share: Jewelry ranks second, capturing 21.2% of sales.
Footwear, mobiles, and automotive follow: Footwear, mobiles, and auto are the next top sellers, making up 24% of sales.
Home and personal care products: Home décor, beauty, home furnishing, kitchen, and computers complete the top 10, totaling about 22% of sales.
In summary, Flipkart caters to diverse customer preferences with a focus on fashion, lifestyle, and tech products.

Identify the top-selling main categories

TopSubs = df['subcategory'].value_counts().head(10)

Output :

subcategory
Women's Clothing                3876
Men's Clothing                  1771
Necklaces & Chains              1601
Accessories & Spare parts        923
Tablet Accessories               799
Women's Footwear                 780
Bangles, Bracelets & Armlets     722
Wrist Watches                    521
Kids' Clothing                   520
Tools                            399
Name: count, dtype: int64

Visualize product distribution in top main categories.

fig = go.Figure(data=[go.Pie(labels=TopSubs.index, values=TopSubs)])

fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0.02, 0, 0], hole=.3)

fig.update_layout(annotations=[dict(text='SubProd', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Bestselling Products Within Each Subcategory")

fig.show()

Output :

Observation:

Pie chart reflects strong fashion and lifestyle products demand.
Women's clothing and jewelry are key sales drivers.
Men's clothing and mobile accessories show diverse customer base. Tools in top 10 subcategories show platform's wide appeal to DIYers and tech enthusiasts.

Brand Analysis

Identify the unique brands in the dataset.

# Number of unique brands

unique_brands = df['brand'].nunique()

Output :

Top Brands

List top-selling brands by product count.

# Top 10 Brands in Demand

TopBrands = df['brand'].value_counts()[:10]

Output :

brand
Allure Auto     468
Regular         308
Voylla          299
Slim            284
TheLostPuppy    229
Karatcraft      211
Black           167
White           155
DailyObjects    144
Speedwav        141
Name: count, dtype: int64

Visualize brand-wise product distribution.

fig = go.Figure(data=[go.Pie(labels=TopBrands.index, values=TopBrands)])

fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0, 0, 0], hole=.3)

fig.update_layout(annotations=[dict(text='Brands', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Popular Brands")

fig.show()

Output :

Observation:

Allure Auto is the most popular brand capturing over 28% of total sales
Next three brand - Regular, Voylla, and Slim follow closely collectively account for over 25% of total sales
The remaining brands also feature prominently collectively accounting for over 22% of total sales

Discount Analysis by Brand

Top and bottom discount brands.

# fetch top brands consistently offer high discounts

df.groupby('brand')['discount_percentage'].mean().nlargest(10)

Output :

brand
Rajcrafts          96.533333
Bling              94.548458
Instella           91.719745
Bond Beatz         91.596639
KazamaKraft        90.565618
Mydress Mystyle    90.518987
CUBA               90.045023
SDZ                90.045023
Kaizer Jewelry     89.616918
Fash Blush         89.256770
Name: discount_percentage, dtype: float64

Observation:

Rajcrafts is the brand with the highest average discount of 96.53%. The other brands in the top 10 also offer significant discounts, with average discounts ranging from 94.55% to 89.25%.

# Lowest discount giving brands

df.groupby('brand')['discount_percentage'].mean().nsmallest(10)

Output:

brand
720 Armour          0.0
A Bit of Me         0.0
AMZER               0.0
ANASAZI             0.0
ARISE               0.0
ASIAN               0.0
ATHENA              0.0
AUROSHIKHA          0.0
AYESHA              0.0
Aaina Home Decor    0.0
Name: discount_percentage, dtype: float64

Brand Rating Analysis

Obtain unique product ratings.

df['product_rating'].unique()

Output:

array([nan, 5. , 3. , 3.6, 3.5, 4.4, 4. , 1. , 2.3, 4.8, 2.7, 4.5, 2.4, 3.2, 4.7, 2. , 3.3, 4.3, 3.8, 4.2, 3.7, 3.9, 2.5, 3.1, 4.1, 3.4, 4.6, 1.5, 2.8, 2.9, 4.9, 2.2, 2.6, 1.3, 1.7, 1.8])

Count Product with Ratings available

len(df[df['product_rating']!='No rating available'])

Output :

Observation: There are only 1,839 products available with ratings out of 19k purchases.

Find best and least-rated brands

Convert product ratings to a numeric format.

# Convert product ratings to a numeric format

df['product_rating'] = pd.to_numeric(df['overall_rating'], errors='coerce')

df['product_rating'].unique()

Output :

array([nan, 5. , 3. , 3.6, 3.5, 4.4, 4. , 1. , 2.3, 4.8, 2.7, 4.5, 2.4, 3.2, 4.7, 2. , 3.3, 4.3, 3.8, 4.2, 3.7, 3.9, 2.5, 3.1, 4.1, 3.4, 4.6, 1.5, 2.8, 2.9, 4.9, 2.2, 2.6, 1.3, 1.7, 1.8])

Top brands by average product ratings.

# Find the top 5 brands with the highest average product ratings.

df.groupby('brand')['product_rating'].mean().nlargest(5)

Output :

brand
ASIAN                5.0
Aarti Collections    5.0
Alda                 5.0
Ambitione            5.0
Aqua                 5.0
Name: product_rating, dtype: float64

Lowest rated brands

df.groupby('brand')['product_rating'].mean().nsmallest(5)

Output :

brand
Adidas      1.0
Bestech     1.0
Binatone    1.0
D-LINK      1.0
Darkpink    1.0
Name: product_rating, dtype: float64

Pricing Analysis

Price insights: range, discounts, trends.

Price Range Overview

Price stats: retail vs. discount.

df[['retail_price', 'discounted_price']].describe()

Output :

Observation:

Diverse price range (?35 to ?571,230), with products varying from low to high cost, and customers often receive discounts on their purchases. Retail prices exhibit more variability than discounted prices.

Price Distribution Visualization

Visualize retail vs. discount with histograms.

# Create histograms for retail prices and discounted prices

plt.subplot(1, 2, 1)
df['retail_price'].plot.hist(bins=20, title='Retail Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
df['discounted_price'].plot.hist(bins=20, title='Discounted Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

Output:

Observation:

The retail price distribution is skewed to the right, meaning that there are more products with lower prices than products with higher prices.
The discounted price distribution is also skewed to the right, but to a lesser extent than the retail price distribution. This suggests that Flipkart offers discounts on a wide range of products, including both low-priced and high-priced items.
The median retail price is ?1040, while the median discounted price is ?550. This suggests that Flipkart typically offers a discount of around 50% on its products.

Discount Analysis

Find average discount percentage.

df['discount_percentage'].mean()

Output :

40.523885297513765

Find categories with consistent high and low discounts.

df.groupby('main_category')['discount_percentage'].mean().nlargest()

Output :

main_category
Sunglasses                59.578151
Wearable Smart Devices    57.281300
Automotive                54.650991
Mobiles & Accessories     50.994470
Gaming                    48.522793
Name: discount_percentage, dtype: float64

df.groupby('main_category')['discount_percentage'].mean().nsmallest().sort_values(ascending=False)

Output :

main_category
Cameras & Accessories    19.441131
Automation & Robotics    14.995750
Food & Nutrition         11.207317
eBooks                    8.315594
Household Supplies        0.000000
Name: discount_percentage, dtype: float64

Observation:

The average discount percentage for all products is about 40.52%.
Sunglasses, Wearable Smart Devices, and Automotive have the highest average discounts, while Cameras & Accessories, Automation & Robotics have the lowest.
Some categories like Household Supplies have no discounts on average.

Price Trends Over Time

Calculate average retail and discounted prices over time.

price_trends = df.groupby('date')[['retail_price','discounted_price' ]].mean()

Sample Output :

date	  retail_price  discounted_price	
2015-12-01   1563.02     821.99
2015-12-03   1374.77     882.92
2015-12-04   6936.83     5958.81

price_trends['retail_price'].mean()

Output :

2073.1324195725047

price_trends['discounted_price'].mean()

Output :

1253.5775599556516

price_trends['retail_price'].std()

Output :

领英推荐

NEW from Maven Analytics on Medium!

Maven Analytics 1 年前

Data Science vs Software Engineering: Key Differences…

Pratibha Kumari J. 1 年前

How to Become a Highly Skilled Data Scientist?…

Analytics Insight? 5 个月前

1593.0146550077625

price_trends['discounted_price'].std()

Output :

1240.123053769857

Observation:

The dataset shows price trends over time, with an average retail price of ?2k and an average discounted price of ?1k. Retail prices have higher variability (std: ?1.5k) compared to discounted prices (std: ?1.2k).

Plot price trends over time

plt.plot(price_trends.index, price_trends['retail_price'], label='Retail Price', marker='o')

plt.plot(price_trends.index, price_trends['discounted_price'], label='Discounted Price', marker='o')

plt.title('Price Trends Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)

plt.show()

Output :

Observation :

The chart shows that the retail prices are decreasing over time, and discounts are also decreasing but at a slower rate.
Retailers might reduce prices to clear inventory or compete with rivals.
Retail prices peaked in early 2016 and have steadily fallen.
The gap between retail and discounted prices is growing.
The steepest price drop was in late 2016.
The price has been fairly steady recently.

Product Ratings

It provides insights on satisfaction and quality.

Product Rating Overview

Summary statistics for product ratings

rating_stats = df['product_rating'].describe()

Output :

count    1839.000000
mean        3.810332
std         1.260124
min         1.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         5.000000
Name: product_rating, dtype: float64

Visualize the distribution of product ratings

df['product_rating'].plot.hist(bins=20, title='Product Rating Distribution')

plt.xlabel('Product Rating')
plt.ylabel('Count')

plt.show()

Output :

Observation

The product rating distribution is skewed to the right, meaning that there are more products with higher ratings than products with lower ratings.
The median product rating is 4, which suggests that most Flipkart customers are satisfied with their purchases.
There is a wide range of product ratings, with some products having ratings as low as 1 and others having ratings as high as 5

Average Product Ratings by Category

df.groupby('main_category')['product_rating'].mean().nlargest()

Output :

main_category
Home Entertainment    5
Sunglasses            5
eBooks                5
Home & Kitchen        4
Kitchen & Dining      4.37
Name: product_rating, dtype: float64

df.groupby('main_category')['product_rating'].mean().nsmallest().sort_values(ascending=False)

Output :

main_category
Beauty and Personal Care    3.468807
Bags, Wallets & Belts       3.393333
Tools & Hardware            3.376364
Furniture                   2.833333
Home Improvement            2.250000
Name: product_rating, dtype: float64

Observations:

Home Entertainment, Sunglasses, and eBooks have the highest ratings (5.0), while Home Improvement, Furniture, and Tools & Hardware have the lowest (2.25 - 3.38).

Five-Star Rating Analysis

Explore 5-star products and brands.

prod5stars = df['main_category'][df['product_rating'] == 5].value_counts()

brand5stars = df['brand'][df['product_rating'] == 5].value_counts()

# Create a DataFrame to combine the outputs

data = {
    'Products Top 5-Star ': prod5stars.head().index,
    'Products Lowest 5 Star ': prod5stars.tail().index,
    'Brands Top 5 Stars': brand5stars.head().index,
    'Brands Lowest 5-Star Rated ': brand5stars.tail().index
}

Top5Star = pd.DataFrame(data)

Top5Star

Output :

Observation:

most popular product categories on Flipkart are all related to everyday needs and wants, and they offer a wide variety of products to choose from.
least popular product categories are all more specialized, and they may not be as relevant to all shoppers.
most popular brands on Flipkart are all well-known and respected, and they offer a wide variety of products to choose from.
least popular brands are all less well-known, and they may not offer as wide a variety of products to choose from.

5-star Products vs. Total Products

ProductCount = len(df)

Output :

RatedTotal = len(df[df['product_rating'].notna()])

Output :

Top5Stars = len(df[df['product_rating']==5])

Output :

Create a DataFrame for the funnel chart

funnel_data = pd.DataFrame({ 'stage': ['Total Products', 'Products with Rating', '5-Star Products'], 'Count': [ProductCount, RatedTotal, Top5Stars] })

Output :

stage	Count
0	Total Products	     19922
1	Products with Rating	1839
2	5-Star Products	      618

Observation:

The total number of products sold on Flipkart is 19922
The number of products with a rating is 1839. This means that only 9.2% of products have a rating.
The number of 5-star rated products is 618. This means that only 3.1% of products have a 5-star rating.

px.funnel(funnel_data, x='Count', y='stage', title='Product Rating Funnel')

Output :

Interpretation:

Most products on Flipkart lack ratings, likely because customers either don't rate or aren't motivated to.
A small portion of products has 5-star ratings, indicating general customer satisfaction with room for improvement.
The product rating chart suggests room for improvement, with few 5-star ratings, hinting at the need to simplify the rating process and address issues for better customer experience.

Time Analysis

It offers insights into sales trends and customer activity.

Time Trends Overview

Range and summary statistics of timestamps

df['timestamp'].describe(datetime_is_numeric=True)

Output :

Observation :

The median sale in the dataset occurred on 2015-12-31. This means that half of the sales occurred before this date and half of the sales occurred after this date.

Explore the distribution of timestamps.

df['timestamp'].hist(bins=20, grid=False)

plt.title('Timestamp Distribution')
plt.xlabel('Timestamp of Sales')
plt.ylabel('Sales Count')
plt.xticks(rotation=45)

plt.show()

Output :

Observation :

The chart shows that the majority of sales occurred in the first few months of the dataset, with a Sales peaked in late December 2015, gradually declining afterward.
Possible explanation: Data collected during a holiday promotion.
Dataset spans roughly 6 months.
Timestamp distribution skews left, indicating more sales in earlier months.

Daily & Monthly Sales Trends Analysis

Analyze daily product and transaction trends.

# Group data by date and calculate the number of transactions per day

daily_sales_count = df.groupby('date').size()

Sample Output :

date
2015-12-01    4482
2015-12-03     117
2015-12-04     209
2015-12-06     101
2015-12-12     950
2015-12-13     759
2015-12-14      20
2015-12-15      28
2015-12-20     828
2015-12-29     911
dtype: int64

# Group data by month and calculate the number of transactions per month

monthly_sales_count = df.groupby(df['date'].dt.to_period('M')).size()

Output :

date
2015-12    10268
2016-01     4835
2016-02      244
2016-03     1629
2016-04     1133
2016-05      770
2016-06     1043
Freq: M, dtype: int64

Plot daily & Monthly trends

plt.plot(daily_sales_count.index, daily_sales_count, label='Daily Sales Trend', marker='o')

plt.plot(monthly_sales_count.index.to_timestamp(), monthly_sales_count, label='Monthly Sales Trend', marker='o')

plt.title('Daily & Monthly Trends in sales')
plt.xlabel('Date')
plt.ylabel('Sales Count')
plt.legend()
plt.grid(True)

plt.show()

Output :

Observation & Interpretation :

Daily sales: Daily sales fluctuate significantly, with some days seeing over 4000 transactions and other days seeing less than 100 transactions.
Monthly sales: Monthly sales also fluctuate, but to a lesser extent than daily sales. Monthly sales have been increasing over time, with the exception of February 2016.
Daily sales vary due to factors like day, time, and promotions. Monthly trends depend on seasons and company growth.
Overall, sales are increasing month by month, indicating the platform's growing popularity.

Time-Based Analysis of Discounts

Analyze how discounts vary over time.

# Calculate average discounts per year and month

df['year'] = df['timestamp'].dt.year

df['month'] = df['timestamp'].dt.month

df[['year','month']].value_counts().reset_index()

Output :

   year  month     count
0  2015    12      10268
1  2016    1       4835
2  2016    3       1629
3  2016    4       1133
4  2016    6       1043
5  2016    5       770
6  2016    2       244

Observation:

The distribution of data by year and month shows that the majority of the records are from December 2015 (approximately 10,268 records). The data spans to June 2016, but the distribution is not uniform. There is a noticeable drop in records from December 2015 to January 2016, and then a gradual decline in subsequent months.

Calculate average discounts for each month

avg_discount_per_month = df.groupby(['year', 'month'])['discount_percentage'].mean().reset_index()

Output :

   year    month   discount_%
0  2015    12      41.807855
1  2016    1       37.543313
2  2016    2       37.897985
3  2016    3       40.811973
4  2016    4       37.566746
5  2016    5       41.658408
6  2016    6       44.239651

Observation:

In December 2015, the average discount percentage was approximately 41.81%, indicating significant discounts during the holiday season.
From January to April 2016, the average discount percentages ranged from about 37.54% to 40.81%, suggesting a slight decrease in discounts compared to December.
In May and June 2016, there was an increase in the average discount percentage, reaching approximately 41.66% in May and 44.24% in June.

Plot average discounts per month

# Plot average discounts per month

plt.plot(avg_discount_per_month['year'].astype(str) + '-' + avg_discount_per_month['month'].astype(str), avg_discount_per_month['discount_percentage'], label='Average Discount (Monthly)', marker='o')

plt.title('Average Discounts Over the Months')
plt.xlabel('Year-Month')
plt.ylabel('Average Discount')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)

plt.show()

Output :

Interpretation :

Average discounts vary over time with some months seeing higher discounts than others.
Average discounts are highest in December and January. This is likely due to the holiday season and the annual Flipkart Big Billion Days sale.
Average discounts are lowest in April and May. This is likely due to the beginning of the summer season and the end of the financial year.

Purchase activity by the hour of the day

df['Hour'].count()

Output :

Insights into the hours of the day when customers are most active in making purchases.

df['Hour'].value_counts()

Output :

6     3314
0     2241
12    2109
5     1956
11    1580
19    1402
10    1119
8     1103
9     1050
18     751
13     435
7      375
14     369
4      360
3      294
2      278
23     195
22     178
16     176
15     171
17     167
21     140
20     132
1       27
Name: Hour, dtype: int64

Observation:

The dataset contains 19,922 purchase activities. The highest purchase activity occurs at hour 6 (3,314 purchases), followed by hour 0 (2,241). It's valuable for businesses to know when customers are most active for strategic planning.

# Extract the hour from the 'Time' column and create a new 'Hour' column

df['Hour'] = df['Time'].apply(lambda x: x.hour)

# Count the number of purchases for each hour

hourly_counts = df['Hour'].value_counts().sort_index()

# Plot the results

plt.bar(hourly_counts.index, hourly_counts.values)

plt.xlabel('Hour of the Day')
plt.ylabel('Number of Purchases')
plt.title('Purchase Activity by Hour of the Day')
plt.xticks(range(24))
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

Output :

Interpretation

Most purchases happen at 6 AM, indicating early morning shopping.
Midnight (Hour 0) sees significant shopping activity.
Lunchtime (Hour 12) is also busy.
Evening shopping peaks around 7 PM (Hour 19).
Late-night and early morning hours have fewer purchases.
Data provides insights into shopping patterns throughout the day, valuable for marketing and inventory management.

Discount Analysis

Analyzing discounts can provide insights into the impact on sales.

Sales Impact

Sales impact represents the additional revenue earned due to offering a discount.

Sales Impact = Original Revenue - Extra Revenue

Here Original Revenue is units were sold at the Retail price. And Extra Revenue is units were sold at the discounted price.

Formula :

Sales Impact = (P - Pd) * D

P = Original retail price
Pd = Discounted retail price
D = Discount percentage as a proportion (between 0 and 1)

df['sales_impact'] = (df['retail_price'] - df['discounted_price']) * df['discount_percentage'] / 100

df[['retail_price', 'discounted_price', 'sales_impact']][:5]

Output :

   retail  discount  sales_impact
0  999.0   379.0     384.784785
1  32157   22646.0   2813.046024
2  999.0   499.0     250.250250
3  699.0   267.0     266.987124
4  220.0   210.0     0.454545

df['sales_impact'].corr(df['discount_percentage'])

Output :

0.3167843236618487

Interpretation:

Sales Impact has a moderate positive correlation with Discount Percentage suggesting that larger discount percentages lead to larger sales impacts.

Visualize the impact of discounts on sales

plt.scatter(df['discount_percentage'], df['sales_impact'], alpha=0.5)

plt.title('Impact of Discounts on Sales')
plt.xlabel('Discount Percentage')
plt.ylabel('Estimated Sales Impact')
plt.grid(True)

plt.show()

Output :

Observation

The chart shows a positive correlation between discount percentage and sales impact. This means that higher discounts tend to lead to higher sales impact.
However, the relationship is not linear. This means that the increase in sales impact slows down as the discount percentage increases.
The highest sales impact occurred at a discount percentage of around 60%
Average sales impact was around ?300
Sales impact varies widely for each discount percentage, indicating that factors beyond discounts, like product category and market conditions, influence sales impact.

Product Description Analysis

Analyze product descriptions for sales and ratings impact.

Word Frequency Analysis

Find common words in descriptions.

Text Processing

Tokenize product descriptions to break them into individual words or phrases.
Remove any stopwords (common words like "the," "and," "in," etc.) from the text as they may not carry much meaning.
Perform stemming to reduce words to their base form.

from collections import Counter

# nltk.download('stopwords')
# nltk.download('punkt')

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

df['description'].head(2)

Output :

0    Key Features of Alisha...
1    FabHomeDecor Fabric...
Name: description, dtype: object

descriptions = df['description'].fillna('').str.split().apply(lambda x: [word.lower() for word in x])

Sample Output :

0        [key, features, of, alisha...
1        [fabhomedecor, fabric...

Name: description, Length: 19922, dtype: object

descriptions_str = [" ".join(description) for description in descriptions]

text = " ".join(descriptions_str)

Tokenize the text into individual words

words = word_tokenize(text)

Remove stopwords

# Remove stopwords

stop_words = set(stopwords.words("english"))

filtered_words = [word for word in words if word.lower() not in stop_words]

Initialize a stemmer.

stemmer = PorterStemmer()

Stem the words.

stemmed_words = [stemmer.stem(word) for word in filtered_words]

Check some most common words.

Counter(stemmed_words).most_common()[:10]

Output :

[('.', 82646),
 (',', 43309),
 (':', 18682),
 ('rs', 15083),
 ("'s", 14538),
 ('-', 13099),
 ('product', 12014),
 ('buy', 11436),
 ('women', 10630),
 ('flipkart.com', 10117)]

More processing of words of descriptions

import string

To remove the common punctuation and numeric symbols from your list of words

# Remove common punctuation and numeric symbols

filtered_words = [word for word in stemmed_words if word not in string.punctuation and not word.isdigit()]

Counter(filtered_words).most_common()[:10]

Output :

[('rs', 15083),
 ("'s", 14538),
 ('product', 12014),
 ('buy', 11436),
 ('women', 10630),
 ('flipkart.com', 10117),
 ('onlin', 9895),
 ('price', 8944),
 ('free', 8500),
 ('day', 8338)]

More reduction of words

# Custom list of words to exclude

words_to_exclude = [ "'s", "rs", "na", "k", "g", "x", "''", "l", "’", "``", "w", "?", "n't", "etc", "...", "3.5", "0.0", "--", "u", "flipkart.com",'product', 'buy', 'onlin', 'price',]

Remove common punctuation and numeric symbols.

filtered_words = [word for word in filtered_words if word not in string.punctuation and not word.isdigit()]

Remove custom excluded words.

filtered_words = [word for word in filtered_words if word not in words_to_exclude]

Count the occurrences of each word.

word_freq = Counter(filtered_words)

Counter(filtered_words).most_common()[:10]

Output :

[('women', 10630),
 ('free', 8500),
 ('day', 8338),
 ('replac', 8008),
 ('genuin', 7979),
 ('deliveri', 7890),
 ('ship', 7852),
 ('cash', 7842),
 ('guarante', 7543),
 ('featur', 7489)]

#  !pip install wordcloud

from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

plt.show()

Output :

Observation:

The word cloud suggests that the most common products sold on Flipkart are clothing and accessories for women. The words "free" and "day" suggest that these products are often offered with discounts or promotions. The words "replace" and "genuine" suggest that sellers are focused on providing quality products and customer satisfaction. The words "delivery" and "ship" suggest that Flipkart is a popular platform to purchase products quickly and conveniently.
The words "cash," "guarantee," and "feature" suggest that buyers are looking for products that are affordable, reliable, and have the features they need. The words "design" and "brand" suggest that buyers are also interested in purchasing stylish and well-known brands.
The words "color," "detail," and "type" suggest that buyers are looking for a wide variety of products to choose from. The words "specific" and "men" suggest that there is also a demand for products that are tailored to specific needs and demographics.
Overall, the word cloud suggests that Flipkart is a popular platform to purchase a wide variety of clothing and accessories for women. Buyers are looking for products that are affordable, reliable, stylish, and of high quality.

Sentiment Analysis on product descriptions

To identify the overall sentiment of the descriptions (positive, negative, or neutral)

Using pre-trained sentiment analysis models from textblob libraries

#pip install textblob

from textblob import TextBlob

sentiments = []

for description in filtered_words:
    analysis = TextBlob(description)
    if analysis.sentiment.polarity > 0:
        sentiments.append("Positive")
    elif analysis.sentiment.polarity<0:
        sentiments.append("Negative")
    else:
        sentiments.append("Neutral")

Counter(sentiments)

Output :

Counter({'Neutral': 818069, 'Positive': 56489, 'Negative': 20754})

Visualizing Sentiment Distribution

# Count the number of each sentiment

sentiment_counts = {
"Positive": sentiments.count("Positive"),
"Negative": sentiments.count("Negative"),
"Neutral": sentiments.count("Neutral")
}

# Create a bar chart

plt.bar(sentiment_counts.keys(), sentiment_counts.values())

plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution of Product Descriptions")

plt.show()

Output :

Interpretation

Majority of product descriptions on Flipkart are neutral or positive.
The vast majority of product descriptions are neutral meaning that they do not express a strong positive or negative sentiment.
The remaining product descriptions are divided between positive and negative. This suggests that buyers are generally satisfied with the products they purchase on Flipkart, but there is still some room for improvement.

Inferences and Conclusion from Flipkart Sales Analysis:

Inferences:

Clothing, Jewellery, Footwear are the top selling categories - this can inform inventory planning and marketing focus areas.
Discounts are effective at driving sales, with higher discounts linked to higher sales impact. Optimal discount percentage seems to be around 60%.
Purchase activity peaks in the mornings and evenings - operations can be optimized for these times.
Brands like Allure Auto and Voylla are popular - partnership and co-marketing opportunities exist.
Most products don't have ratings - better review collection can improve product satisfaction.
Average discounts are lowest in April and May, likely due to the beginning of the summer season and the end of the financial year.
The majority of product descriptions on Flipkart are neutral or positive.

Conclusions:

Focus marketing and inventory on top categories like Clothing, Jewellery, and Footwear to drive more sales.
Offer strategic discounts around 60% to boost sales during low demand periods. Avoid over-discounting products.
Align staffing to match purchase patterns with more staff during mornings and evenings.
Pursue partnerships with leading brands like Allure Auto and Voylla to do co-marketing campaigns.
Improve review collection across products to better understand customer satisfaction and improve product quality.

Overall the analysis provides data-driven insights to optimize operations, marketing, promotions, and the customer experience. Leveraging these insights can help Flipkart achieve the goal of increasing sales and enhancing customer satisfaction.

Recommendations

Flipkart can increase sales by focusing on marketing and promotions during the holiday season.
Flipkart can offer more competitive prices by tracking the prices of competitors and offering discounts on popular products.
Flipkart can improve customer satisfaction by expanding its customer service hours and providing more responsive support.
Flipkart can increase sales by expanding its product selection to include more niche and specialty items.
Flipkart can improve customer satisfaction by providing more detailed product descriptions and images.

Thank you,

Abhishek Mishra

要查看或添加评论，请登录

Abhishek Mishra的更多文章

MySQL & Power BI Project: A Descriptive Analysis of Loan Performance for Portfolio Optimization

2024年5月20日

MySQL & Power BI Project: A Descriptive Analysis of Loan Performance for Portfolio Optimization

Analysis Visuals: INTRODUCTION This analysis uncovers bank loan data to identify factors impacting profitability &…

2 条评论
Excel Project - Decoding Tesla's Decade: An In-Depth Financial Analysis (2009-2023)

2024年2月15日

Excel Project - Decoding Tesla's Decade: An In-Depth Financial Analysis (2009-2023)

Dashboards: Introduction Tesla, the EV pioneer, has disrupted the industry with bold innovation and rapid growth. This…
From Tweets to Trends: An In-depth Excel Analysis of Twitter Activity

2024年1月14日

From Tweets to Trends: An In-depth Excel Analysis of Twitter Activity

Summary Dashboard : Detailed Analysis Reports Engagement Metrics : Klout Metrics : Sentiment Analysis : Temporal…
Excel to the Rescue: Data Analysis Project Unlocking Guest Retention Secrets Through Hotel Booking Cancellation Analysis

2023年12月26日

Excel to the Rescue: Data Analysis Project Unlocking Guest Retention Secrets Through Hotel Booking Cancellation Analysis

Dashboard: Introduction Cancellations are an inevitable reality in the hospitality industry, impacting revenue and…
Python Practice Project : Netflix Stock Data Analysis | Investing Insights | Patterns | Trends | Forecasting

2023年10月9日

Python Practice Project : Netflix Stock Data Analysis | Investing Insights | Patterns | Trends | Forecasting

Exploring Netflix's stock journey from a streaming pioneer to a stock market titan. Objective: Explore Netflix's stock…

1 条评论
Python Practice Project: IPL 2022 Cricket Sports Data Analysis | Unveiling Insights | Statistical Breakdown Beyond the Scoreboard

2023年9月24日

Python Practice Project: IPL 2022 Cricket Sports Data Analysis | Unveiling Insights | Statistical Breakdown Beyond the Scoreboard

Introduction: IPL is a professional Twenty20 cricket league, launched by BCCI in 2008, has 10 teams with brand value in…

6 条评论
Python Practice Project: iPhone Sales Data Analysis | Unlocking Insights

2023年9月18日

Python Practice Project: iPhone Sales Data Analysis | Unlocking Insights

Welcome to this Python practice project focused on iPhone sales analysis. We'll explore key statistics, identify…

4 条评论

See all articles

Objective :

Resources:

Libraries :

Pre-processing the data

Discount Percentage

Main Category and Subcategory

Summary Statistics

Categorical Column Analysis

Count of unique values in categorical columns

Product Categories and Subcategories:

Identify the top-selling main categories

Brand Analysis

Top Brands

Discount Analysis by Brand

Brand Rating Analysis

Pricing Analysis

Price Range Overview

Price Distribution Visualization

Discount Analysis

Price Trends Over Time

领英推荐

Product Ratings

Product Rating Overview

Average Product Ratings by Category

Five-Star Rating Analysis

5-star Products vs. Total Products

Time Analysis

Time Trends Overview

Daily & Monthly Sales Trends Analysis

Time-Based Analysis of Discounts

Purchase activity by the hour of the day

Discount Analysis

Sales Impact

Product Description Analysis

Word Frequency Analysis

Sentiment Analysis on product descriptions

Inferences and Conclusion from Flipkart Sales Analysis:

Recommendations

Abhishek Mishra的更多文章

MySQL & Power BI Project: A Descriptive Analysis of Loan Performance for Portfolio Optimization

Excel Project - Decoding Tesla's Decade: An In-Depth Financial Analysis (2009-2023)

From Tweets to Trends: An In-depth Excel Analysis of Twitter Activity

Excel to the Rescue: Data Analysis Project Unlocking Guest Retention Secrets Through Hotel Booking Cancellation Analysis

Python Practice Project : Netflix Stock Data Analysis | Investing Insights | Patterns | Trends | Forecasting

Python Practice Project: IPL 2022 Cricket Sports Data Analysis | Unveiling Insights | Statistical Breakdown Beyond the Scoreboard

Python Practice Project: iPhone Sales Data Analysis | Unlocking Insights

社区洞察

其他会员也浏览了

Complete Roadmap to Learn Data Science in 2 Months

Bamboolib - an Auto EDA library

65 Best Resources to Learn Data Analysis

9 Best Data Science Roles: A role fit for you

Learn R Language For Data Analysis: Why Is R Useful For Data Science?

Exploring Data with Pandas: Essential EDA Techniques for Data Science

PySpark GroupBy Guide: Super Simple Way to Group Data

Top 8 Data Analytics Tools and Platforms for 2023 | Learning Resources & Real-World Examples

Know how Pandas Profiling makes data exploration easier and more effective.

The Journey to Becoming a Data Analyst: A Roadmap from Basics to Mastery