Python Practice Project : eCommerce Insights from Flipkart's Sales Data | Descriptive Analytics | Unlocking Revenue Growth Opportunities
Abhishek Mishra
Data Analyst @ Genpact | Ex TCS | Power BI | Python | SQL | Advance Excel
Objective :
To uncover business insights from the Flipkart sales data that could inform marketing strategies, inventory planning, and product development for increasing sales and enhancing customer experience.
Key business questions we aim to address:
Ultimately, the aim is data-driven decisions for better marketing, products, operations, and the customer experience. The analysis reveals patterns for these goals.
Resources:
Libraries :
Pre-processing the data
Import the necessary libraries.
import pandas as pd
import plotly.express as px
from plotly import graph_objects as go
import matplotlib.pyplot as plt
Load the 20,000 purchase dataset.
df = pd.read_csv("flipkart_com-ecommerce_sample.csv")
df.shape
Output :
(20000, 15)
Interpretation : dataset contains information for 20k purchases record with 15 columns attributes
df.columns
Output :
Index(['uniq_id', 'crawl_timestamp', 'product_url', 'product_name', 'product_category_tree', 'pid', 'retail_price', 'discounted_price', 'image', 'is_FK_Advantage_product', 'description', 'product_rating', 'overall_rating', 'brand', 'product_specifications'], dtype='object')
Key columns for analysis:
Checking data types of columns
df.dtypes
Output :
uniq_id object
crawl_timestamp object
product_url object
product_name object
product_category_tree object
pid object
retail_price float64
discounted_price float64
image object
is_FK_Advantage_product bool
description object
product_rating object
overall_rating object
brand object
product_specifications object
dtype: object
Observation: Most columns are of object type, but some need conversion, like crawl_timestamp, product_rating, and overall_rating.
df['crawl_timestamp'].head(3)
Output :
0 2016-03-25 22:59:23 +0000
1 2016-03-25 22:59:23 +0000
2 2016-03-25 22:59:23 +0000
Name: crawl_timestamp, dtype: object
converting crawl_timestamp into datetime to extract date and time easily.
df['timestamp']=pd.to_datetime(df['crawl_timestamp'])
Sample Output :
0 2016-03-25 22:59:23+00:00
1 2016-03-25 22:59:23+00:00
2 2016-03-25 22:59:23+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
extracting 'date' and 'time' from timestamp
df['date'] = pd.to_datetime(df['timestamp'].dt.date)
df['time'] = df['timestamp'].dt.time
Sample Output :
dropping the original Date column
df.drop(['crawl_timestamp'], axis = 1,inplace=True)
Check for missing values.
df.isnull().sum()
Output :
uniq_id 0
product_url 0
product_name 0
product_category_tree 0
pid 0
retail_price 78
discounted_price 78
image 3
is_FK_Advantage_product 0
description 2
product_rating 0
overall_rating 0
brand 5864
product_specifications 14
timestamp 0
date 0
time 0
dtype: int64
Observation :
We will address the missing values of retail_price and discounted_price.
Let's identify products with missing prices.
df[['product_name','date']][pd.isna(df['retail_price'])].head(10)
Output :
Observation : We have a diverse range of products with missing prices. Let's investigate using the median price to fill these null values.
Check the median prices.
print(df["retail_price"].median())
print(df["discounted_price"].median())
Output :
1040.0
550.0
Observation : Assigning median prices to all missing price products doesn't appear suitable due to the wide range of product categories.
Let's examine the dates to which these records with missing prices belong.
df['date'][pd.isna(df['retail_price'])].value_counts()
Output :
date
2015-12-30 16
2015-12-01 14
2015-12-31 12
2016-01-07 10
2016-03-25 4
2016-01-01 4
2015-12-29 4
2016-06-10 2
2016-03-07 1
2015-12-20 1
2016-04-29 1
2016-02-28 1
2016-04-16 1
2016-04-30 1
2016-04-28 1
2016-05-05 1
2016-05-07 1
2016-06-01 1
2016-01-06 1
2016-05-22 1
Name: count, dtype: int64
Observation : Dates are not concentraded to to specic months, indicating a wide range of dates. This means it is safe to delete those missing prices record.
df = df.dropna(subset=['retail_price', 'discounted_price'])
df.shape
Output :
(19922, 17)
Discount Percentage
Adding "discount_percentage" column for valuable insights.
Formula :
Discount = Retail Price - Discounted Price
Discount % = Discount / Retail Price * 100
df = df.assign(discount_percentage=((df['retail_price'] - df['discounted_price']) / df['retail_price']) * 100)
df[['discount_percentage','retail_price','discounted_price']].head()
Output :
Main Category and Subcategory
We have a product category column, but it's hierarchically structured. To make our analysis easier, we'll extract and separate the main category and subcategory.
# Sample record of column 'product_category_tree'
print(df.at[0, 'product_category_tree'])
Output :
["Clothing >> Women's Clothing >> Lingerie, Sleep & Swimwear >> Shorts >> Alisha Shorts >> Alisha Solid Women's Cycling Shorts"]
Generate 'main_category' from 'product_category_tree'
df['main_category'] = df['product_category_tree'].str.extract(r'^\["(.*?)(?: >> |"]\])')
Extract the subcategory
df['subcategory'] = df['product_category_tree'].str.extract(r' >> (.*?)(?: >> |"]\])')
df[['main_category','subcategory']].head()
Output :
main_category subcategory
0 Clothing Women's Clothing
1 Furniture Furniture
2 Footwear Women's Footwear
3 Clothing Women's Clothing
4 Pet Supplies Grooming
Data preprocessing is done; now we can start the analysis.
Summary Statistics
Calculate numerical data statistics.
df.describe()
Output :
Observation:
Categorical Column Analysis
Count of unique values in categorical columns
df['main_category'].nunique()
Output:
32
Interpretation: There are 32 unique main categories in the dataset.
df['subcategory'].nunique()
Output:
178
Interpretation: There are 178 unique Subcategories.
df['product_rating'].nunique()
Output :
36
Interpretation: There are 36 unique product ratings between 1 and 5.
df['brand'].nunique()
Output :
3485
Interpretation: There are 3,485 unique brands in the dataset
Product Categories and Subcategories:
Top-Selling Categories and Subcategories
Identify the top-selling main categories.
# Top 10 Sold Items
TopSels = df[['main_category']].value_counts()[:10]
Output :
main_category
Clothing 6171
Jewellery 3522
Footwear 1225
Mobiles & Accessories 1097
Automotive 1010
Home Decor & Festive Needs 927
Beauty and Personal Care 709
Home Furnishing 700
Kitchen & Dining 645
Computers 573
Name: count, dtype: int64
Visualize top category products.
fig = go.Figure(data=[go.Pie(labels=TopSels.index, values=TopSels)])
fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0.02, 0, 0], hole=.3)
fig.update_layout(annotations=[dict(text='Product', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Bestselling Products")
fig.show()
Output :
Observation:
Identify the top-selling main categories
TopSubs = df['subcategory'].value_counts().head(10)
Output :
subcategory
Women's Clothing 3876
Men's Clothing 1771
Necklaces & Chains 1601
Accessories & Spare parts 923
Tablet Accessories 799
Women's Footwear 780
Bangles, Bracelets & Armlets 722
Wrist Watches 521
Kids' Clothing 520
Tools 399
Name: count, dtype: int64
Visualize product distribution in top main categories.
fig = go.Figure(data=[go.Pie(labels=TopSubs.index, values=TopSubs)])
fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0.02, 0, 0], hole=.3)
fig.update_layout(annotations=[dict(text='SubProd', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Bestselling Products Within Each Subcategory")
fig.show()
Output :
Observation:
Brand Analysis
Identify the unique brands in the dataset.
# Number of unique brands
unique_brands = df['brand'].nunique()
Output :
3485
Top Brands
List top-selling brands by product count.
# Top 10 Brands in Demand
TopBrands = df['brand'].value_counts()[:10]
Output :
brand
Allure Auto 468
Regular 308
Voylla 299
Slim 284
TheLostPuppy 229
Karatcraft 211
Black 167
White 155
DailyObjects 144
Speedwav 141
Name: count, dtype: int64
Visualize brand-wise product distribution.
fig = go.Figure(data=[go.Pie(labels=TopBrands.index, values=TopBrands)])
fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=15, pull=[0.05, 0, 0, 0], hole=.3)
fig.update_layout(annotations=[dict(text='Brands', x=0.50, y=0.5, font_size=20, showarrow=False)], title_text="Top 10 Popular Brands")
fig.show()
Output :
Observation:
Discount Analysis by Brand
Top and bottom discount brands.
# fetch top brands consistently offer high discounts
df.groupby('brand')['discount_percentage'].mean().nlargest(10)
Output :
brand
Rajcrafts 96.533333
Bling 94.548458
Instella 91.719745
Bond Beatz 91.596639
KazamaKraft 90.565618
Mydress Mystyle 90.518987
CUBA 90.045023
SDZ 90.045023
Kaizer Jewelry 89.616918
Fash Blush 89.256770
Name: discount_percentage, dtype: float64
Observation:
# Lowest discount giving brands
df.groupby('brand')['discount_percentage'].mean().nsmallest(10)
Output:
brand
720 Armour 0.0
A Bit of Me 0.0
AMZER 0.0
ANASAZI 0.0
ARISE 0.0
ASIAN 0.0
ATHENA 0.0
AUROSHIKHA 0.0
AYESHA 0.0
Aaina Home Decor 0.0
Name: discount_percentage, dtype: float64
Brand Rating Analysis
Obtain unique product ratings.
df['product_rating'].unique()
Output:
array([nan, 5. , 3. , 3.6, 3.5, 4.4, 4. , 1. , 2.3, 4.8, 2.7, 4.5, 2.4, 3.2, 4.7, 2. , 3.3, 4.3, 3.8, 4.2, 3.7, 3.9, 2.5, 3.1, 4.1, 3.4, 4.6, 1.5, 2.8, 2.9, 4.9, 2.2, 2.6, 1.3, 1.7, 1.8])
Count Product with Ratings available
len(df[df['product_rating']!='No rating available'])
Output :
1839
Observation: There are only 1,839 products available with ratings out of 19k purchases.
Find best and least-rated brands
Convert product ratings to a numeric format.
# Convert product ratings to a numeric format
df['product_rating'] = pd.to_numeric(df['overall_rating'], errors='coerce')
df['product_rating'].unique()
Output :
array([nan, 5. , 3. , 3.6, 3.5, 4.4, 4. , 1. , 2.3, 4.8, 2.7, 4.5, 2.4, 3.2, 4.7, 2. , 3.3, 4.3, 3.8, 4.2, 3.7, 3.9, 2.5, 3.1, 4.1, 3.4, 4.6, 1.5, 2.8, 2.9, 4.9, 2.2, 2.6, 1.3, 1.7, 1.8])
Top brands by average product ratings.
# Find the top 5 brands with the highest average product ratings.
df.groupby('brand')['product_rating'].mean().nlargest(5)
Output :
brand
ASIAN 5.0
Aarti Collections 5.0
Alda 5.0
Ambitione 5.0
Aqua 5.0
Name: product_rating, dtype: float64
Lowest rated brands
df.groupby('brand')['product_rating'].mean().nsmallest(5)
Output :
brand
Adidas 1.0
Bestech 1.0
Binatone 1.0
D-LINK 1.0
Darkpink 1.0
Name: product_rating, dtype: float64
Pricing Analysis
Price insights: range, discounts, trends.
Price Range Overview
Price stats: retail vs. discount.
df[['retail_price', 'discounted_price']].describe()
Output :
Observation:
Price Distribution Visualization
Visualize retail vs. discount with histograms.
# Create histograms for retail prices and discounted prices
plt.subplot(1, 2, 1)
df['retail_price'].plot.hist(bins=20, title='Retail Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
df['discounted_price'].plot.hist(bins=20, title='Discounted Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
Output:
Observation:
Discount Analysis
Find average discount percentage.
df['discount_percentage'].mean()
Output :
40.523885297513765
Find categories with consistent high and low discounts.
df.groupby('main_category')['discount_percentage'].mean().nlargest()
Output :
main_category
Sunglasses 59.578151
Wearable Smart Devices 57.281300
Automotive 54.650991
Mobiles & Accessories 50.994470
Gaming 48.522793
Name: discount_percentage, dtype: float64
df.groupby('main_category')['discount_percentage'].mean().nsmallest().sort_values(ascending=False)
Output :
main_category
Cameras & Accessories 19.441131
Automation & Robotics 14.995750
Food & Nutrition 11.207317
eBooks 8.315594
Household Supplies 0.000000
Name: discount_percentage, dtype: float64
Observation:
Price Trends Over Time
Calculate average retail and discounted prices over time.
price_trends = df.groupby('date')[['retail_price','discounted_price' ]].mean()
Sample Output :
date retail_price discounted_price
2015-12-01 1563.02 821.99
2015-12-03 1374.77 882.92
2015-12-04 6936.83 5958.81
price_trends['retail_price'].mean()
Output :
2073.1324195725047
price_trends['discounted_price'].mean()
Output :
1253.5775599556516
price_trends['retail_price'].std()
Output :
领英推荐
1593.0146550077625
price_trends['discounted_price'].std()
Output :
1240.123053769857
Observation:
Plot price trends over time
plt.plot(price_trends.index, price_trends['retail_price'], label='Retail Price', marker='o')
plt.plot(price_trends.index, price_trends['discounted_price'], label='Discounted Price', marker='o')
plt.title('Price Trends Over Time')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True)
plt.show()
Output :
Observation :
Product Ratings
It provides insights on satisfaction and quality.
Product Rating Overview
Summary statistics for product ratings
rating_stats = df['product_rating'].describe()
Output :
count 1839.000000
mean 3.810332
std 1.260124
min 1.000000
25% 3.000000
50% 4.000000
75% 5.000000
max 5.000000
Name: product_rating, dtype: float64
Visualize the distribution of product ratings
df['product_rating'].plot.hist(bins=20, title='Product Rating Distribution')
plt.xlabel('Product Rating')
plt.ylabel('Count')
plt.show()
Output :
Observation
Average Product Ratings by Category
df.groupby('main_category')['product_rating'].mean().nlargest()
Output :
main_category
Home Entertainment 5
Sunglasses 5
eBooks 5
Home & Kitchen 4
Kitchen & Dining 4.37
Name: product_rating, dtype: float64
df.groupby('main_category')['product_rating'].mean().nsmallest().sort_values(ascending=False)
Output :
main_category
Beauty and Personal Care 3.468807
Bags, Wallets & Belts 3.393333
Tools & Hardware 3.376364
Furniture 2.833333
Home Improvement 2.250000
Name: product_rating, dtype: float64
Observations:
Five-Star Rating Analysis
Explore 5-star products and brands.
prod5stars = df['main_category'][df['product_rating'] == 5].value_counts()
brand5stars = df['brand'][df['product_rating'] == 5].value_counts()
# Create a DataFrame to combine the outputs
data = {
'Products Top 5-Star ': prod5stars.head().index,
'Products Lowest 5 Star ': prod5stars.tail().index,
'Brands Top 5 Stars': brand5stars.head().index,
'Brands Lowest 5-Star Rated ': brand5stars.tail().index
}
Top5Star = pd.DataFrame(data)
Top5Star
Output :
Observation:
5-star Products vs. Total Products
ProductCount = len(df)
Output :
19922
RatedTotal = len(df[df['product_rating'].notna()])
Output :
1839
Top5Stars = len(df[df['product_rating']==5])
Output :
618
Create a DataFrame for the funnel chart
funnel_data = pd.DataFrame({ 'stage': ['Total Products', 'Products with Rating', '5-Star Products'], 'Count': [ProductCount, RatedTotal, Top5Stars] })
Output :
stage Count
0 Total Products 19922
1 Products with Rating 1839
2 5-Star Products 618
Observation:
px.funnel(funnel_data, x='Count', y='stage', title='Product Rating Funnel')
Output :
Interpretation:
Time Analysis
It offers insights into sales trends and customer activity.
Time Trends Overview
Range and summary statistics of timestamps
df['timestamp'].describe(datetime_is_numeric=True)
Output :
Observation :
Explore the distribution of timestamps.
df['timestamp'].hist(bins=20, grid=False)
plt.title('Timestamp Distribution')
plt.xlabel('Timestamp of Sales')
plt.ylabel('Sales Count')
plt.xticks(rotation=45)
plt.show()
Output :
Observation :
Daily & Monthly Sales Trends Analysis
Analyze daily product and transaction trends.
# Group data by date and calculate the number of transactions per day
daily_sales_count = df.groupby('date').size()
Sample Output :
date
2015-12-01 4482
2015-12-03 117
2015-12-04 209
2015-12-06 101
2015-12-12 950
2015-12-13 759
2015-12-14 20
2015-12-15 28
2015-12-20 828
2015-12-29 911
dtype: int64
# Group data by month and calculate the number of transactions per month
monthly_sales_count = df.groupby(df['date'].dt.to_period('M')).size()
Output :
date
2015-12 10268
2016-01 4835
2016-02 244
2016-03 1629
2016-04 1133
2016-05 770
2016-06 1043
Freq: M, dtype: int64
Plot daily & Monthly trends
plt.plot(daily_sales_count.index, daily_sales_count, label='Daily Sales Trend', marker='o')
plt.plot(monthly_sales_count.index.to_timestamp(), monthly_sales_count, label='Monthly Sales Trend', marker='o')
plt.title('Daily & Monthly Trends in sales')
plt.xlabel('Date')
plt.ylabel('Sales Count')
plt.legend()
plt.grid(True)
plt.show()
Output :
Observation & Interpretation :
Time-Based Analysis of Discounts
Analyze how discounts vary over time.
# Calculate average discounts per year and month
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df[['year','month']].value_counts().reset_index()
Output :
year month count
0 2015 12 10268
1 2016 1 4835
2 2016 3 1629
3 2016 4 1133
4 2016 6 1043
5 2016 5 770
6 2016 2 244
Observation:
The distribution of data by year and month shows that the majority of the records are from December 2015 (approximately 10,268 records). The data spans to June 2016, but the distribution is not uniform. There is a noticeable drop in records from December 2015 to January 2016, and then a gradual decline in subsequent months.
Calculate average discounts for each month
avg_discount_per_month = df.groupby(['year', 'month'])['discount_percentage'].mean().reset_index()
Output :
year month discount_%
0 2015 12 41.807855
1 2016 1 37.543313
2 2016 2 37.897985
3 2016 3 40.811973
4 2016 4 37.566746
5 2016 5 41.658408
6 2016 6 44.239651
Observation:
Plot average discounts per month
# Plot average discounts per month
plt.plot(avg_discount_per_month['year'].astype(str) + '-' + avg_discount_per_month['month'].astype(str), avg_discount_per_month['discount_percentage'], label='Average Discount (Monthly)', marker='o')
plt.title('Average Discounts Over the Months')
plt.xlabel('Year-Month')
plt.ylabel('Average Discount')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.show()
Output :
Interpretation :
Purchase activity by the hour of the day
df['Hour'].count()
Output :
19922
Insights into the hours of the day when customers are most active in making purchases.
df['Hour'].value_counts()
Output :
6 3314
0 2241
12 2109
5 1956
11 1580
19 1402
10 1119
8 1103
9 1050
18 751
13 435
7 375
14 369
4 360
3 294
2 278
23 195
22 178
16 176
15 171
17 167
21 140
20 132
1 27
Name: Hour, dtype: int64
Observation:
# Extract the hour from the 'Time' column and create a new 'Hour' column
df['Hour'] = df['Time'].apply(lambda x: x.hour)
# Count the number of purchases for each hour
hourly_counts = df['Hour'].value_counts().sort_index()
# Plot the results
plt.bar(hourly_counts.index, hourly_counts.values)
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Purchases')
plt.title('Purchase Activity by Hour of the Day')
plt.xticks(range(24))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Output :
Interpretation
Discount Analysis
Analyzing discounts can provide insights into the impact on sales.
Sales Impact
Sales impact represents the additional revenue earned due to offering a discount.
Here Original Revenue is units were sold at the Retail price. And Extra Revenue is units were sold at the discounted price.
Formula :
Sales Impact = (P - Pd) * D
df['sales_impact'] = (df['retail_price'] - df['discounted_price']) * df['discount_percentage'] / 100
df[['retail_price', 'discounted_price', 'sales_impact']][:5]
Output :
retail discount sales_impact
0 999.0 379.0 384.784785
1 32157 22646.0 2813.046024
2 999.0 499.0 250.250250
3 699.0 267.0 266.987124
4 220.0 210.0 0.454545
df['sales_impact'].corr(df['discount_percentage'])
Output :
0.3167843236618487
Interpretation:
Visualize the impact of discounts on sales
plt.scatter(df['discount_percentage'], df['sales_impact'], alpha=0.5)
plt.title('Impact of Discounts on Sales')
plt.xlabel('Discount Percentage')
plt.ylabel('Estimated Sales Impact')
plt.grid(True)
plt.show()
Output :
Observation
Product Description Analysis
Analyze product descriptions for sales and ratings impact.
Word Frequency Analysis
Find common words in descriptions.
Text Processing
from collections import Counter
# nltk.download('stopwords')
# nltk.download('punkt')
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
df['description'].head(2)
Output :
0 Key Features of Alisha...
1 FabHomeDecor Fabric...
Name: description, dtype: object
descriptions = df['description'].fillna('').str.split().apply(lambda x: [word.lower() for word in x])
Sample Output :
0 [key, features, of, alisha...
1 [fabhomedecor, fabric...
Name: description, Length: 19922, dtype: object
descriptions_str = [" ".join(description) for description in descriptions]
text = " ".join(descriptions_str)
Tokenize the text into individual words
words = word_tokenize(text)
Remove stopwords
# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]
Initialize a stemmer.
stemmer = PorterStemmer()
Stem the words.
stemmed_words = [stemmer.stem(word) for word in filtered_words]
Check some most common words.
Counter(stemmed_words).most_common()[:10]
Output :
[('.', 82646),
(',', 43309),
(':', 18682),
('rs', 15083),
("'s", 14538),
('-', 13099),
('product', 12014),
('buy', 11436),
('women', 10630),
('flipkart.com', 10117)]
More processing of words of descriptions
import string
To remove the common punctuation and numeric symbols from your list of words
# Remove common punctuation and numeric symbols
filtered_words = [word for word in stemmed_words if word not in string.punctuation and not word.isdigit()]
Counter(filtered_words).most_common()[:10]
Output :
[('rs', 15083),
("'s", 14538),
('product', 12014),
('buy', 11436),
('women', 10630),
('flipkart.com', 10117),
('onlin', 9895),
('price', 8944),
('free', 8500),
('day', 8338)]
More reduction of words
# Custom list of words to exclude
words_to_exclude = [ "'s", "rs", "na", "k", "g", "x", "''", "l", "’", "``", "w", "?", "n't", "etc", "...", "3.5", "0.0", "--", "u", "flipkart.com",'product', 'buy', 'onlin', 'price',]
Remove common punctuation and numeric symbols.
filtered_words = [word for word in filtered_words if word not in string.punctuation and not word.isdigit()]
Remove custom excluded words.
filtered_words = [word for word in filtered_words if word not in words_to_exclude]
Count the occurrences of each word.
word_freq = Counter(filtered_words)
Counter(filtered_words).most_common()[:10]
Output :
[('women', 10630),
('free', 8500),
('day', 8338),
('replac', 8008),
('genuin', 7979),
('deliveri', 7890),
('ship', 7852),
('cash', 7842),
('guarante', 7543),
('featur', 7489)]
# !pip install wordcloud
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Output :
Observation:
Sentiment Analysis on product descriptions
To identify the overall sentiment of the descriptions (positive, negative, or neutral)
Using pre-trained sentiment analysis models from textblob libraries
#pip install textblob
from textblob import TextBlob
sentiments = []
for description in filtered_words:
analysis = TextBlob(description)
if analysis.sentiment.polarity > 0:
sentiments.append("Positive")
elif analysis.sentiment.polarity<0:
sentiments.append("Negative")
else:
sentiments.append("Neutral")
Counter(sentiments)
Output :
Counter({'Neutral': 818069, 'Positive': 56489, 'Negative': 20754})
Visualizing Sentiment Distribution
# Count the number of each sentiment
sentiment_counts = {
"Positive": sentiments.count("Positive"),
"Negative": sentiments.count("Negative"),
"Neutral": sentiments.count("Neutral")
}
# Create a bar chart
plt.bar(sentiment_counts.keys(), sentiment_counts.values())
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution of Product Descriptions")
plt.show()
Output :
Interpretation
Inferences and Conclusion from Flipkart Sales Analysis:
Inferences:
Conclusions:
Overall the analysis provides data-driven insights to optimize operations, marketing, promotions, and the customer experience. Leveraging these insights can help Flipkart achieve the goal of increasing sales and enhancing customer satisfaction.
Recommendations
Thank you,