Sentiment-Based Sales Optimization By NLP
VALARMATHI GANESSIN
Data Analyst| Microsoft PowerBI Data Analyst | Bachelors of Engineering in Computer Science| Maths Teacher
Sentiment-Based Sales Optimization
18.12.2024
Leveraging Customer Feedback to Enhance Satisfaction and Drive Sales
VALARMATHI GANESSIN
ENVISION VIRTUE
Online Data Analyst Internship
December 2024 Project
Sentiment-Based Sales Optimization
1. Introduction
In today's competitive e-commerce landscape, understanding customer sentiment and preferences is paramount for driving business success. Customer reviews provide a wealth of information that can be leveraged to gain insights into consumer satisfaction and behavior. The project titled "Sentiment-Based Sales Optimization" aims to analyze customer reviews of women's clothing from an e-commerce platform, utilizing data science and natural language processing techniques to extract actionable insights.
2. Project Goals and Objectives
Goals and Analysis Steps with NLP and Models
Trend Identification
Pattern Discovery
Demographic Insights
Sentiment Analysis
Feature Engineering
Predictive Modeling
Reporting Insights
Ethical Considerations
Objectives
The primary objective of the project "Sentiment-Based Sales Optimization" is to analyze customer reviews of women's clothing from an e-commerce platform using data science and natural language processing (NLP) techniques. The goal is to extract actionable insights that can enhance customer satisfaction and drive sales. Specifically, the project aims to:
Purpose
The purpose of this project is to leverage the rich data found in customer reviews to inform and improve e-commerce strategies. Customer reviews are a valuable source of feedback, providing direct insights into customer preferences, experiences, and satisfaction levels. By systematically analyzing these reviews, the project seeks to:
Without this analysis, critical insights from customer feedback would remain untapped, potentially leading to missed opportunities for improvement and growth. This project aims to bridge that gap, offering a data-driven approach to understanding and enhancing the customer experience in the retail sector.
3. Data Description
Data Source:
Data Format: Womens_Clothing CSV file
Attributes Used
Personal Information:
Feature Variables:
Target Variable:
Dependencies Used
Python Libraries
Additional Tools
Environment
Tools and Environment
Benefits of Using Google Colab for This Project
By leveraging these benefits, we were able to efficiently conduct our analysis, develop models, and document our findings.
Utilizing PyCharm for Sentiment-Based Sales Optimization Using Machine Learning and NLP
PyCharm, a powerful Integrated Development Environment (IDE) for Python, significantly enhances development efficiency in projects like "Sentiment-Based Sales Optimization" using machine learning and NLP. It provides a comprehensive suite of tools that streamline coding, testing, and debugging processes. PyCharm's intelligent code editor features code completion, real-time error detection, and code refactoring capabilities, which help developers write cleaner and more maintainable code. Additionally, its robust debugging tools allow for effective troubleshooting and performance optimization. The integrated version control system supports seamless collaboration, while the built-in support for various frameworks, including those used for machine learning and NLP, facilitates smoother project execution. By leveraging PyCharm's extensive functionality, developers can focus more on innovation and problem-solving, ultimately accelerating the development cycle and enhancing overall productivity.
4. Data Cleaning by Pandas Profiling
Benefits of Using Pandas for Data Cleaning:

Rationale for Choosing Pandas Profiling in Data Preprocessing
Pandas profiling helped "Sentiment-Based Sales Optimization" using machine learning and NLP by providing a comprehensive report on the Womens_Clothing.csv file, offering insights into data quality, identifying missing values, and summarizing statistical properties, which streamlined the data cleaning process.
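For reference, a profiling report of this kind can be generated with the ydata-profiling package (the successor to pandas-profiling). This is a minimal sketch, assuming the package is installed and the raw CSV is in the working directory; it is not the exact script used in the project:

import pandas as pd
from ydata_profiling import ProfileReport  # formerly pandas_profiling

df = pd.read_csv("Womens_Clothing.csv")

# Generate a full profiling report: data types, missing values, quantile and
# descriptive statistics, most frequent values, histograms, and correlations
profile = ProfileReport(df, title="Womens Clothing Profiling Report")
profile.to_file("Womens_Clothing_Profile.html")

The subsections below reproduce the same essentials, quantile, descriptive, frequency, histogram, correlation, and missing-value views with plain pandas scripts.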
1. Essentials: data types, unique values, and missing values (via a Python script).
Essentials_Pandas_Profiling.py
# Python script for finding data types, unique values, and missing values
import pandas as pd
# Load the CSV file into a DataFrame
file_path = "Womens_Clothing.csv"
df = pd.read_csv(file_path)
# Find the data types of each column
data_types = df.dtypes
print("Data Types:\n", data_types)
# Find the number of unique values in each column
unique_values = df.nunique()
print("\nUnique Values:\n", unique_values)
# Find the number of missing values in each column
missing_values = df.isnull().sum()
print("\nMissing Values:\n", missing_values)
OUTPUT
Data Types:
Unnamed: 0                   int64
Clothing ID                  int64
Age                          int64
Title                       object
Review Text                 object
Rating                       int64
Recommended IND              int64
Positive Feedback Count      int64
Division Name               object
Department Name             object
Class Name                  object
dtype: object

Unique Values:
Unnamed: 0                 23486
Clothing ID                 1206
Age                           77
Title                      13993
Review Text                22634
Rating                         5
Recommended IND                2
Positive Feedback Count       82
Division Name                  3
Department Name                6
Class Name                    20
dtype: int64

Missing Values:
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64
Process finished with exit code 0
Heat Map
Handling the missing values by Statistics.
Missing_Values_By_Statistics.py
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv('womensclothing.csv')
# Check for missing values
print("Missing values before handling:\n", df.isnull().sum())
# Handle missing values with statistical methods
df['Age'] = df['Age'].fillna(df['Age'].mean())
for col in ['Division Name', 'Department Name', 'Class Name']:
    df[col] = df[col].fillna(df[col].mode()[0])
df['Title'] = df['Title'].fillna(df['Title'].mode()[0])
df['Review Text'] = df['Review Text'].fillna(df['Review Text'].mode()[0])
# Verify the changes
print("Missing values after handling:\n", df.isnull().sum())
# Save the cleaned dataset (optional)
df.to_csv('womensclothing_cleaned.csv', index=False)
Output
Missing values before handling:
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Missing values after handling:
Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64
Process finished with exit code 0
Justification for Using Statistical Methods
Why Not Simply Drop Missing Values?
Specific Choices in the Dataset
By choosing these statistical methods, we ensure a comprehensive, unbiased, and consistent approach to handling missing values, which is essential for accurate and reliable analysis in the project.
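As a quick illustration of why simply dropping rows would be wasteful here, the sketch below (using the same raw file as the scripts above) compares the row count before and after dropna():

import pandas as pd

df = pd.read_csv("Womens_Clothing.csv")

# Dropping every row with any missing value discards all reviews without a Title,
# even though their Review Text, Rating, and demographics are still usable
print("Rows before dropna:", len(df))            # 23,486 rows in the raw file
print("Rows after dropna: ", len(df.dropna()))   # several thousand fewer rows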
2. Quantile statistics (minimum value, Q1, median, Q3, maximum, range, interquartile range)
Quantile_Analysis.py
# Quantile analysis
import pandas as pd
# Load the CSV file into a DataFrame
file_path = "Womens_Clothing.csv"
df = pd.read_csv(file_path)
# Function to calculate and display quantile statistics for all columns
def calculate_quantile_statistics(df):
    statistics = {}

    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            min_value = df[column].min()
            Q1 = df[column].quantile(0.25)
            median = df[column].median()
            Q3 = df[column].quantile(0.75)
            max_value = df[column].max()
            range_value = max_value - min_value
            IQR = Q3 - Q1

            statistics[column] = {
                'Minimum Value': min_value,
                'Q1 (25th percentile)': Q1,
                'Median (50th percentile)': median,
                'Q3 (75th percentile)': Q3,
                'Maximum Value': max_value,
                'Range': range_value,
                'Interquartile Range (IQR)': IQR
            }

    return statistics
# Calculate and display the quantile statistics for all numeric columns
quantile_statistics = calculate_quantile_statistics(df)
for column, stats in quantile_statistics.items():
    print(f"\nColumn: {column}")
    for stat_name, value in stats.items():
        print(f"{stat_name}: {value}")
# Optionally, you can save the statistics to a file
output_path = "Inter_Quantile_Output.csv"
pd.DataFrame(quantile_statistics).T.to_csv(output_path)
OUTPUT
Column: Unnamed: 0
Minimum Value: 0
Q1 (25th percentile): 5871.25
Median (50th percentile): 11742.5
Q3 (75th percentile): 17613.75
Maximum Value: 23485
Range: 23485
Inter-quantile Range (IQR): 11742.5
Column: Clothing ID
Minimum Value: 0
Q1 (25th percentile): 861.0
Median (50th percentile): 936.0
Q3 (75th percentile): 1078.0
Maximum Value: 1205
Range: 1205
Inter-quantile Range (IQR): 217.0
Column: Age
Minimum Value: 18
Q1 (25th percentile): 34.0
Median (50th percentile): 41.0
Q3 (75th percentile): 52.0
Maximum Value: 99
Range: 81
Inter-quantile Range (IQR): 18.0
Column: Rating
Minimum Value: 1
Q1 (25th percentile): 4.0
Median (50th percentile): 5.0
Q3 (75th percentile): 5.0
Maximum Value: 5
Range: 4
Inter-quantile Range (IQR): 1.0
Column: Recommended IND
Minimum Value: 0
Q1 (25th percentile): 1.0
Median (50th percentile): 1.0
Q3 (75th percentile): 1.0
Maximum Value: 1
Range: 1
Inter-quantile Range (IQR): 0.0
Column: Positive Feedback Count
Minimum Value: 0
Q1 (25th percentile): 0.0
Median (50th percentile): 1.0
Q3 (75th percentile): 3.0
Maximum Value: 122
Range: 122
Inter-quantile Range (IQR): 3.0
Process finished with exit code 0
Quantile output saved as a CSV file
3. Descriptive statistics (mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness)
Descriptive_Analysis.py
# Python script for descriptive analysis
import pandas as pd
import numpy as np
from scipy.stats import kurtosis, skew
# Load the CSV file into a DataFrame
file_path = "Womens_Clothing.csv"
df = pd.read_csv(file_path)
# Function to calculate descriptive statistics
def calculate_descriptive_statistics(df):
    statistics = {}

    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            mean_value = df[column].mean()
            mode_value = df[column].mode().iloc[0] if not df[column].mode().empty else np.nan
            std_dev = df[column].std()
            sum_value = df[column].sum()
            # Median absolute deviation computed explicitly (Series.mad() was removed in pandas 2.0)
            mad = (df[column] - df[column].median()).abs().median()
            coeff_var = std_dev / mean_value if mean_value != 0 else np.nan
            kurtosis_value = kurtosis(df[column], nan_policy='omit')
            skewness_value = skew(df[column], nan_policy='omit')
            statistics[column] = {
                'Mean': mean_value,
                'Mode': mode_value,
                'Standard Deviation': std_dev,
                'Sum': sum_value,
                'Median Absolute Deviation': mad,
                'Coefficient of Variation': coeff_var,
                'Kurtosis': kurtosis_value,
                'Skewness': skewness_value
            }

    return statistics
# Calculate and display the descriptive statistics for all numeric columns
descriptive_statistics = calculate_descriptive_statistics(df)
for column, stats in descriptive_statistics.items():
    print(f"\nColumn: {column}")
    for stat_name, value in stats.items():
        print(f"{stat_name}: {value}")
# Optionally, you can save the statistics to a file
output_path = "Descriptive_Analysis_Output.csv"
pd.DataFrame(descriptive_statistics).T.to_csv(output_path)
OUTPUT
Column: Unnamed: 0
Mean: 11742.5
Mode: 0
Standard Deviation: 6779.968547124684
Sum: 275784355
Median Absolute Deviation: 5871.5
Coefficient of Variation: 0.5773871447412974
Kurtosis: -1.2000000043510406
Skewness: 0.0
Column: Clothing ID
Mean: 918.1187090181385
Mode: 1078
Standard Deviation: 203.2989797220474
Sum: 21562936
Median Absolute Deviation: 107.0
Coefficient of Variation: 0.22142994988029482
Kurtosis: 5.180911656457766
Skewness: -2.087502669232274
Column: Age
Mean: 43.198543813335604
Mode: 39
Standard Deviation: 12.279543615591493
Sum: 1014561
Median Absolute Deviation: 8.0
Coefficient of Variation: 0.2842582765903497
Kurtosis: -0.11205236979496735
Skewness: 0.5255809358950211
Column: Rating
Mean: 4.196031678446734
Mode: 5
Standard Deviation: 1.1100307198243897
Sum: 98548
Median Absolute Deviation: 0.0
Coefficient of Variation: 0.2645429789117549
Kurtosis: 0.8037092866669644
Skewness: -1.3134452114806563
Column: Recommended IND
Mean: 0.8223622583666865
Mode: 1
Standard Deviation: 0.38221563891455684
Sum: 19314
Median Absolute Deviation: 0.0
Coefficient of Variation: 0.4647776998833635
Kurtosis: 0.8454434366260339
Skewness: -1.6868442241730668
Column: Positive Feedback Count
Mean: 2.535936302478072
Mode: 0
Standard Deviation: 5.7022015020339385
Sum: 59559
Median Absolute Deviation: 1.0
Coefficient of Variation: 2.248558647337415
Kurtosis: 71.67766116867968
Skewness: 6.472584305808119
Process finished with exit code 0
Descriptive analysis output saved as a CSV file
4. Most frequent values
Most_Frequent_Values.py
# Python script for finding the most frequent values in the given CSV file
import pandas as pd
# Load the CSV file into a DataFrame
file_path = "Womens_Clothing.csv"
df = pd.read_csv(file_path)
# Function to find most frequent values
def most_frequent_values(df):
    most_frequent = {}

    for column in df.columns:
        most_frequent_value = df[column].mode().iloc[0] if not df[column].mode().empty else None
        most_frequent[column] = most_frequent_value

    return most_frequent
# Calculate and display the most frequent values for each column
most_frequent = most_frequent_values(df)
for column, value in most_frequent.items():
    print(f"Column: {column}, Most Frequent Value: {value}")
# Optionally, you can save the results to a CSV file
output_path = "Most_Frequent_values_Output.csv"
pd.DataFrame(most_frequent.items(), columns=['Column', 'Most Frequent Value']).to_csv(output_path, index=False)
OUTPUT
Column: Unnamed: 0, Most Frequent Value: 0
Column: Clothing ID, Most Frequent Value: 1078
Column: Age, Most Frequent Value: 39
Column: Title, Most Frequent Value: Love it!
Column: Review Text, Most Frequent Value: Perfect fit and i've gotten so many compliments. i buy all my suits from here now!
Column: Rating, Most Frequent Value: 5
Column: Recommended IND, Most Frequent Value: 1
Column: Positive Feedback Count, Most Frequent Value: 0
Column: Division Name, Most Frequent Value: General
Column: Department Name, Most Frequent Value: Tops
Column: Class Name, Most Frequent Value: Dresses
Process finished with exit code 0
Most Frequent Values saved as a csv file
5. Histogram
#Python Script for Histogram Graph
import pandas as pd
import matplotlib.pyplot as plt
# Load the CSV file into a DataFrame
file_path = "Womens_Clothing.csv"
df = pd.read_csv(file_path)
# Plot histograms for all numeric columns
df.hist(figsize=(10, 8), bins=30, edgecolor='black')
# Set overall title for the histograms
plt.suptitle('Histograms of Numeric Columns', fontsize=16)
# Display the plot
plt.show()
Output
6. Correlations (highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices)
# Import the necessary library
import pandas as pd
# Define the file path to the dataset
file_path = "Womens_Clothing.csv"
# Load the dataset into a pandas DataFrame
df = pd.read_csv(file_path)
# Select only the numeric columns from the DataFrame
numeric_df = df.select_dtypes(include='number')
# Calculate the correlation matrices using different methods
pearson_corr = numeric_df.corr(method='pearson')    # Pearson correlation
spearman_corr = numeric_df.corr(method='spearman')  # Spearman correlation
kendall_corr = numeric_df.corr(method='kendall')    # Kendall correlation
# Function to highlight highly correlated values in the correlation matrix
def highlight_highly_correlated(corr_matrix, threshold=0.8):
    return corr_matrix.applymap(lambda x: 'background-color: yellow' if abs(x) >= threshold else '')
# Apply the highlighting function to the correlation matrices
highlighted_pearson = pearson_corr.style.apply(highlight_highly_correlated, threshold=0.8, axis=None)
highlighted_spearman = spearman_corr.style.apply(highlight_highly_correlated, threshold=0.8, axis=None)
highlighted_kendall = kendall_corr.style.apply(highlight_highly_correlated, threshold=0.8, axis=None)
# Save the correlation matrices to CSV files
pearson_corr.to_csv("pearson_correlation.csv")
spearman_corr.to_csv("spearman_correlation.csv")
kendall_corr.to_csv("kendall_correlation.csv")
# Save the highlighted correlation matrices to HTML files
highlighted_pearson.to_html("highlighted_pearson_correlation.html")
highlighted_spearman.to_html("highlighted_spearman_correlation.html")
highlighted_kendall.to_html("highlighted_kendall_correlation.html")
# Print a success message
print("Correlation matrices and highlighted versions saved successfully.")
OUTPUT
All three correlation matrices are saved as CSV as well as HTML files.
Correlation matrices and highlighted versions saved successfully.
Process finished with exit code 0
7. Missing values (matrix, count, heatmap, and dendrogram of missing values)
Missing_Values.py
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
# Define the file path to the dataset
file_path = "Womens_Clothing.csv"
# Load the dataset into a pandas DataFrame
df = pd.read_csv(file_path)
# Calculate the count of missing values in each column
missing_values_count = df.isnull().sum()
# Print the count of missing values for each column
print("Missing Values Count:\n", missing_values_count)
# Plot a heatmap to visualize the locations of missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Heatmap of Missing Values')
plt.show()
# Plot a matrix to visualize the missing values using missingno
msno.matrix(df, figsize=(12, 6))
plt.title('Missing Values Matrix')
plt.show()
# Plot a dendrogram to visualize the hierarchical clustering of the missing values
msno.dendrogram(df)
plt.title('Dendrogram of Missing Values')
plt.show()
# Save the count of missing values to a CSV file
missing_values_count.to_csv("missing_value_matrix_output.csv", header=["Missing Values"])
# Print a success message
print("Missing values analysis completed and visualized successfully.")
Output
Heat Map for Missing Values
Missing Values Count:
Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64
5. Data Modelling
Calculated Measures
Calculated measures used in this project:
1.Average Age = AVERAGE(Fact_Womens_Clothing[Age])
2.Average Age by Department =
AVERAGEX(
    SUMMARIZE(
        'Fact_Womens_Clothing',
        'Fact_Womens_Clothing'[Department Name],
        "AvgAge", AVERAGE('Fact_Womens_Clothing'[Age])
    ),
    [AvgAge]
)
3.Average Rating by Department =
AVERAGEX(
    SUMMARIZE(
        'Fact_Womens_Clothing',
        'Fact_Womens_Clothing'[Department Name],
        "AvgRating", AVERAGE('Fact_Womens_Clothing'[Rating])
    ),
    [AvgRating]
)
4.Average Rating by Division =
AVERAGEX(
    SUMMARIZE('Fact_Womens_Clothing', 'Fact_Womens_Clothing'[Division Name], "AvgRating", AVERAGE('Fact_Womens_Clothing'[Rating])),
    [AvgRating]
)
5.Average Review Length = AVERAGE(Fact_Womens_Clothing[Review Length])
6.Avg Recommendation Rate =
DIVIDE(
    COUNTAX(
        FILTER('Fact_Womens_Clothing', 'Fact_Womens_Clothing'[Recommended IND] = 1),
        'Fact_Womens_Clothing'[Review Text]
    ),
    COUNT('Fact_Womens_Clothing'[Review Text]),
    0
)
7.Customer Satisfaction Score = AVERAGE('Fact_Womens_Clothing'[Rating])
8.Max = 1
9.Negative Sentiment = CALCULATE(COUNTROWS('Fact_Womens_Clothing'), 'Fact_Womens_Clothing'[Sentiment] = "Negative")
12.Negative Sentiment Count by Department = CALCULATE(COUNTROWS('Fact_Womens_Clothing'), 'Fact_Womens_Clothing'[Sentiment Category] = "Negative")
13.Negative Sentiment Rate = DIVIDE([Negative Sentiment], [Total Reviews], 0)
14.Negative Sentiment Rate by Department = DIVIDE([Negative Sentiment Count by Department], [Total Reviews by Department], 0)
15.Neutral Sentiment = CALCULATE(COUNTROWS('Fact_Womens_Clothing'), 'Fact_Womens_Clothing'[Sentiment] = "Neutral")
16.Positive Feedback Count by Department =
COUNTROWS(
    FILTER(
        'Fact_Womens_Clothing',
        'Fact_Womens_Clothing'[Positive Feedback Count] > 0
    )
)
17.Positive Feedback Rate =
DIVIDE(
    COUNTROWS(FILTER('Fact_Womens_Clothing', 'Fact_Womens_Clothing'[Positive Feedback Count] > 0)),
    COUNTROWS('Fact_Womens_Clothing'),
    0
)
18.Positive Feedback Rate by Department =
DIVIDE(
    [Positive Feedback Count by Department],
    [Total Reviews by Department],
    0
)
19.Positive Sentiment = CALCULATE(COUNTROWS('Fact_Womens_Clothing'), 'Fact_Womens_Clothing'[Sentiment] = "Positive")
22.Positive Sentiment Count by Department = CALCULATE(COUNTROWS('Fact_Womens_Clothing'), 'Fact_Womens_Clothing'[Sentiment Category] = "Positive")
23.Positive Sentiment Rate = DIVIDE([Positive Sentiment], [Total Reviews], 0)
24.Positive Sentiment Rate by Department = DIVIDE([Positive Sentiment Count by Department], [Total Reviews by Department], 0)
25.Recommendation Rate =
DIVIDE(
    COUNTROWS(FILTER('Fact_Womens_Clothing', 'Fact_Womens_Clothing'[Recommended IND] = 1)),
    COUNTROWS('Fact_Womens_Clothing'),
    0
)
26.Reviews by Division = COUNTROWS('Fact_Womens_Clothing')
27.Top Reviewed Division =
VAR DivisionReviewCounts =
    SUMMARIZE(
        'Fact_Womens_Clothing',
        'Fact_Womens_Clothing'[Division Name],
        "Review Count", COUNTROWS('Fact_Womens_Clothing')
    )
VAR TopDivision =
    TOPN(
        1,
        DivisionReviewCounts,
        [Review Count],
        DESC
    )
VAR TopDivisionName =
    MAXX(TopDivision, 'Fact_Womens_Clothing'[Division Name])
RETURN
    TopDivisionName
28.Total Reviews = COUNTROWS('Fact_Womens_Clothing')
29.Total Reviews by Department =
COUNTROWS('Fact_Womens_Clothing')
Calculated Columns
1.Age Group =
SWITCH(
    TRUE(),
    'Fact_Womens_Clothing'[Age] < 20, "Under 20",
    'Fact_Womens_Clothing'[Age] < 30, "20-29",
    'Fact_Womens_Clothing'[Age] < 40, "30-39",
    'Fact_Womens_Clothing'[Age] < 50, "40-49",
    'Fact_Womens_Clothing'[Age] < 60, "50-59",
    'Fact_Womens_Clothing'[Age] >= 60, "60+",
    "Unknown"
)
2.Negative_Reviews = IF(CONTAINSSTRING([Cleaned Review Text], "negative"), 1, 0)
3.Positive_Reviews = IF(CONTAINSSTRING([Cleaned Review Text], "positive"), 1, 0)
4.Review Length = LEN('Fact_Womens_Clothing'[Cleaned Review Text])
5.Sentiment Category =
SWITCH(
    TRUE(),
    [Sentiment Score] > 0, "Positive",
    [Sentiment Score] < 0, "Negative",
    [Sentiment Score] = 0, "Neutral",
    BLANK()
)
6.Word Count = LEN([Cleaned Review Text]) - LEN(SUBSTITUTE([Cleaned Review Text], " ", "")) + 1
Correlation Variables
11. Negative Sentiment and Sentiment Score correlation for Department Name =
VAR __CORRELATION_TABLE = VALUES('Fact_Womens_Clothing'[Department Name])
VAR __COUNT =
    COUNTX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(
            [Negative Sentiment] * SUM('Fact_Womens_Clothing'[Sentiment Score])
        )
    )
VAR __SUM_X =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE([Negative Sentiment])
    )
VAR __SUM_Y =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(SUM('Fact_Womens_Clothing'[Sentiment Score]))
    )
VAR __SUM_XY =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(
            [Negative Sentiment] * SUM('Fact_Womens_Clothing'[Sentiment Score]) * 1.
        )
    )
VAR __SUM_X2 =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE([Negative Sentiment] ^ 2)
    )
VAR __SUM_Y2 =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(SUM('Fact_Womens_Clothing'[Sentiment Score]) ^ 2)
    )
RETURN
    DIVIDE(
        __COUNT * __SUM_XY - __SUM_X * __SUM_Y * 1.,
        SQRT(
            (__COUNT * __SUM_X2 - __SUM_X ^ 2)
                * (__COUNT * __SUM_Y2 - __SUM_Y ^ 2)
        )
    )
21. Positive Sentiment and Sentiment Score correlation for Department Name =
VAR __CORRELATION_TABLE = VALUES('Fact_Womens_Clothing'[Department Name])
VAR __COUNT =
    COUNTX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(
            [Positive Sentiment] * SUM('Fact_Womens_Clothing'[Sentiment Score])
        )
    )
VAR __SUM_X =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE([Positive Sentiment])
    )
VAR __SUM_Y =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(SUM('Fact_Womens_Clothing'[Sentiment Score]))
    )
VAR __SUM_XY =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(
            [Positive Sentiment] * SUM('Fact_Womens_Clothing'[Sentiment Score]) * 1.
        )
    )
VAR __SUM_X2 =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE([Positive Sentiment] ^ 2)
    )
VAR __SUM_Y2 =
    SUMX(
        KEEPFILTERS(__CORRELATION_TABLE),
        CALCULATE(SUM('Fact_Womens_Clothing'[Sentiment Score]) ^ 2)
    )
RETURN
    DIVIDE(
        __COUNT * __SUM_XY - __SUM_X * __SUM_Y * 1.,
        SQRT(
            (__COUNT * __SUM_X2 - __SUM_X ^ 2)
                * (__COUNT * __SUM_Y2 - __SUM_Y ^ 2)
        )
    )
Data Modelling(Snowflake Model)
Dimension Tables
1.Dim_Classes
2.Dim_Department
3.Dim_Division
Fact Table
Fact_WomensClothing
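The snowflake model splits the descriptive attributes into the dimension tables listed above and keys them back to the fact table. The model itself was built in Power BI, but an equivalent split can be sketched in pandas as below; the surrogate-key column names (DivisionKey, DepartmentKey, ClassKey) are illustrative assumptions, not the report's actual key names:

import pandas as pd

df = pd.read_csv("Womens_Clothing_Cleaned_with_Sentiment.csv")

def build_dimension(df, column, key_name):
    """Build a dimension table with a surrogate key for one categorical column."""
    dim = df[[column]].drop_duplicates().reset_index(drop=True)
    dim[key_name] = dim.index + 1
    return dim[[key_name, column]]

dim_division = build_dimension(df, "Division Name", "DivisionKey")
dim_department = build_dimension(df, "Department Name", "DepartmentKey")
dim_classes = build_dimension(df, "Class Name", "ClassKey")

# Replace the text attributes in the fact table with foreign keys to the dimensions
fact = (df
        .merge(dim_division, on="Division Name")
        .merge(dim_department, on="Department Name")
        .merge(dim_classes, on="Class Name")
        .drop(columns=["Division Name", "Department Name", "Class Name"]))

print(fact.head())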
6. Exploratory Data Analysis (EDA)
Page 1: Sales Performance Home Page
Slicers:
Key Insights from Visualizations
Slicers for Age, Division Name, Department Name, and Class Name
Purpose:
Insight:
Table for Top 3 Departments with High Positive Feedback Rating
Purpose:
Insight:
Cards for Customer Satisfaction Score, Average Age, Positive Feedback Rate, Total Reviews, Neutral Sentiment, and Top Reviewed Division
Purpose:
Insights:
Pie Chart for Sentiment Distribution
Purpose:
Insight:
Column Chart for Reviews by Division Names
Purpose:
Insight:
Overall Key Insights
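These visuals were built in Power BI, but for reference the sentiment-distribution pie chart on this page can be approximated in Python from the cleaned file produced by the preprocessing script later in this report (the file name and the Sentiment column are taken from that script):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Womens_Clothing_Cleaned_with_Sentiment.csv")

# Count reviews per sentiment category and plot them as a pie chart
sentiment_counts = df["Sentiment"].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct="%1.1f%%", startangle=90)
plt.title("Sentiment Distribution of Reviews")
plt.show()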
Page 2: Customer Satisfaction Analysis (Rating Analysis)
Correlation Matrix: Age Group Performance and Division Rating
Purpose: To analyze the relationship between different age groups and the ratings given to various divisions. This helps in understanding how different age groups perceive products in different divisions.
Visualization: Correlation Matrix
Description:
Key Insights:
Clustered Bar Chart: Average Rating by Sentiment Category
Purpose: To visualize the average rating provided by customers, categorized by their sentiment (positive, neutral, negative). This helps to understand how sentiments affect ratings.
Visualization: Clustered Bar Chart
Description:
Key Insights:
Bar Chart: Average Rating by Rating Values
Purpose: To display the distribution of average ratings given by customers, categorized by rating values (1-5 stars). This helps in identifying the most common rating values and overall customer satisfaction.
Visualization: Bar Chart
Description:
Key Insights:
Summary of Insights and Recommendations
Overall Insights:
Recommendations:
?
Page 3: Sentiment Analysis
Key Gauge: Correlation Coefficient
Purpose: To display the correlation coefficient between positive sentiment, sentiment score, and department category, highlighting the strength of the relationship.
Visualization: Key Gauge
Description:
Key Insights:
Scatter Plot: Sentiment Score vs. Positive Sentiment by Department Category
Purpose: To visualize the relationship between sentiment score and positive sentiment across different department categories.
Visualization: Scatter Plot
Description:
Key Insights:
Scatter Plot: Sentiment Score vs. Negative Sentiment by Department Category
Purpose: To analyze the relationship between sentiment score and negative sentiment across different department categories.
Visualization: Scatter Plot
Description:
Key Insights:
Key Gauge: Correlation Coefficient by Negative Sentiment, Sentiment Score, and Department Category
Purpose: To display the correlation coefficient between negative sentiment, sentiment score, and department category.
Visualization: Key Gauge
Description:
Key Insights:
Summary of Insights and Recommendations
Overall Insights:
Recommendations:
Page 4: Demographic Analysis
Pie Chart: Sentiment Score Distribution
Purpose: To visualize the distribution of sentiment scores across all reviews, providing insights into the overall emotional tone of customer feedback.
Visualization: Pie Chart
Description:
Key Insights:
Pie Chart: Review Length Distribution
Purpose: To show the distribution of review length by age group, helping to understand how detailed the feedback is from different age ranges.
Visualization: Pie Chart
Description:
Key Insights:
Pie Chart: Positive Feedback Distribution
Purpose: To visualize the distribution of positive feedback across different age ranges, showing which age range gives the most positive feedback.
Visualization: Pie Chart
Description:
Key Insights:
Clustered Column Chart: Reviews by Age Group Category by Recommended IND
Purpose: To analyze the distribution of reviews across different age groups, categorized by whether the product was recommended or not.
Visualization: Clustered Column Chart
Description:
Key Insights:
Clustered Column Chart: Age Distribution Categorized by Sentiment Category
Purpose: To visualize the distribution of age groups across different sentiment categories (positive, neutral, negative).
Visualization: Clustered Column Chart
Description:
Key Insights:
Summary of Insights and Recommendations
Overall Insights:
Recommendations:
This documentation provides a comprehensive view of the demographic analysis, highlighting key insights and actionable recommendations based on customer feedback and sentiment analysis.
Page 5: Detailed Review Analysis
Clustered Column Chart: Average Rating by Division Categorized by Sentiment Category
Purpose: To analyze the average rating given to each division, categorized by sentiment (positive, neutral, negative). This helps in understanding the quality perception of each division.
Visualization: Clustered Column Chart
Description:
Key Insights:
Clustered Column Chart: Sum of Reviews by Division Categorized by Sentiment Category
Purpose: To visualize the total number of reviews received by each division, categorized by sentiment (positive, neutral, negative). This helps in understanding the volume and nature of feedback across different divisions.
Visualization: Clustered Column Chart
Description:
Key Insights:
Summary of Insights and Recommendations
Overall Insights:
Recommendations:
This documentation provides a detailed analysis of customer reviews and sentiment across different divisions, offering valuable insights and actionable recommendations for improving product quality and customer satisfaction.
Page 6: Review Text Analysis Word Cloud
Word Cloud: 100 Frequent Words from Positive Feedback
Purpose: To visualize the most frequently used words in positive feedback, providing insights into what customers appreciate the most.
Visualization: Word Cloud
Description:
Key Insights:
Word Cloud: 100 Frequent Words from Review Text
Purpose: To provide a general overview of the most frequently used words across all review texts, giving a holistic view of customer feedback.
Visualization: Word Cloud
Description:
Key Insights:
Word Cloud: 100 Frequent Words from Negative Feedback
Purpose: To visualize the most frequently used words in negative feedback, providing insights into areas where customers are dissatisfied.
Visualization: Word Cloud
Description:
Key Insights:
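The word clouds on this page were generated in Power BI; an equivalent can be sketched in Python with the third-party wordcloud package. A minimal example for the negative-feedback cloud, assuming the cleaned file and Sentiment column from the preprocessing script:

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("Womens_Clothing_Cleaned_with_Sentiment.csv")

# Join all negative reviews into one text blob and keep the 100 most frequent words
negative_text = " ".join(df.loc[df["Sentiment"] == "Negative", "Cleaned Review Text"].astype(str))
wc = WordCloud(width=800, height=400, max_words=100, background_color="white").generate(negative_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("100 Frequent Words from Negative Feedback")
plt.show()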
Summary of Insights and Recommendations
Overall Insights:
Recommendations:
This page provides a thorough analysis of review texts using word clouds, highlighting key themes and actionable insights to improve customer satisfaction and product quality.
Page 7: Key Insight Demographic Analysis
Purpose
The Key Insight Drillthrough Page allows for an in-depth analysis of customer satisfaction and recommendations based on age group and the recommended indicator. This page enables users to drill down into specific data segments for a clearer understanding of how different age groups perceive and recommend products.
Page 8: Key Findings
Summary from Power BI Report
These points highlight the comprehensive analysis and valuable insights derived from the Power BI report, helping to inform strategic decisions for enhancing customer satisfaction and driving business success.
7. Logistic Regression
Justification for Logistic Regression
1. Simplicity and Interpretability
2. Binary Classification Suitability
3. Computational Efficiency
4. Less Prone to Overfitting
5. Baseline Performance
Comparing with Other Algorithms
Decision Trees
Random Forest
Support Vector Machines (SVM)
Neural Networks
Project Attributes and Their Correlation
Model Performance Metrics
Logistic regression step-by-step algorithm used in this project:
1. Import Necessary Libraries
2. Load the Data
3. Create New Feature Columns
4. Select Features and Target Variable
5. One-Hot Encode Categorical Features
6. Check for Missing Values
7. Split Data into Training and Testing Sets
8. Set Up Logistic Regression with Grid Search
9. Train the Model
10. Make Predictions on the Test Data
11. Evaluate Model Performance
12. Generate and Plot a Confusion Matrix
13. Generate a Classification Report
14. Create a Bar Plot for TP, TN, FP, and FN
15. Combine Test Data with Predictions
16. Filter and Display Recommended Products
17. Print Unique Class and Division Names
18. Print Division-Class Mapping Table
By following these steps, the script successfully processes, analyzes, and predicts product recommendations based on various features, while providing valuable visual insights and detailed performance metrics.
Conclusion
Logistic regression is chosen for the "Sentiment-Based Sales Optimization" project due to its simplicity, interpretability, suitability for binary classification, computational efficiency, and strong baseline performance. It provides a clear starting point for analysis and can be easily explained to stakeholders, ensuring that the insights derived are both actionable and understandable.
Data Preprocessing and Data Annotation by NLP
Steps for Text Preprocessing and Analysis
Significance in NLP Context
These steps prepare the review text data effectively, making it suitable for further analysis and modeling in the "Sentiment-Based Sales Optimization" project.
DataPreprocessing.py
# Data preprocessing: remove special characters and stop words, perform tokenization, and use sentiment analysis to categorize reviews as positive, neutral, or negative
import pandas as pd
import re
from textblob import TextBlob
# Load the CSV file into a DataFrame
file_path = r"Womens_Clothing_Statistics_Updataed1.csv"
df = pd.read_csv(file_path)
# Define stop words
stop_words = set([
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
    'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
    'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
    'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
    'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
    'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both',
    'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
    'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
])
# Function to clean review text
def clean_review_text(text):
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase and split into words
    words = text.lower().split()
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    # Join the words back into a single string
    cleaned_text = ' '.join(words)
    return cleaned_text
# Clean the Review Text column
df['Cleaned Review Text'] = df['Review Text'].apply(lambda x: clean_review_text(str(x)))
# Perform sentiment analysis on the cleaned review text
df['Sentiment Score'] = df['Cleaned Review Text'].apply(lambda x: TextBlob(x).sentiment.polarity)
# Categorize sentiment as positive, neutral, or negative
def categorize_sentiment(score):
    if score > 0:
        return 'Positive'
    elif score < 0:
        return 'Negative'
    else:
        return 'Neutral'
df['Sentiment'] = df['Sentiment Score'].apply(lambda x: categorize_sentiment(x))
# Save the output to a new CSV file
output_file_path = r"Womens_Clothing_Cleaned_with_Sentiment.csv"
df.to_csv(output_file_path, index=False)
print(f"Processed data saved to {output_file_path}")
Output
Processed data saved to Womens_Clothing_Cleaned_with_Sentiment.csv
Process finished with exit code 0
Womens_Clothing_Cleaned.csv
Benefits of NLP in Sentiment-Based Sales Optimization
NLP plays a pivotal role in sentiment-based sales optimization by turning unstructured text data into meaningful insights. This empowers businesses to enhance customer satisfaction, refine product offerings, and ultimately drive sales growth.
Logistic Regression.py
#Logistic Regression to find the recommended products
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
# Load the CSV file into a DataFrame
file_path = r"Womens_Clothing_Statistics_Cleaned_Sentiment.csv"
df = pd.read_csv(file_path)
# Create new feature columns
df['Word Count'] = df['Review Text'].apply(lambda x: len(str(x).split()))
df['Sentiment Score'] = df['Review Text'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
# Select features and target variable
features = df[['Clothing ID', 'Age', 'Rating', 'Division Name', 'Class Name', 'Sentiment Score', 'Word Count', 'Department Name']]
target = df['Recommended IND']
# One-hot encode categorical features
features = pd.get_dummies(features, columns=['Division Name', 'Department Name', 'Class Name'], drop_first=True)
# Check for missing values
print("Missing values in features:")
print(features.isnull().sum())
print("\nMissing values in target:")
print(target.isnull().sum())
# Split data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)
# Show data split information
print(f"\nTraining data size: {X_train.shape[0]} samples")
print(f"Testing data size: {X_test.shape[0]} samples")
# Set up logistic regression with grid search for hyperparameter tuning
model = LogisticRegression(solver='liblinear')
param_grid = {
    'C': [0.1, 1, 10],      # Reduced grid for faster processing
    'penalty': ['l1', 'l2']
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='f1')  # Reduced number of CV folds
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# Predict
y_pred = best_model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plotting the confusion matrix with proper alignment, color range legend, and annotation
plt.figure(figsize=(8, 6))
ax = sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Recommended', 'Recommended'], yticklabels=['Not Recommended', 'Recommended'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
# Add labels for TP, TN, FP, FN
labels = [['True Negative (TN)', 'False Positive (FP)'], ['False Negative (FN)', 'True Positive (TP)']]
for i in range(2):
    for j in range(2):
        ax.text(j + 0.5, i + 0.5, f'{labels[i][j]}\n{cm[i][j]}',
                horizontalalignment='center',
                verticalalignment='center',
                color='white',
                fontsize=12,
                bbox=dict(facecolor='black', alpha=0.8, edgecolor='none', pad=1))
# Adding a color bar legend
cbar = ax.collections[0].colorbar
cbar.set_label('Count')
plt.show()
# Classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)
# Bar plot for TP, TN, FP, FN
cm_labels = ['True Negatives (TN)', 'False Positives (FP)', 'False Negatives (FN)', 'True Positives (TP)']
cm_counts = [cm[0][0], cm[0][1], cm[1][0], cm[1][1]]
# Creating a DataFrame for better handling
df_cm = pd.DataFrame({'Prediction Type': cm_labels, 'Count': cm_counts})
plt.figure(figsize=(10, 6))
barplot = sns.barplot(x='Prediction Type', y='Count', data=df_cm, palette='viridis')
# Adding data labels
for bar in barplot.patches:
    barplot.annotate(format(bar.get_height(), '.2f'),
                     (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                     ha='center', va='center',
                     size=12, xytext=(0, 8),
                     textcoords='offset points')
plt.title('Counts of True Positives, True Negatives, False Positives, and False Negatives')
plt.ylabel('Count')
plt.xlabel('Prediction Type')
plt.xticks(rotation=45)
# Manually adding the legend handles and labels
handles = [plt.Line2D([0], [0], color=color, lw=4) for color in sns.color_palette("viridis", n_colors=4)]
labels = cm_labels
plt.legend(handles, labels, title='Prediction Type', loc='upper left')
plt.show()
# Combine original test data with predictions
X_test_with_predictions = X_test.copy()
X_test_with_predictions['Actual'] = y_test
X_test_with_predictions['Predicted'] = y_pred
# Add back the original Division Name and Class Name for display purposes
original_data = df[['Clothing ID', 'Division Name', 'Class Name']]
X_test_with_predictions = X_test_with_predictions.merge(original_data, left_on=X_test.index, right_index=True, how='left')
# Filter for recommended products (Predicted == 1)
recommended_products = X_test_with_predictions[X_test_with_predictions['Predicted'] == 1]
# Select columns to display
recommended_products_display = recommended_products[[ 'Age', 'Rating', 'Division Name', 'Class Name']]
# Print details of recommended products
print("\nRecommended Products:")
print(recommended_products_display.to_markdown())
# Print unique class names and division names
unique_class_names = df['Class Name'].unique()
unique_division_names = df['Division Name'].unique()
# Print Division Name: Class Name list
division_class_mapping = df.groupby('Division Name')['Class Name'].unique().to_dict()
print("\nDivision Name: Class Name List")
for division, classes in division_class_mapping.items():
    print(f"{division}: {', '.join(classes)}")
OUTPUT
Missing values in features:
Clothing ID                     0
Age                             0
Rating                          0
Sentiment Score                 0
Word Count                      0
Division Name_General Petite    0
Division Name_Initmates         0
Department Name_Dresses         0
Department Name_Intimate        0
Department Name_Jackets         0
Department Name_Tops            0
Department Name_Trend           0
Class Name_Casual bottoms       0
Class Name_Chemises             0
Class Name_Dresses              0
Class Name_Fine gauge           0
Class Name_Intimates            0
Class Name_Jackets              0
Class Name_Jeans                0
Class Name_Knits                0
Class Name_Layering             0
Class Name_Legwear              0
Class Name_Lounge               0
Class Name_Outerwear            0
Class Name_Pants                0
Class Name_Shorts               0
Class Name_Skirts               0
Class Name_Sleep                0
Class Name_Sweaters             0
Class Name_Swim                 0
Class Name_Trend                0
dtype: int64
Explanation:
This part confirms that there are no missing values in the features used for the model. Each feature has zero missing values, indicating data completeness.
Missing values in target:
0
Explanation:
There are no missing values in the target variable (Recommended IND), which is necessary for accurate model training and evaluation.
Training data size: 17614 samples
Testing data size: 5872 samples
Explanation:
The dataset is split into 75% for training (17,614 samples) and 25% for testing (5,872 samples). This helps to evaluate the model on unseen data.
Best Parameters: {'C': 10, 'penalty': 'l1'}
The best model parameters identified by Grid Search are C=10 (indicating moderate regularization) and penalty='l1' (L1 regularization, which can result in sparse models).
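Because the L1 penalty can shrink some coefficients exactly to zero, one way to inspect what the tuned model actually relies on is to list its non-zero coefficients. A minimal sketch, assuming best_model and features from the script above are still in scope:

import pandas as pd

# Pair each feature name with its learned coefficient
coef = pd.Series(best_model.coef_[0], index=features.columns)

# Coefficients that survive the L1 penalty, sorted by absolute influence
nonzero = coef[coef != 0].sort_values(key=abs, ascending=False)
print(nonzero.head(10))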
Accuracy: 0.9303474114441417
Precision: 0.9676802041250265
Recall: 0.9465474209650583
F1 Score: 0.9569971611817895
Explanation:
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.86      0.82      1064
           1       0.97      0.95      0.96      4808

    accuracy                           0.93      5872
   macro avg       0.87      0.90      0.89      5872
weighted avg       0.93      0.93      0.93      5872
Explanation:
Division Name: Class Name List
General: Dresses, Blouses, Pants, Outerwear, Sweaters, Skirts, Knits, Fine gauge, Jackets, Trend, Jeans, Shorts, Casual bottoms
General Petite: Pants, Knits, Dresses, Blouses, Skirts, Fine gauge, Lounge, Jackets, Trend, Sweaters, Jeans, Outerwear
Initmates: Intimates, Lounge, Sleep, Swim, Legwear, Layering, Chemises
Process finished with exit code 0
Explanation:
8. Results and Findings
Data Integrity: No missing values in features and target.
9. Interpretation
Model Performance Summary
Best Model Parameters:
Overall Model Performance:
These metrics indicate that the model performs well, accurately predicting recommendations for women's clothing with high precision and recall.
Confusion Matrix

                              Predicted: Not Recommended (0)   Predicted: Recommended (1)
Actual: Not Recommended (0)   TN = 911                         FP = 153
Actual: Recommended (1)       FN = 256                         TP = 4552
Explanation:
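As a sanity check, the headline metrics can be recomputed directly from these four counts; the results closely match the values reported above (any small differences are rounding):

# Recompute headline metrics from the confusion-matrix counts reported above
tn, fp, fn, tp = 911, 153, 256, 4552

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of predicted "Recommended", how many were right
recall = tp / (tp + fn)               # of actual "Recommended", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # ~0.930
print(f"Precision: {precision:.4f}")  # ~0.967
print(f"Recall:    {recall:.4f}")     # ~0.947
print(f"F1 Score:  {f1:.4f}")         # ~0.957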
Classification Report

Class                  Precision   Recall   F1-Score   Support
0 (Not Recommended)    0.78        0.86     0.82       1064
1 (Recommended)        0.97        0.95     0.96       4808
Conclusion
The logistic regression model with parameters C = 10 and penalty = 'l1' performs robustly in predicting whether women's clothing items are recommended or not.
Key Insights:
Visualization Confirmation
Confusion Matrix:
Bar Plot for TP, TN, FP, FN:
These visualizations and performance metrics collectively demonstrate the model's effectiveness in recommending women's clothing items, offering valuable insights for product recommendation systems.
10. Next Steps (Future Analysis)
By following these steps, the model's predictive power can be enhanced, providing actionable insights to drive business growth and improve customer satisfaction.
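As one concrete example of such an enhancement, the review text itself could be fed to the model through TF-IDF features instead of (or alongside) the single TextBlob polarity score. A hedged sketch, reusing the cleaned file and column names from earlier; the max_features value and the reuse of C=10 with L1 regularization are illustrative choices, not tuned results:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("Womens_Clothing_Cleaned_with_Sentiment.csv")

# Turn the cleaned review text into TF-IDF features (top 5,000 terms)
vectorizer = TfidfVectorizer(max_features=5000)
X_text = vectorizer.fit_transform(df["Cleaned Review Text"].astype(str))
y = df["Recommended IND"]

X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.25, random_state=42)

model = LogisticRegression(solver="liblinear", C=10, penalty="l1")
model.fit(X_train, y_train)
print("F1 on text-only features:", f1_score(y_test, model.predict(X_test)))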
11. Conclusion
The Women's Clothing recommendation project successfully demonstrates the application of machine learning and natural language processing (NLP) to predict customer preferences and improve product recommendations. Utilizing a logistic regression model with features such as age, rating, sentiment score, word count, and categorical variables, the model achieved high accuracy and strong performance metrics.
Key Outcomes:
Division-Class Analysis:
NLP Integration:
Future Directions: The project "Sentiment-Based Sales Optimization" using machine learning and NLP lays a solid foundation for further enhancements, including integrating additional features like customer demographics, exploring advanced machine learning models, conducting feature importance analysis, and refining sentiment analysis on individual review aspects. Implementing these steps, along with continuous monitoring and retraining, will ensure the model remains relevant and effective in evolving market conditions.
Overall, "Sentiment-Based Sales Optimization" showcases the potential of machine learning and NLP to drive better business decisions and improve customer satisfaction in the retail sector.
12. Acknowledgments