Exploring Sentiment Analysis with Python: A Case Study
Dr. Fatma Ben Mesmia Chaabouni
Professor @ Nest Academy of Management @BIUC @American Imperial University| Ph.D. in CS | MSc_B.Sc. in CS| NLP-AI and Data Analytics- Blockchain researcher | MBA mentor| Tunisian AI Society Member
As part of my ongoing learning journey in Python, I recently completed a case study on Sentiment Analysis. This practical exercise provided invaluable insights into data preprocessing, feature extraction, model training, and evaluation. Below, I share the Python code and walk you through each part of the process, highlighting the importance of each step.
Sentiment Analysis
Sentiment analysis is a crucial task in natural language processing (NLP) that involves determining the sentiment or emotional tone behind a body of text. It is widely applied to customer feedback analysis, social media monitoring, and market research. For example, understanding the sentiment behind customer reviews can help businesses improve their products and services.
Loading and Preprocessing the Data
First, we load the dataset and perform initial preprocessing. The dataset used is yelp_review.csv, which contains Yelp reviews. You can download the dataset from Kaggle.
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
# Download the stopwords from nltk
nltk.download('stopwords')
# Define the set of stopwords
stop = set(stopwords.words('english'))
# Define the path to your CSV file
csv_file_path = 'yelp_review.csv'
# Read the CSV file using pandas
df = pd.read_csv(csv_file_path)
# Display the first few rows of the dataframe
print(df.head())
Cleaning the Text Data
Next, we define a function to clean the text data by removing URLs, special characters, and stopwords.
# Define a function to clean the document
def clean_document(doco):
    # Remove URLs
    doco = re.sub(r'http\S+', '', doco)
    # Remove '%' and any non-space characters that follow it
    doco = re.sub(r'%\S+', '', doco)
    # Remove '@' and any non-space characters that follow it (mentions)
    doco = re.sub(r'@\S+', '', doco)
    # Remove hyphens (joins hyphenated words)
    doco = doco.replace('-', '')
    # Replace runs of non-word characters (anything but letters, digits, underscore) with a space
    doco = re.sub(r'\W+', ' ', doco)
    # Convert to lowercase
    doco = doco.lower()
    # Split the document into words
    doco_words = doco.split()
    # Remove stopwords
    doco_words = [word for word in doco_words if word not in stop]
    # Remove words with repeated characters (more than 3 consecutive identical characters)
    p = re.compile(r'\b[a-z\d]*([a-z\d])\1{3,}[a-z\d]*\b', re.IGNORECASE)
    doco_words = [word for word in doco_words if not p.match(word)]
    return doco_words
# Ensure the 'text' and 'stars' columns are loaded correctly
if 'text' in df.columns and 'stars' in df.columns:
    x = df['text']
    y = df['stars']  # Assuming you want to predict the 'stars' column
else:
    raise ValueError("The dataframe must contain 'text' and 'stars' columns.")
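To see what the cleaning steps do, here is a minimal, self-contained sketch that applies the same regular expressions to one made-up review. The tiny SAMPLE_STOPWORDS set, the helper name clean_sample, and the sample sentence are assumptions for illustration; the code above uses NLTK's full English stopword list instead.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration;
# the article's code uses nltk's stopwords.words('english').
SAMPLE_STOPWORDS = {"the", "was", "and", "at", "it", "a"}

# Words containing 4 or more identical consecutive characters (e.g. "sooooo")
REPEATED_CHARS = re.compile(r'\b[a-z\d]*([a-z\d])\1{3,}[a-z\d]*\b', re.IGNORECASE)


def clean_sample(text, stopwords=SAMPLE_STOPWORDS):
    """Apply the same cleaning steps as clean_document to one string."""
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'%\S+', '', text)      # drop '%' plus attached characters
    text = re.sub(r'@\S+', '', text)      # drop '@' plus attached characters
    text = text.replace('-', '')          # join hyphenated words
    text = re.sub(r'\W+', ' ', text)      # punctuation runs -> single space
    words = text.lower().split()
    words = [w for w in words if w not in stopwords]
    return [w for w in words if not REPEATED_CHARS.match(w)]


sample = "The take-out was sooooo GOOD!!! Thanks @chef, see http://example.com"
cleaned = clean_sample(sample)
print(cleaned)  # ['takeout', 'good', 'thanks', 'see']
```

Note that "sooooo" is dropped entirely rather than normalized to "so", which is how the repeated-character filter in clean_document behaves as well.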
Feature Extraction
We use CountVectorizer with clean_document as its analyzer, so each raw review is cleaned and tokenized by our function and the resulting words are counted into a bag-of-words matrix.
from sklearn.feature_extraction.text import CountVectorizer
# Use CountVectorizer with the custom analyzer
bow_transformer = CountVectorizer(analyzer=clean_document).fit(x)
# Transform the text data into a bag-of-words model
X = bow_transformer.transform(x)
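To make the bag-of-words idea concrete, the following sketch builds the same kind of document-term count matrix by hand with the standard library. It only illustrates what a count-based vectorizer computes and is not a replacement for CountVectorizer; the two toy documents and the helper name bag_of_words are assumptions.

```python
from collections import Counter


def bag_of_words(tokenized_docs):
    """Return (vocabulary, rows): one count vector per document."""
    # Vocabulary: every distinct token, sorted so column order is deterministic
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing tokens count as 0
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows


# Toy documents, already cleaned and tokenized (hypothetical examples)
docs = [
    ["good", "food", "good", "service"],
    ["bad", "service"],
]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['bad', 'food', 'good', 'service']
print(matrix)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row corresponds to one review and each column to one vocabulary word. In the code above, X holds the same kind of counts, stored as a sparse matrix because most entries are zero.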
Splitting the Data
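We split the data into training and testing sets to evaluate the model's performance. Here 30% of the reviews are held out for testing (test_size=0.3), and random_state is fixed so the split is reproducible.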
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
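As a rough illustration of what test_size and random_state control, the sketch below performs a seeded shuffle-and-split by hand. It is a simplified stand-in, not scikit-learn's implementation, and it will not reproduce the exact rows that train_test_split selects; the helper name simple_split and the toy review list are hypothetical.

```python
import random


def simple_split(items, test_size=0.3, seed=101):
    """Shuffle indices with a fixed seed, then hold out a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)     # same seed -> same order
    n_test = round(len(items) * test_size)   # simplified sizing rule
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


reviews = [f"review {i}" for i in range(10)]
train, test = simple_split(reviews)
print(len(train), len(test))  # 7 3

# Re-running with the same seed gives the identical split
assert simple_split(reviews) == (train, test)
```

Fixing the seed matters when comparing models: every run is then evaluated on the same held-out reviews.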
Model Training and Evaluation
We train a GradientBoostingClassifier to predict the star rating of each review and measure its accuracy on the held-out test set.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Initialize and train the GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
model.fit(X_train, y_train)
# Predict the labels for the test set
y_pred = model.predict(X_test)
# Calculate the accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("The accuracy of the model:", acc)
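Accuracy on its own can be hard to interpret when some star ratings are far more common than others. As a hedged sketch, the code below compares accuracy against a majority-class baseline and tallies a confusion table using only the standard library. The label lists are made-up values standing in for y_test and y_pred; they are not results from the Yelp data.

```python
from collections import Counter

# Made-up labels standing in for y_test and y_pred (not real results)
y_true = [5, 5, 4, 1, 5, 3, 5, 1, 4, 5]
y_hat = [5, 4, 4, 1, 5, 5, 5, 2, 5, 5]

# Accuracy: fraction of predictions that equal the true label
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# Baseline: always predict the most frequent true label
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)

# Confusion table: (true, predicted) -> count
confusion = Counter(zip(y_true, y_hat))

print(f"accuracy={accuracy:.2f} baseline={baseline:.2f}")
for (t, p), n in sorted(confusion.items()):
    print(f"true={t} predicted={p} count={n}")
```

A model is only informative if its accuracy clearly exceeds the baseline. For real data, scikit-learn's confusion_matrix and classification_report in sklearn.metrics provide the same kind of breakdown.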
Conclusion
This case study on sentiment analysis has been an enriching experience, showcasing the importance of each step in the data science workflow. From data preprocessing to model training and evaluation, each part plays a vital role in building a reliable sentiment analysis model. By understanding and implementing these techniques, we can unlock the potential of data to reveal hidden sentiments and trends.
#SentimentAnalysis #Python #DataScience #MachineLearning #ContinuousLearning #AI #NaturalLanguageProcessing
Feel free to reach out if you have any questions or want to discuss more about this project!