Exploring Sentiment Analysis with Python: A Case Study

As part of my ongoing learning journey in Python, I recently completed a case study on Sentiment Analysis. This practical exercise provided invaluable insights into data preprocessing, feature extraction, model training, and evaluation. Below, I share the Python code and walk you through each part of the process, highlighting the importance of each step.

Sentiment Analysis

Sentiment analysis is a crucial task in natural language processing (NLP) that involves determining the sentiment or emotional tone behind a body of text. It is widely applied in customer feedback analysis, social media monitoring, and market research. For example, understanding the sentiment behind customer reviews can help businesses improve their products and services.

Loading and Preprocessing the Data

First, we load the dataset and perform initial preprocessing. The dataset used is yelp_review.csv, which contains Yelp reviews. You can download the dataset from Kaggle.

import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import string

# Download the stopwords from nltk
nltk.download('stopwords')

# Define the set of stopwords
stop = set(stopwords.words('english'))

# Define the path to your CSV file
csv_file_path = 'yelp_review.csv'

# Read the CSV file using pandas
df = pd.read_csv(csv_file_path)

# Display the first few rows of the dataframe
print(df.head())        

  • Loading and preprocessing the data is the foundation of any data science project. Cleaning the text data ensures that our models receive consistent and noise-free input, improving their accuracy and reliability.
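
Before diving into cleaning, it can help to run a few quick sanity checks on the raw data: its overall size, the distribution of star ratings, and whether any review texts are missing. A minimal sketch, assuming the standard Yelp columns 'text' and 'stars':

# Quick sanity checks on the raw data (column names assumed from the Yelp dataset)
print(df.shape)                     # number of rows and columns
print(df['stars'].value_counts())   # distribution of star ratings
print(df['text'].isnull().sum())    # count of missing review texts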

Cleaning the Text Data

Next, we define a function to clean the text data by removing URLs, special characters, and stopwords.

# Define a function to clean the document
def clean_document(doco):
    # Remove URLs
    doco = re.sub(r'http\S+', '', doco)
    # Remove tokens that start with %
    doco = re.sub(r'%\S+', '', doco)
    # Remove tokens that start with @ (e.g. mentions)
    doco = re.sub(r'@\S+', '', doco)
    # Remove hyphens
    doco = doco.replace('-', '')
    # Replace non-word characters with spaces
    doco = re.sub(r'\W+', ' ', doco)
    # Convert to lowercase
    doco = doco.lower()
    # Split the document into words
    doco_words = doco.split()
    # Remove stopwords
    doco_words = [word for word in doco_words if word not in stop]
    # Remove words with repeated characters (more than 3 consecutive identical characters)
    p = re.compile(r'\b[a-z\d]*([a-z\d])\1{3,}[a-z\d]*\b', re.IGNORECASE)
    doco_words = [word for word in doco_words if not p.match(word)]
    return doco_words

# Ensure the 'text' column is loaded correctly
if 'text' in df.columns:
    x = df['text']
    y = df['stars']  # Assuming you want to predict the 'stars' column
else:
    raise ValueError("The dataframe does not contain a 'text' column.")
        

  • Cleaning the text data is crucial for removing irrelevant information and focusing on the meaningful content, which significantly enhances the model's performance.
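
To see what the cleaning function actually produces, it helps to run it on a small, made-up review. The sentence below is purely illustrative, but it exercises the URL, mention, stopword, and repeated-character rules:

# Hypothetical review used only to illustrate the cleaning steps
sample = "The pizza was sooooo good!!! Check http://example.com and ask @friend"
print(clean_document(sample))
# Expected output: ['pizza', 'good', 'check', 'ask'] - the URL, the @-mention,
# punctuation, stopwords, and 'sooooo' (a long character repeat) are all removed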

Feature Extraction

We use the CountVectorizer to transform the cleaned text data into a bag-of-words model.

from sklearn.feature_extraction.text import CountVectorizer

# Use CountVectorizer with the custom analyzer
bow_transformer = CountVectorizer(analyzer=clean_document).fit(x)

# Transform the text data into a bag-of-words model
X = bow_transformer.transform(x)        

  • Feature extraction converts text data into numerical format, which is essential for machine learning algorithms to process and analyze the data effectively.
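
It is worth inspecting the resulting matrix before training: its shape tells you the vocabulary size, and a sample of the learned terms shows what the cleaning step left behind. Note that get_feature_names_out is the name used in recent scikit-learn releases; older versions call it get_feature_names:

# Inspect the bag-of-words matrix (X is a sparse matrix of word counts)
print(X.shape)                                        # (number of reviews, vocabulary size)
print(X.nnz)                                          # number of non-zero counts stored
print(bow_transformer.get_feature_names_out()[:20])   # a sample of the learned vocabulary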

Splitting the Data

We split the data into training and testing sets to evaluate the model's performance.

from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

  • Splitting the data ensures that we have a separate set of data to test the model's performance, helping us avoid overfitting and get an unbiased estimate of the model's accuracy.
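
A quick check of the split is also useful: the shapes confirm the 70/30 proportions, and comparing the class distributions shows whether the star ratings are balanced across the two sets. If they are not, passing stratify=y to train_test_split is one way to keep the proportions consistent (shown here only as an optional variant):

# Verify the split sizes and class balance
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))   # star-rating proportions in the training set
print(y_test.value_counts(normalize=True))    # star-rating proportions in the test set

# Optional variant: keep star-rating proportions identical in both splits
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.3, random_state=101, stratify=y)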

Model Training and Evaluation

We train a Gradient Boosting Classifier and evaluate its accuracy.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
model.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("The accuracy of the model:", acc)        

  • Training the model on the training data and evaluating it on the test data helps us measure the model's performance. High accuracy indicates that the model has learned the patterns in the data effectively.
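
Accuracy gives a single headline number; because star ratings are rarely balanced, a per-class view adds useful context. A short follow-up using scikit-learn's built-in metrics, shown here as a minimal sketch outside the original pipeline:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 score for each star rating
print(classification_report(y_test, y_pred))

# Rows are true star ratings, columns are predicted star ratings
print(confusion_matrix(y_test, y_pred))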

Conclusion

This case study on sentiment analysis has been an enriching experience, showcasing the importance of each step in the data science workflow. From data preprocessing to model training and evaluation, each part plays a vital role in building a reliable sentiment analysis model. By understanding and implementing these techniques, we can unlock the potential of data to reveal hidden sentiments and trends.

#SentimentAnalysis #Python #DataScience #MachineLearning #ContinuousLearning #AI #NaturalLanguageProcessing

Feel free to reach out if you have any questions or want to discuss more about this project!
