Exploring Sentiment Analysis with Python: A Case Study

As part of my ongoing learning journey in Python, I recently completed a case study on Sentiment Analysis. This practical exercise provided invaluable insights into data preprocessing, feature extraction, model training, and evaluation. Below, I share the Python code and walk you through each part of the process, highlighting the importance of each step.

Sentiment Analysis

Sentiment analysis is a crucial task in natural language processing (NLP) that involves determining the sentiment or emotional tone behind a body of text. It is widely applied in customer feedback analysis, social media monitoring, and market research. For example, understanding the sentiment behind customer reviews can help businesses improve their products and services.

Loading and Preprocessing the Data

First, we load the dataset and perform initial preprocessing. The dataset used is yelp_review.csv, which contains Yelp reviews. You can download the dataset from Kaggle.

import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import string

# Download the stopwords from nltk
nltk.download('stopwords')

# Define the set of stopwords
stop = set(stopwords.words('english'))

# Define the path to your CSV file
csv_file_path = 'yelp_review.csv'

# Read the CSV file using pandas
df = pd.read_csv(csv_file_path)

# Display the first few rows of the dataframe
print(df.head())        

  • Loading and preprocessing the data is the foundation of any data science project. Cleaning the text data ensures that our models receive consistent and noise-free input, improving their accuracy and reliability.
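
Before diving into cleaning, it can help to run a few quick sanity checks on the raw data: its overall size, the distribution of star ratings, and whether any review texts are missing. A minimal sketch, assuming the standard Yelp columns 'text' and 'stars':

# Quick sanity checks on the raw data (column names assumed from the Yelp dataset)
print(df.shape)                     # number of rows and columns
print(df['stars'].value_counts())   # distribution of star ratings
print(df['text'].isnull().sum())    # count of missing review texts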

Cleaning the Text Data

Next, we define a function to clean the text data by removing URLs, special characters, and stopwords.

# Define a function to clean the document
def clean_document(doco):
    # Remove URLs
    doco = re.sub(r'http\S+', '', doco)
    # Remove tokens that start with %
    doco = re.sub(r'%\S+', '', doco)
    # Remove tokens that start with @ (e.g. mentions)
    doco = re.sub(r'@\S+', '', doco)
    # Remove hyphens
    doco = doco.replace('-', '')
    # Replace non-word characters with spaces
    doco = re.sub(r'\W+', ' ', doco)
    # Convert to lowercase
    doco = doco.lower()
    # Split the document into words
    doco_words = doco.split()
    # Remove stopwords
    doco_words = [word for word in doco_words if word not in stop]
    # Remove words with repeated characters (more than 3 consecutive identical characters)
    p = re.compile(r'\b[a-z\d]*([a-z\d])\1{3,}[a-z\d]*\b', re.IGNORECASE)
    doco_words = [word for word in doco_words if not p.match(word)]
    return doco_words

# Ensure the 'text' column is loaded correctly
if 'text' in df.columns:
    x = df['text']
    y = df['stars']  # Assuming you want to predict the 'stars' column
else:
    raise ValueError("The dataframe does not contain a 'text' column.")
        

  • Cleaning the text data is crucial for removing irrelevant information and focusing on the meaningful content, which significantly enhances the model's performance.
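
To see what the cleaning function actually produces, it helps to run it on a small, made-up review. The sentence below is purely illustrative, but it exercises the URL, mention, stopword, and repeated-character rules:

# Hypothetical review used only to illustrate the cleaning steps
sample = "The pizza was sooooo good!!! Check http://example.com and ask @friend"
print(clean_document(sample))
# Expected output: ['pizza', 'good', 'check', 'ask'] - the URL, the @-mention,
# punctuation, stopwords, and 'sooooo' (a long character repeat) are all removed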

Feature Extraction

We use the CountVectorizer to transform the cleaned text data into a bag-of-words model.

from sklearn.feature_extraction.text import CountVectorizer

# Use CountVectorizer with the custom analyzer
bow_transformer = CountVectorizer(analyzer=clean_document).fit(x)

# Transform the text data into a bag-of-words model
X = bow_transformer.transform(x)        

  • Feature extraction converts text data into numerical format, which is essential for machine learning algorithms to process and analyze the data effectively.
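
It is worth inspecting the resulting matrix before training: its shape tells you the vocabulary size, and a sample of the learned terms shows what the cleaning step left behind. Note that get_feature_names_out is the name used in recent scikit-learn releases; older versions call it get_feature_names:

# Inspect the bag-of-words matrix (X is a sparse matrix of word counts)
print(X.shape)                                        # (number of reviews, vocabulary size)
print(X.nnz)                                          # number of non-zero counts stored
print(bow_transformer.get_feature_names_out()[:20])   # a sample of the learned vocabulary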

Splitting the Data

We split the data into training and testing sets to evaluate the model's performance.

from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

  • Splitting the data ensures that we have a separate set of data to test the model's performance, helping us avoid overfitting and get an unbiased estimate of the model's accuracy.
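
A quick check of the split is also useful: the shapes confirm the 70/30 proportions, and comparing the class distributions shows whether the star ratings are balanced across the two sets. If they are not, passing stratify=y to train_test_split is one way to keep the proportions consistent (shown here only as an optional variant):

# Verify the split sizes and class balance
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))   # star-rating proportions in the training set
print(y_test.value_counts(normalize=True))    # star-rating proportions in the test set

# Optional variant: keep star-rating proportions identical in both splits
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.3, random_state=101, stratify=y)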

Model Training and Evaluation

We train a Gradient Boosting Classifier and evaluate its accuracy.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
model.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
acc = accuracy_score(y_test, y_pred)
print("The accuracy of the model:", acc)        

  • Training the model on the training data and evaluating it on the test data helps us measure the model's performance. High accuracy indicates that the model has learned the patterns in the data effectively.
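
Accuracy gives a single headline number; because star ratings are rarely balanced, a per-class view adds useful context. A short follow-up using scikit-learn's built-in metrics, shown here as a minimal sketch outside the original pipeline:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1 score for each star rating
print(classification_report(y_test, y_pred))

# Rows are true star ratings, columns are predicted star ratings
print(confusion_matrix(y_test, y_pred))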

Conclusion

This case study on sentiment analysis has been an enriching experience, showcasing the importance of each step in the data science workflow. From data preprocessing to model training and evaluation, each part plays a vital role in building a reliable sentiment analysis model. By understanding and implementing these techniques, we can unlock the potential of data to reveal hidden sentiments and trends.

#SentimentAnalysis #Python #DataScience #MachineLearning #ContinuousLearning #AI #NaturalLanguageProcessing

Feel free to reach out if you have any questions or want to discuss more about this project!
