Consumer Sentiment Analysis Using Machine Learning Algorithm


In today's digital landscape, understanding consumer sentiment is crucial for business success.

This project leverages Natural Language Processing (NLP) to analyze sentiment and gain insight into brand perception.


TL;DR

  • Gathered comments mentioning the target brand (a pre-built test dataset is used in this project)
  • Assigned sentiment labels (positive, negative, or neutral) to each comment using Natural Language Processing (NLP)
  • Used the Bernoulli Naive Bayes classifier to predict the sentiment of each comment; it returned an accuracy score of 0.84



Let's dive into the project:

Step 1. Data Acquisition

We'll begin by gathering tweets mentioning the target brand. Here are two options:

  1. Twitter API: Utilize the Twitter Developer API to extract tweets containing the brand's name and relevant hashtags.
  2. Pre-built Datasets: Explore publicly available Twitter datasets on platforms like Kaggle that might align with your brand.


In this project, we will use a pre-built test dataset:

Download the CSV file
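Loading the data might look like the sketch below. The file name and the column names ("text", "sentiment") are assumptions for illustration; a small in-memory CSV stands in for the downloaded file, so adjust both to match the dataset you actually use.

```python
import io
import pandas as pd

# In-memory stand-in for the downloaded CSV; in the real project you would
# pass the file path instead, e.g. pd.read_csv("tweets.csv").
sample_csv = io.StringIO(
    "text,sentiment\n"
    "I love this brand,positive\n"
    "Worst purchase ever,negative\n"
    "It arrived on Tuesday,neutral\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)                      # (3, 2)
print(df["sentiment"].value_counts())
```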


Step 2. Data Cleaning

Raw Twitter data often contains noise. Here's how we'll clean it:

  • Lowercasing: Convert all text to lowercase for consistency.
  • Stop Word Removal: Eliminate common words like "the," "a," and "is" that don't contribute to sentiment analysis.
  • Punctuation Removal: Remove punctuation marks like commas, periods, and exclamation points.

Cleaning dataset
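The three cleaning steps above can be sketched as a single helper function. This version uses scikit-learn's built-in English stop-word list to stay self-contained; the exact stop-word list is a design choice, and NLTK's `stopwords` corpus is a common alternative.

```python
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop common stop words."""
    text = text.lower()                                               # Lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # Punctuation removal
    words = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]  # Stop word removal
    return " ".join(words)

print(clean_text("The Movie was GREAT!"))  # -> "movie great"
```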

  • Stemming / Lemmatizing:

Reducing words to their base form converts raw text into a structured format for machine processing. Stemming truncates a word to its derived stem, while lemmatization maps a word to its dictionary root form, the lemma.

  • Stemming "running" => "run" (as a word)
  • Lemmatizing "running" => "run" (as a verb)

In this project, we will do both to get better results.

Stemming & lemmatizing tokenized text data
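A minimal sketch with NLTK, the usual tool for this step. The stemmer runs out of the box; lemmatization additionally needs the WordNet corpus, so that part is shown commented out with the one-time download it requires.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # -> "run"
print(stemmer.stem("studies"))  # -> "studi"  (stems are not always real words)

# Lemmatization needs the WordNet corpus; download it once, then:
# import nltk; nltk.download("wordnet")
# lemmatizer = WordNetLemmatizer()
# lemmatizer.lemmatize("running", pos="v")  # -> "run" (treated as a verb)
```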


Step 3. Sentiment Labeling

To train our model, we need labeled data. Here are two approaches:

  1. Manual Labeling: Manually classify a subset of tweets as positive, negative, or neutral. This can be time-consuming but ensures high-quality labels.
  2. Lexicon-Based Labeling: Utilize sentiment lexicons like VADER (Valence Aware Dictionary and sEntiment Reasoner) that assign sentiment scores to words.
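To make the lexicon-based idea concrete, here is a toy labeler in the spirit of VADER: sum per-word valence scores, then threshold the total. The tiny `LEXICON` dictionary is invented for illustration; a real lexicon scores thousands of words (e.g. NLTK's `SentimentIntensityAnalyzer` after `nltk.download("vader_lexicon")`).

```python
# Toy valence lexicon (illustrative values only).
LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0,
           "bad": -1.0, "awful": -2.0, "hate": -2.0}

def label_sentiment(text: str) -> str:
    """Sum word valences and map the total to a sentiment label."""
    score = sum(LEXICON.get(word, 0.0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_sentiment("i love this great brand"))  # -> "positive"
print(label_sentiment("awful service"))            # -> "negative"
print(label_sentiment("it arrived on tuesday"))    # -> "neutral"
```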

Wordcloud of positive comments


Step 4. Transforming the Test & Train Dataset

After splitting the data into 95% for training and 5% for testing, transform both sets using the TF-IDF vectorizer.


TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects a word's importance in a document relative to the entire document collection (corpus). It considers two factors:

  • Term Frequency (TF): How often a word appears in a specific document, and
  • Inverse Document Frequency (IDF): How common the word is across all documents. Rare words have higher IDF scores.
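The two factors can be computed by hand on a toy two-document corpus. This is the textbook formulation (tf = count / length, idf = log(N / df)); note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly.

```python
import math

# Toy corpus: two pre-tokenized documents.
docs = [["great", "phone", "great", "battery"], ["bad", "phone"]]

def tf(word, doc):
    """Term frequency: how often the word appears in this document."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Inverse document frequency: rare words across the corpus score higher."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / df)

# "great" appears twice in a 4-word document, and in 1 of 2 documents:
print(tf("great", docs[0]))                        # 0.5
print(idf("great", docs))                          # log(2/1) ≈ 0.693
print(tf("great", docs[0]) * idf("great", docs))   # ≈ 0.347

# "phone" appears in every document, so its IDF (and TF-IDF) is 0:
print(idf("phone", docs))                          # log(2/2) = 0.0
```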

Benefits of TF-IDF:

  • Useful in NLP tasks like topic modeling and machine learning tasks like classification.
  • Helps algorithms identify words that are more relevant to a specific document and less frequent overall, leading to better predictions.

Transforming the dataset with TF-IDF vectorizer
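A minimal sketch of the split-and-vectorize step with scikit-learn. The texts and labels here are invented stand-ins for the cleaned tweets, and `test_size` is 0.25 only so this tiny corpus yields a non-empty test set; the project itself uses a 95/5 split (`test_size=0.05`).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the cleaned tweets; labels are illustrative.
texts = ["love this brand", "great product", "awful service", "hate this product",
         "good battery life", "bad experience", "great value", "terrible quality"]
labels = ["positive", "positive", "negative", "negative",
          "positive", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Fit the vectorizer on the training data only, then reuse it on the test
# data so both sets share one vocabulary.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print(X_train_tfidf.shape)  # (6, vocabulary size)
```

Fitting on the training split alone avoids leaking test-set vocabulary statistics into the IDF weights.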



Step 5: Model Training and Evaluation

In this project, we will use the Bernoulli Naive Bayes classifier, which handles large datasets efficiently.

Bernoulli Naive Bayes is one of the Naive Bayes family of classifier algorithms, designed for tasks where you want to predict categories for data.

  • Relies on Bayes' theorem to calculate the probability of an event based on evidence: e.g., if an email contains certain words, classify it as spam.
  • Works best when the features are binary (0 or 1): e.g., an email is either spam (1) or not spam (0).
  • Useful for classifying short text data because it can effectively model the absence or presence of specific terms within the text.
  • Simple to understand and implement, and computationally efficient.
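The Bayes' theorem step can be worked through by hand with made-up numbers: suppose 40% of emails are spam, the word "free" appears in 60% of spam emails, and in 5% of non-spam emails.

```python
# Illustrative probabilities (invented for this example).
p_spam = 0.4
p_free_given_spam = 0.6
p_free_given_ham = 0.05

# P(free) via the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | free) = P(free | spam) * P(spam) / P(free)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # -> 0.889
```

Seeing "free" raises the spam probability from the 40% prior to roughly 89%; Naive Bayes multiplies such per-word evidence together under the independence assumption.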


After training the model, we evaluate its performance using the following methods:

  • Accuracy Score: The most basic metric, representing the percentage of correct predictions by the model. A high accuracy score (e.g., 90%) suggests the model makes good predictions in general.
  • Confusion Matrix with Plot: A visualization tool that breaks down the model's performance on a classification task. A good model will have high values on the diagonal (correct predictions) and low values off the diagonal (incorrect predictions).
  • ROC-AUC Curve: The Receiver Operating Characteristic (ROC) curve is a performance measure for classification models at various classification thresholds. A good model will have an ROC curve that stays close to the top-left corner, indicating a high TPR (correctly classifying positive cases) with a low FPR (incorrectly classifying negative cases). The Area Under the ROC Curve (AUC) summarizes the overall performance, with a higher AUC (closer to 1) indicating better performance.

Train and evaluate the model
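An end-to-end sketch of this step with scikit-learn, on a toy corpus invented for illustration (the real project reports 0.84 accuracy on its own test set). Plots are omitted here; `sklearn.metrics.ConfusionMatrixDisplay` and `RocCurveDisplay` can render the confusion matrix and ROC curve.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.naive_bayes import BernoulliNB

# Toy data standing in for the cleaned, labeled tweets.
train_texts = ["love this brand", "great product", "awful service",
               "hate this", "good battery", "bad experience"]
train_labels = ["positive", "positive", "negative",
                "negative", "positive", "negative"]
test_texts = ["great battery", "awful experience"]
test_labels = ["positive", "negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# BernoulliNB binarizes the features internally (nonzero TF-IDF -> 1),
# so it models only the presence or absence of each term.
model = BernoulliNB()
model.fit(X_train, train_labels)
pred = model.predict(X_test)

print(accuracy_score(test_labels, pred))             # 1.0 on this toy split
print(confusion_matrix(test_labels, pred))           # rows/cols: negative, positive

# ROC-AUC needs scores, not hard labels: use P(positive) per test document.
pos_idx = list(model.classes_).index("positive")
proba = model.predict_proba(X_test)[:, pos_idx]
y_true = [1 if lab == "positive" else 0 for lab in test_labels]
print(roc_auc_score(y_true, proba))
```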


Results

Data visualization is key to presenting insights. Create charts and graphs to showcase the distribution of sentiment towards the brand and identify trends over time.



Conclusion

By harnessing NLP, we can unlock valuable insights from social media data. This project demonstrates a practical approach to understanding consumer sentiment and informing brand strategy. However, there are limitations:

  • Bernoulli Naive Bayes assumes that features are independent, which can lead to inaccurate predictions in some cases; it also handles data with continuous features (numbers) or high-dimensional data (many features) poorly.
  • The Accuracy Score can be misleading in certain situations where there's a large class imbalance (e.g., mostly positive or negative examples) as a model can achieve high accuracy by simply predicting the majority class, even if it doesn't perform well on the minority class.
  • Depending on the application, some errors might be more critical than others, but accuracy doesn't tell us the nature of the errors.


Beyond the Basics

  • Explore more advanced algorithms like Support Vector Machines (SVMs) or Long Short-Term Memory (LSTM) networks for potentially better performance.
  • Analyze specific topics or emotions discussed alongside the brand sentiment.
  • Build a real-time sentiment analysis dashboard to continuously monitor brand perception.




