Deep Dive into Natural Language Processing: A Practical Approach with Spam SMS Classification

As I continue to explore the fascinating world of Natural Language Processing (NLP), one real-world application that stands out is spam SMS classification. With the vast amount of text data being generated every day, whether it's SMS, emails, or social media posts, identifying and filtering out spam has become critical for both users and businesses. In this post, I'll walk through how NLP can help in the detection of spam messages, using Python’s NLTK package and various other tools to preprocess text data for machine learning models.

What is NLP and Why Does It Matter?

NLP is essentially about enabling computers to read, interpret, and analyze human language in a way that is useful. However, a computer doesn’t naturally "understand" text. If you type “CAT” in one instance and “cat” in another, the computer treats them differently, even though we humans know they’re the same. This is why text preprocessing is crucial—to make text "computer-readable" so it can be analyzed effectively for business insights.

Steps to Preprocess Text Data:

To build a robust machine learning model for spam classification, the first step is to clean and structure the raw text data. Let’s break down the key preprocessing steps:

  • Punctuation Removal: Punctuation marks like commas, periods, and exclamation points usually contribute little to the actual meaning of a sentence. By removing them, we focus on the words that carry meaning. Using Python's string package, we can easily strip them out.

Punctuations from String Package
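Here is a minimal sketch of this step (the helper name and sample message are my own illustration, not from the original post):

import string

def remove_punctuation(text):
    # Keep only the characters that are not listed in string.punctuation
    return "".join(ch for ch in text if ch not in string.punctuation)

print(remove_punctuation("WINNER!! Claim your FREE prize now!!!"))
# WINNER Claim your FREE prize now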

  • Tokenization: Tokenization involves splitting sentences into individual words, turning raw text into a list of words (or tokens). For example, "This is a spam message!" becomes ["This", "is", "a", "spam", "message"]. This can be done using Python's re package or functions like split().


Split the string using re Package
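A short sketch using the re package (lowercasing first, so "CAT" and "cat" become the same token; the function name is my own):

import re

def tokenize(text):
    # Split on any run of non-word characters
    return re.split(r"\W+", text.lower())

print(tokenize("This is a spam message"))
# ['this', 'is', 'a', 'spam', 'message']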

  • Stop Word Removal: Stop words are common words like "the", "is", "and" that appear frequently in language but don’t contribute much to the meaning of a sentence. Removing them helps to reduce the feature space and speeds up the model training process. For this, we use the NLTK stopwords package.


Stopwords from stopwords Package
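A minimal sketch with NLTK's English stop word list, carrying over the token list from the tokenization example above:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop word lists
stop_words = set(stopwords.words("english"))

def remove_stopwords(tokens):
    # Keep only the tokens that are not common filler words
    return [t for t in tokens if t not in stop_words]

print(remove_stopwords(["this", "is", "a", "spam", "message"]))
# ['spam', 'message']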

  • Stemming and Lemmatization: Stemming chops off the ends of words to reach a root form, while lemmatization maps words to a real dictionary base form using vocabulary and grammatical context (e.g., "running" becomes "run" when treated as a verb). In our case, stemming can reduce variations such as "runs" and "running" to the root form "run", helping us generalize the dataset.


Stemming using PorterStemmer Package


Lemmatizing using WordNetLemmatizer Package
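A sketch contrasting the two methods (the example words are mine; note that the WordNet lemmatizer treats words as nouns unless you pass a part of speech):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download used by the lemmatizer

ps = PorterStemmer()
wn = WordNetLemmatizer()

print(ps.stem("goose"), ps.stem("geese"))            # goos gees  (stems need not be real words)
print(wn.lemmatize("goose"), wn.lemmatize("geese"))  # goose goose  (lemmas are dictionary words)
print(ps.stem("running"))                            # run
print(wn.lemmatize("running", pos="v"))              # run  (the part of speech supplies the context)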

These examples let us see how the two methods differ in practice.

SMS dataset after text preprocessing

SMS dataset with each stage of body_text cleaning

Text Vectorization:

Once the text is preprocessed, we need to convert it into numerical data that a machine learning model can understand. This process is called vectorization.

Count Vectorizer:

This method counts the frequency of each word in the text and represents it as a feature. For example, in our spam SMS dataset, words like "win" or "free" might appear frequently in spam messages, which could help the model identify them. Each column corresponds to a unique word appearing anywhere in the dataset, and each row represents one SMS; after vectorization we can see that about 8,107 features have been created.


CountVectorizer
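A minimal sketch with scikit-learn; the three sample messages are my own, and running the same call on the full SMS dataset is what produces the 8,107 features:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

messages = ["Win a free prize now", "Are we still on for lunch", "Free entry to win cash now"]

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(messages)  # sparse matrix: one row per SMS, one column per unique word

print(pd.DataFrame(X_counts.toarray(), columns=count_vect.get_feature_names_out()))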

TF-IDF Vectorizer:

Instead of simple frequency counts, TF-IDF assigns weight to words based on how often they appear in a specific message compared to the entire dataset. Words that are common across many messages will have lower weights, while words that are unique to specific messages will carry higher importance. This helps differentiate spam from non-spam messages.

TF-IDF Vectorizer

In the SMS dataset we can see that each cell now holds a weight rather than a raw frequency.

TF-IDF Vectorizer
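The same sketch with TF-IDF weights in place of counts (the sample messages are again my own):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

messages = ["Win a free prize now", "Are we still on for lunch", "Free entry to win cash now"]

tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(messages)

# Cells now hold weights rather than raw counts; words shared across
# messages ("free", "win", "now") are down-weighted relative to rarer ones.
print(pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vect.get_feature_names_out()).round(2))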

N-grams:

Sometimes single words (unigrams) aren’t enough. N-grams consider adjacent words together (bigrams, trigrams, etc.), which can provide richer context. For example, "free gift" might be a stronger indicator of spam than "free" alone.

For the SMS dataset I created a bigram vectorizer; the bigram combinations of adjacent words generate 31,275 features. We can also notice that each column is a combination of two words.

N-gram Vectorizer
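A sketch of a bigram vectorizer (two toy messages of my own; on the real dataset this is the call that produces the 31,275 bigram features):

from sklearn.feature_extraction.text import CountVectorizer

messages = ["Claim your free gift today", "Free delivery on your gift order"]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep unigrams as well
bigram_vect = CountVectorizer(ngram_range=(2, 2))
bigram_vect.fit(messages)

print(bigram_vect.get_feature_names_out())
# ['claim your' 'delivery on' 'free delivery' 'free gift' 'gift order'
#  'gift today' 'on your' 'your free' 'your gift']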


Feature Engineering for Spam Detection:

To improve the performance of our spam detection model, we can introduce additional features:

Message Length: Spam messages tend to be longer, so counting the number of characters in a message can be useful.

Feature Creation - Message Length
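A sketch of this feature, assuming the labeled SMS data sits in a DataFrame with a body_text column (the two sample rows are my own):

import pandas as pd

data = pd.DataFrame({"body_text": [
    "WINNER!! Claim your FREE prize now!!!",
    "Hey, are we still on for lunch today?",
]})

# Character count, excluding spaces
data["body_len"] = data["body_text"].apply(lambda x: len(x) - x.count(" "))
print(data[["body_text", "body_len"]])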

We can clearly see that spam messages tend to be longer than normal messages.

Punctuation Frequency: Spam messages often contain a high percentage of exclamation marks or other punctuations, so measuring this can help flag suspicious messages.


Feature Creation - Punctuation
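A sketch of the punctuation feature, applied to the same DataFrame as in the message-length example above:

import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(" ")), 3) * 100

data["punct%"] = data["body_text"].apply(count_punct)
print(data[["body_text", "punct%"]])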

Here, however, we can see that punctuation alone does not cleanly separate spam from non-spam messages.

By analyzing these features with histograms, we can test our hypotheses and refine the model's input data accordingly.

Building the Model:

After preprocessing, the next step is building the machine learning model. In this case, I used a Random Forest Classifier to classify SMS messages as spam or non-spam. Here's how we can make the model robust and scalable:

Cross-Validation:

Cross-validation helps ensure that the model performs well across different subsets of the data. For example, with 5-fold cross-validation, the data is split into five subsets: the model trains on four and tests on the remaining one, rotating the test fold so that every subset is used for evaluation exactly once.

Here we can see that the model performs well, producing accuracy close to or above 97%.

K Fold Cross validation
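A runnable sketch with a toy corpus standing in for the SMS data; the real vectorized features and labels from the steps above plug in the same way:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

messages = ["win a free prize now", "lunch at noon", "free cash claim now",
            "see you tomorrow", "urgent winner call now", "thanks for the ride"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

X = TfidfVectorizer().fit_transform(messages)

rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)
scores = cross_val_score(rf, X, labels, cv=3)  # use cv=5 on the full dataset
print(scores)  # one accuracy score per fold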

Hyperparameter Tuning:

By tuning the parameters of the Random Forest (such as the number of trees or the maximum depth), we can improve its performance. This is done using Grid Search CV, which helps us identify the best combination of hyperparameters for our specific dataset.

We can identify the best combination of hyperparameters using GridSearchCV.

GridSearchCV
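A sketch of the search, reusing X and labels from the cross-validation snippet; this particular grid is my own illustrative assumption, not the post's exact one:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values tried exhaustively in combination
param_grid = {
    "n_estimators": [10, 150, 300],
    "max_depth": [30, 60, None],
}

gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, n_jobs=-1)
gs_fit = gs.fit(X, labels)  # X, labels as defined in the cross-validation sketch
print(gs_fit.best_params_)  # best combination found for this data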

Final Model Building:

We can select the best-performing model by evaluating the results from cross-validation and fine-tuning it using hyperparameters identified through Grid Search CV. With these optimized values, we can then build our final model for deployment.


Final Model
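Finally, a sketch of training and evaluating the tuned model on a held-out split; the hyperparameter values below are placeholders for whatever GridSearchCV reports on the real data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# X, labels as defined in the sketches above; hold out a test set for the final check
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)

rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, pos_label="spam", average="binary")
print("Precision:", precision, "Recall:", recall, "F1:", fscore)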

Key Insights from the Spam SMS Example:

  • Preprocessing is vital for transforming raw text into something a machine learning model can understand.
  • Vectorization techniques like TF-IDF provide richer context than simple word counts, enabling more accurate spam detection.
  • Additional features such as message length and punctuation frequency can further enhance the model’s ability to distinguish between spam and non-spam messages.
  • Cross-validation and hyperparameter tuning are essential steps to ensure that the model generalizes well and performs consistently across different data samples.

Conclusion:

NLP offers a powerful way to extract insights from text data, whether it's identifying spam messages, understanding customer sentiment, or analyzing product reviews. By combining effective text preprocessing, vectorization, and robust machine learning models, we can turn unstructured text into actionable business intelligence.

#NLP #DataScience #MachineLearning #Python #SpamDetection #RandomForest #TextPreprocessing #TFIDF #FeatureEngineering #AI