Natural Language Processing: Linear Text Classification
RISHABH SINGH
Linear classification refers to using a straight line (or hyperplane in higher dimensions) to separate different classes in a dataset. It’s one of the simplest and most interpretable models for classification tasks, where we try to predict the category or class of a given input based on its features.
In NLP, this often involves text classification, where the goal is to assign a category (like “spam” or “not spam”) to a piece of text.
Representation:
To apply machine learning models to text, we need to convert the text into a numerical format that models can understand. One common way to do this is by using the Bag of Words (BoW) approach.
Bag of Words (BoW)
The Bag of Words model represents a piece of text by the words it contains and how often each one appears, ignoring grammar and word order.
Example of Bag of Words:
Imagine we have two short sentences, such as “I love NLP” and “I love machine learning”.
To represent these sentences as a bag of words, we list all unique words in the dataset (here: I, love, NLP, machine, learning) and count their occurrences in each sentence, so the first sentence becomes [1, 1, 1, 0, 0] and the second becomes [1, 1, 0, 1, 1].
This gives a simple numeric representation of the text, and we can feed the resulting matrix into machine learning models.
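As a minimal sketch (using the two example sentences above), the same count matrix can be built with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# Two example sentences (assumed for illustration)
sentences = ["I love NLP", "I love machine learning"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
# Note: CountVectorizer's default tokenizer drops single-character tokens like "I"
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per sentence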
Learning Algorithms:
There are several algorithms for learning from this representation. Let’s briefly discuss three of them (Naive Bayes, Perceptron, Logistic Regression), with short code examples for better understanding.
a) Naive Bayes
Naive Bayes is a probabilistic classifier based on Bayes’ theorem. It assumes that all features (inputs) are independent of each other, meaning that the presence (or absence) of one feature doesn’t affect the others, which makes it “naive.”
Understanding Bayes’ Theorem
Before diving into Naive Bayes, let’s break down Bayes’ Theorem, the mathematical foundation of the Naive Bayes classifier. Bayes’ Theorem helps us calculate the probability of an event happening, given some prior knowledge.
The formula for Bayes’ Theorem is:

P(A|B) = P(B|A) × P(A) / P(B)

Where:
- P(A|B) is the posterior probability: the probability of event A given that B has occurred.
- P(B|A) is the likelihood: the probability of observing B given that A is true.
- P(A) is the prior probability of A.
- P(B) is the evidence: the overall probability of observing B.

In the context of Naive Bayes:
- A is the class label (for example, spam or not spam).
- B is the document we want to classify (the words it contains).
- We compute the posterior for each class and pick the class with the highest value.
Example: Classifying Emails as “Spam” or “Not Spam”
Let’s walk through a simple example of using Naive Bayes to classify emails as either spam or not spam.
Step 1: Training Data
Imagine we have a small set of training emails, each labeled as spam or not spam, and for each email we count the occurrences of specific words (for example, “free”, “win”, and “meeting”).
Step 2: Calculate Probabilities
First, we compute the prior probability of each class; then, for each word, we calculate the likelihood (the probability of the word appearing in spam and in not-spam emails):
1. Prior Probability of Each Class: the fraction of training emails labeled spam and the fraction labeled not spam.
2. Likelihood of Each Word in Spam Emails: for each word, the probability that it appears in an email given that the email is spam.
3. Likelihood of Each Word in Not Spam Emails: for each word, the probability that it appears in an email given that the email is not spam.
Naive Bayes performs text classification in the following way:
- Prior Probability: the probability of each class (spam/not spam, positive/negative, etc.) based on the overall distribution of the labels in the training data.
- Likelihood Probability: the probability of each word occurring in the document, given a class.
These probabilities are learned from the training data by counting how often each word appears in documents of each class. To classify a new document, the prior of each class is multiplied by the likelihoods of the document’s words, and the class with the highest resulting score is chosen.
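For intuition, here is a rough sketch of how these counts could be computed by hand, using a tiny made-up dataset and add-one (Laplace) smoothing (an extra detail not covered above, but standard in practice):
from collections import Counter
# Tiny made-up training set: (text, label) pairs
emails = [
    ("win free money now", "spam"),
    ("free offer just for you", "spam"),
    ("meeting agenda for monday", "not_spam"),
    ("lunch at noon today", "not_spam"),
]
labels = [label for _, label in emails]
classes = set(labels)
# Prior: fraction of training emails in each class
priors = {c: labels.count(c) / len(labels) for c in classes}
# Word counts per class
word_counts = {c: Counter() for c in classes}
for text, label in emails:
    word_counts[label].update(text.split())
vocab = {w for text, _ in emails for w in text.split()}
# Likelihood of a word given a class, with add-one smoothing
def likelihood(word, c):
    return (word_counts[c][word] + 1) / (sum(word_counts[c].values()) + len(vocab))
print(priors)
print(likelihood("free", "spam"), likelihood("free", "not_spam"))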
Code Example: Naive Bayes Text Classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Sample data
texts = ["Natural language processing is fascinating",
"I love learning about natural language", "Python is great for NLP"]
labels = [1, 1, 0] # 1 for language-related, 0 for others
# Convert text to numerical format (Bag of Words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)
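To classify a new, unseen sentence (a made-up example here), the same fitted vectorizer has to transform it before the model can predict:
# Transform new text with the already-fitted vectorizer, then predict its label
new_text = ["I enjoy studying natural language"]
new_X = vectorizer.transform(new_text)
print(model.predict(new_X))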
b) Perceptron
The perceptron is one of the simplest types of neural networks. It consists of a single layer of input nodes fully connected to a layer of output nodes, and it makes predictions by finding a linear boundary between classes.
There are 2 types of perceptron: Single-layer Perceptron and Multilayer Perceptron.
A basic perceptron takes in some numerical inputs along with what are known as weights and a bias. It multiplies each input by its corresponding weight and adds the products together with the bias; this total is known as the weighted sum. An activation function (typically a step function) then takes the weighted sum and returns the final output, for example 1 if the sum is above a threshold and 0 otherwise.
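As a bare-bones numerical sketch of that computation (with made-up inputs and weights):
# Perceptron forward pass: weighted sum of inputs plus bias, then a step activation
inputs = [1.0, 0.0, 2.0]            # e.g. word counts for a document (made-up)
weights = [0.5, -0.3, 0.2]          # one weight per input feature (made-up)
bias = -0.4
weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
output = 1 if weighted_sum > 0 else 0   # step activation
print(weighted_sum, output)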
In scikit-learn, a perceptron performs text classification in the following way:
from sklearn.linear_model import Perceptron
# Same data as above
model = Perceptron()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)
c) Logistic Regression
Although it has “regression” in its name, logistic regression is used for classification tasks. It models the probability of an instance belonging to a certain class.
For a document represented by a feature vector x, logistic regression computes:

P(y = 1 | x) = σ(w · x + b) = 1 / (1 + e^(-(w · x + b)))

Where:
- x is the vector of input features (the word vector for the document).
- w is the vector of weights associated with each feature.
- b is a bias term.
- The output is a probability between 0 and 1 that represents the likelihood of the document being positive.
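As a quick numerical sketch (with made-up weights), the sigmoid squashes the weighted sum into a probability:
import math
# Logistic regression scoring: sigmoid of the weighted sum plus bias
x = [2, 0, 1]                    # word counts for a document (made-up)
w = [0.8, -0.5, 0.3]             # learned weights (made-up)
b = -0.2
z = sum(xi * wi for xi, wi in zip(x, w)) + b
probability = 1 / (1 + math.exp(-z))   # a value between 0 and 1
print(probability)               # above 0.5 means the document is classified as positive
In practice, scikit-learn's LogisticRegression learns w and b for us: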
from sklearn.linear_model import LogisticRegression
# Same data as above
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)
Practical Example of Text Classification
Let’s consider a simple practical example of classifying movie reviews as positive or negative using logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Example movie reviews
reviews = ["I love this movie", "This was a terrible movie",
"Absolutely fantastic!", "Not great, very boring"]
labels = [1, 0, 1, 0] # 1 for positive, 0 for negative
# Convert text to Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)
This simple example takes a few movie reviews, transforms them into a numerical format using Bag of Words, and then uses logistic regression to classify them as positive or negative.
While Bag of Words is a simple and widely used method, there are more advanced techniques for text representation in NLP. One such technique is TF-IDF (Term Frequency-Inverse Document Frequency), a variant of BoW that gives more weight to rare words and less to common ones.
1. What is TF-IDF?
TF-IDF is a numerical statistic used in NLP to evaluate how important a word is within a document compared to a collection of documents (called a corpus). The idea behind TF-IDF is to highlight words that are frequent in a specific document but less common across many documents, making them more important or unique to that document.
TF-IDF Components:
- Term Frequency (TF): how often a word appears in a document, usually normalized by the total number of words in that document.
- Inverse Document Frequency (IDF): how rare a word is across the corpus, typically the logarithm of the total number of documents divided by the number of documents that contain the word.
TF-IDF Score:
The final TF-IDF score for a word in a document is calculated by multiplying its TF and IDF values:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Words that are frequent in a single document but not common in many documents will have higher TF-IDF scores.
2. Example of TF-IDF Calculation:
Imagine you have a corpus of three documents, and the word “machine” appears in the first document as well as in one other document.
Goal:
We want to calculate the TF-IDF score for the word “machine” in the first document.
Step-by-Step Calculation (using one set of counts consistent with this setup):
1. Term Frequency: suppose “machine” appears once in Document 1, which contains 4 words, so TF = 1/4 = 0.25.
2. Inverse Document Frequency: “machine” appears in 2 of the 3 documents, so IDF = log10(3/2) ≈ 0.176.
3. TF-IDF = TF × IDF = 0.25 × 0.176 ≈ 0.044.
Thus, the TF-IDF score for “machine” in Document 1 is approximately 0.044.
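That arithmetic in Python, using the assumed counts above:
import math
# TF-IDF for "machine" in Document 1, with the assumed counts
tf = 1 / 4                  # "machine" appears once among the 4 words of Document 1
idf = math.log10(3 / 2)     # 3 documents in total, "machine" appears in 2 of them
print(round(tf * idf, 3))   # ~0.044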
3. Practical Example in Python
The TF-IDF concept is implemented in Python using the TfidfVectorizer from the scikit-learn library. Below is a step-by-step guide on how to implement TF-IDF and use it for text classification.
Code Walkthrough:
Import Required Libraries: We need several libraries for vectorization (TF-IDF) and for applying machine learning models like Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
Sample Data: Define a small set of labeled documents (replace this with your own dataset).
documents = ["This is a positive document.", "Negative sentiment detected.",
"Another positive example.", "Negative review here." ]
labels = [1, 0, 1, 0] # 1 for positive, 0 for negative
Split the Data: To train and evaluate the model, split the data into training and testing sets using the train_test_split function.
X_train, X_test, y_train, y_test = train_test_split(documents, labels,
test_size=0.2, random_state=42)
TF-IDF Vectorization: Now, we convert the text data into numerical format using TF-IDF.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Train the Classifier: Here I use the Multinomial Naive Bayes algorithm, which is often used for text classification, but you could just as well train logistic regression, an SVM, an MLP, or any other classifier on the same TF-IDF features.
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
Make Predictions: After training the model, we use it to predict the labels of the test data.
predictions = classifier.predict(X_test_tfidf)
Evaluate the Model: We measure how well the classifier performed using the accuracy score and a classification report.
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}") print("Classification Report:\n", report)
Practical Example Summary:
Here’s the full code for the practical example again:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Sample data
documents = [
"This is a positive document.",
"Negative sentiment detected.",
"Another positive example.",
"Negative review here."
]
labels = [1, 0, 1, 0] # 1 for positive, 0 for negative
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Train the Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
# Make predictions
predictions = classifier.predict(X_test_tfidf)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)