Natural Language Processing: Linear Text Classification

Linear classification refers to using a straight line (or hyperplane in higher dimensions) to separate different classes in a dataset. It’s one of the simplest and most interpretable models for classification tasks, where we try to predict the category or class of a given input based on its features.

In NLP, this often involves text classification, where the goal is to assign a category (like “spam” or “not spam”) to a piece of text.

Representation:

To apply machine learning models to text, we need to convert the text into a numerical format that models can understand. One common way to do this is by using the Bag of Words (BoW) approach.

Bag of Words (BoW)

  1. BoW represents text data by counting the occurrences of words in a document.
  2. It disregards grammar and word order, treating each document as a collection of words.
  3. The output is a sparse matrix where each row represents a document, and each column represents a unique word in the corpus.
  4. The value in each cell typically indicates the frequency of a word in the corresponding document.
  5. BoW does not consider the importance of words in the document or across the corpus; it only counts their occurrences.

Example of Bag of Words:

Imagine we have the following sentences:

  • “Natural language processing is fascinating.”
  • “I love learning about natural language.”

To represent these sentences as a bag of words, we list all unique words in the dataset and count their occurrences in each sentence:

Word        | Sentence 1 | Sentence 2
natural     | 1          | 1
language    | 1          | 1
processing  | 1          | 0
is          | 1          | 0
fascinating | 1          | 0
i           | 0          | 1
love        | 0          | 1
learning    | 0          | 1
about       | 0          | 1

This is a simple numeric representation of text, and we can feed this matrix into machine learning models.
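
The same counts can be produced in code. Below is a minimal sketch using scikit-learn's CountVectorizer (note that its default tokenizer lowercases the text and drops single-character tokens such as "I", so the learned vocabulary may differ slightly from the table above):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Natural language processing is fascinating.",
    "I love learning about natural language.",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the vocabulary learned from the corpus
print(bow.toarray())                       # one row of word counts per sentence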

Learning Algorithms:

There are several algorithms for learning from this representation. Let’s briefly discuss some algorithms (Naive Bayes, Perceptron, Logistic Regression) with some code examples for better understanding.

a) Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes’ theorem. It assumes that all features (inputs) are independent of each other, meaning that the presence (or absence) of one feature doesn’t affect the others, which makes it “naive.”

Understanding Bayes’ Theorem

Before diving into Naive Bayes, let’s break down Bayes’ Theorem, the mathematical foundation of the Naive Bayes classifier. Bayes’ Theorem helps us calculate the probability of an event happening, given some prior knowledge.

The formula for Bayes’ Theorem is:

P(x | y) = [P(y | x) × P(x)] / P(y)

Where:

  • P(x | y) is the posterior probability: the probability of event “x” happening given that “y” is true.
  • P(y | x) is the likelihood: the probability of event “y” happening given that “x” is true.
  • P(x) is the prior probability: the initial probability of event “x” happening.
  • P(y) is the evidence: the probability of event “y” happening.

In the context of Naive Bayes:

  • “x” is the class (e.g., spam or not spam).
  • “y” is the feature or evidence (e.g., words in an email).

Example: Classifying Emails as “Spam” or “Not Spam”

Let’s walk through a simple example of using Naive Bayes to classify emails as either spam or not spam.

Step 1: Training Data

Imagine we have the following training data, where each email is labeled as spam or not spam, and we record which of the words “Buy”, “Offer”, and “Hello” appears in it (a small illustrative table, consistent with the probabilities computed below):

Email   | Buy | Offer | Hello | Label
Email 1 | 1   | 1     | 0     | Spam
Email 2 | 0   | 1     | 0     | Spam
Email 3 | 0   | 0     | 1     | Not Spam
Email 4 | 0   | 0     | 1     | Not Spam

Step 2: Calculate Probabilities

We first compute the prior probability of each class, and then, for each word, the likelihood (the probability of that word appearing in spam or not-spam emails).

  1. Prior Probabilities (how likely each class is):

  • P(Spam) = 2/4 = 0.5
  • P(Not Spam) = 2/4 = 0.5

  2. Likelihood of Each Word in Spam Emails:

  • P(Buy | Spam) = 1/2 = 0.5
  • P(Offer | Spam) = 2/2 = 1.0
  • P(Hello | Spam) = 0/2 = 0.0

  3. Likelihood of Each Word in Not Spam Emails:

  • P(Buy | Not Spam) = 0/2 = 0.0
  • P(Offer | Not Spam) = 0/2 = 0.0
  • P(Hello | Not Spam) = 2/2 = 1.0
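
Using these numbers, we can classify a new email. As an illustrative case (not one of the emails above), suppose a new email contains the words “Buy” and “Offer”:

Score(Spam) = P(Spam) × P(Buy | Spam) × P(Offer | Spam) = 0.5 × 0.5 × 1.0 = 0.25
Score(Not Spam) = P(Not Spam) × P(Buy | Not Spam) × P(Offer | Not Spam) = 0.5 × 0.0 × 0.0 = 0.0

Since 0.25 > 0.0, the email is classified as spam. Note that a zero likelihood such as P(Hello | Spam) = 0 would wipe out an entire product; in practice this is avoided with Laplace (add-one) smoothing, which scikit-learn’s MultinomialNB applies by default through its alpha parameter.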

Naive Bayes performs text classification in the following way:

  • Text Preprocessing: Tokenize text, remove stopwords, and apply lemmatization.
  • Vectorization: Convert text into numeric vectors.
  • After vectorization, each document is represented as a bag of words (counts or weighted scores).
  • Naive Bayes assumes that features (words) are independent of each other given the class label (sentiment). This is known as the “naive” assumption.
  • For text classification, it treats each word in the document as contributing independently to the probability of a document being positive or negative.
  • Naive Bayes calculates two key types of probabilities:

  - Prior Probability: The probability of each class (positive or negative) based on the overall distribution of the labels in the training data.

  - Likelihood Probability: The probability of each word occurring in the document, given a class.

These probabilities are learned from the training data by counting how often each word appears in documents of each class.

  • For a new document, Naive Bayes uses Bayes’ Theorem to calculate the probability that the document belongs to each class (positive or negative):

P(Positive | document) ∝ P(Positive) × P(word_1 | Positive) × P(word_2 | Positive) × … × P(word_n | Positive)

  • The same calculation is done for the negative class, and the class with the higher probability is chosen as the predicted label.
  • Class Prediction: The class with the highest posterior probability is selected. If P(Positive | document) > P(Negative | document), the review is classified as positive; otherwise it is classified as negative.

Code Example: Naive Bayes Text Classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample data
texts = [
    "Natural language processing is fascinating",
    "I love learning about natural language",
    "Python is great for NLP",
]

labels = [1, 1, 0]  # 1 for language-related, 0 for others

# Convert text to numerical format (Bag of Words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
print(predictions)        
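
As a small follow-up (a sketch reusing the fitted vectorizer and model from above; the example sentence is made up):

# Classify a new, unseen sentence with the trained model
new_text = ["I enjoy studying natural language"]   # hypothetical new input
new_X = vectorizer.transform(new_text)             # reuse the vocabulary learned by fit_transform
print(model.predict(new_X))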

b) Perceptron

The perceptron is one of the simplest types of neural networks, consisting of a single layer of input nodes that are fully connected to a layer of output nodes, which makes predictions by finding a linear boundary between classes.

There are two types of perceptrons: the single-layer perceptron and the multilayer perceptron (MLP).

A basic perceptron works by taking in some numerical inputs along with weights and a bias. It multiplies each input by its respective weight and adds these products together with the bias (this is known as the weighted sum). An activation function then takes this weighted sum and returns the final output.
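
As a quick numeric sketch of that computation (the inputs, weights, bias, and step activation below are made-up values, not learned ones):

import numpy as np

x = np.array([1.0, 0.0, 2.0])    # input features (e.g., word counts)
w = np.array([0.4, -0.3, 0.2])   # one weight per feature
b = -0.5                         # bias term

z = np.dot(w, x) + b             # weighted sum plus bias
output = 1 if z > 0 else 0       # step activation function
print(z, output)                 # z ≈ 0.3 -> predicts class 1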

The multilayer perceptron (MLP) performs text classification in the following way:

  • Text Preprocessing: Tokenize text, remove stopwords, apply lemmatization to reduce words to their base forms.
  • Vectorization: Convert text into numerical vectors using Bag Of Words.
  • Vectorized text is fed into the input layer, where each word or feature is represented as a numeric value.
  • Each word in the vocabulary corresponds to a neuron in the input layer.
  • Input passes through one or more hidden layers.
  • Each neuron in the hidden layer computes a weighted sum of its inputs and applies an activation function (like ReLU or Sigmoid), which introduces non-linearity.
  • Weights are adjusted to learn the relationship between words and sentiment.
  • Forward Propagation: Input flows through the network, producing predictions (e.g., positive or negative).
  • Loss Function: Measures the error between predicted and actual labels (e.g., using cross-entropy).
  • Backpropagation: The error is propagated back through the network, and weights are adjusted using optimization algorithms (e.g., gradient descent) to minimize the error.
  • After training, new input text is passed through the network, which uses the learned weights to predict sentiment.
  • MLP generalizes well to unseen data, making accurate predictions for text classification (e.g., movie reviews).

from sklearn.linear_model import Perceptron
# Reuses X_train, X_test, y_train from the Naive Bayes example above
model = Perceptron()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)        
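
The snippet above uses the single-layer Perceptron. For the multilayer perceptron described in the list, a minimal sketch with scikit-learn's MLPClassifier (hyperparameters chosen arbitrarily for illustration) could look like this:

from sklearn.neural_network import MLPClassifier

# Reuses X_train, X_test, y_train from the Naive Bayes example above
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.predict(X_test))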

c) Logistic Regression

Although it has “regression” in its name, logistic regression is used for classification tasks. It models the probability of an instance belonging to a certain class.

  • Text Preprocessing: Tokenize the text, remove stopwords, and apply lemmatization to reduce words to their base forms.
  • Vectorization: Convert text into numerical vectors.
  • After vectorization, each document is represented as a feature vector, where each feature corresponds to the frequency or importance of a specific word in the text.
  • Logistic Regression is a linear model that predicts the probability of a binary class (e.g., positive or negative sentiment) by fitting a linear decision boundary between the two classes.
  • It works by modeling the relationship between the input features (words) and the class labels using a logistic (sigmoid) function.

P(positive | x) = 1 / (1 + e^(-(w · x + b)))

Where:

  • x is the vector of input features (the word vector for the document).
  • w is the vector of weights associated with each feature.
  • b is a bias term.
  • The output is a probability between 0 and 1 that represents the likelihood of the document being positive.
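
As a tiny numeric illustration of this formula (the weights and features below are made up, not learned):

import numpy as np

x = np.array([1.0, 2.0])        # hypothetical feature vector
w = np.array([0.8, -0.4])       # hypothetical learned weights
b = 0.1                         # bias term

z = np.dot(w, x) + b            # linear score w · x + b
p = 1 / (1 + np.exp(-z))        # sigmoid maps the score to a probability
print(round(p, 3))              # ≈ 0.525 -> slightly more likely positive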

  • Logistic Regression learns the weights w for each word (or feature) in the document during training. It does this by maximizing the likelihood of the correct class (positive or negative) using the logistic function.
  • The algorithm adjusts the weights so that documents containing words strongly associated with positive reviews (e.g., “excellent,” “amazing”) will have higher probabilities of being classified as positive, and similarly for negative reviews.
  • Optimization (Gradient Descent): During training, Logistic Regression uses an optimization algorithm like gradient descent to minimize the error (difference between predicted probabilities and true labels) by adjusting the weights w and bias b.
  • The goal is to find the optimal set of weights that best separates positive and negative reviews.
  • Loss Function: Logistic Regression uses binary cross-entropy (log loss) to measure how far the predictions are from the true labels. The algorithm iteratively adjusts the weights to reduce this loss.
  • Once trained, Logistic Regression can be used to classify new documents.

  1. The new document is preprocessed and vectorized into a feature vector.
  2. The vector is fed into the model, which applies the learned weights w and bias b to compute the logistic function.
  3. The output is a probability that the document belongs to the positive class.

  • Decision Rule: If the output probability is greater than 0.5, the document is classified as positive. If the probability is less than 0.5, the document is classified as negative.

from sklearn.linear_model import LogisticRegression
# Reuses X_train, X_test, y_train from the Naive Bayes example above
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
print(predictions)        
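
To see the probabilities behind the decision rule described above, a short sketch using the trained model's predict_proba (same data as above):

probs = model.predict_proba(X_test)        # column 1 holds P(class = 1 | document)
print(probs)
print((probs[:, 1] > 0.5).astype(int))     # apply the 0.5 decision threshold manually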

Practical Example of Text Classification

Let’s consider a simple practical example of classifying movie reviews as positive or negative using logistic regression.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example movie reviews
reviews = [
    "I love this movie",
    "This was a terrible movie",
    "Absolutely fantastic!",
    "Not great, very boring",
]

labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Convert text to Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)

# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
print(predictions)        

This simple example takes a few movie reviews, transforms them into a numerical format using Bag of Words, and then uses logistic regression to classify them as positive or negative.

While Bag of Words is a simple and widely used method, there are more advanced techniques for text representation in NLP. One such technique is TF-IDF (Term Frequency-Inverse Document Frequency), a variant of BoW that gives more weight to rare words and less to common ones.

1. What is TF-IDF?

TF-IDF is a numerical statistic used in NLP to evaluate how important a word is within a document compared to a collection of documents (called a corpus). The idea behind TF-IDF is to highlight words that are frequent in a specific document but less common across many documents, making them more important or unique to that document.

TF-IDF Components:

  • Term Frequency (TF): This measures how frequently a term appears in a document. It is the count of the word in the document, usually normalized by the total number of words in that document.

  • Inverse Document Frequency (IDF): This reduces the weight of words that appear frequently across multiple documents. A word that appears in many documents is considered less informative.

TF-IDF Score:

The final TF-IDF score for a word in a document is calculated by multiplying its TF and IDF values:

TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / df(t)), N is the total number of documents in the corpus, and df(t) is the number of documents containing the term t.

Words that are frequent in a single document but not common in many documents will have higher TF-IDF scores.

2. Example of TF-IDF Calculation:

Imagine you have a corpus of three documents:

  1. “I love machine learning.”
  2. “I love deep learning.”
  3. “Deep blue is a chess machine.”

Goal:

We want to calculate the TF-IDF score for the word “machine” in the first document.

Step-by-Step Calculation:

  • Step 1: Calculate Term Frequency (TF). In the first document, the word “machine” appears once out of a total of 4 words, so:

TF(“machine”, Document 1) = 1/4 = 0.25

  • Step 2: Calculate Inverse Document Frequency (IDF). The word “machine” appears in 2 out of 3 documents (Document 1 and Document 3), so:

IDF(“machine”) = log(3/2) ≈ 0.176 (using log base 10)

  • Step 3: Calculate TF-IDF. Multiply TF and IDF:

TF-IDF(“machine”, Document 1) = 0.25 × 0.176 ≈ 0.044

Thus, the TF-IDF score for “machine” in Document 1 is approximately 0.044.
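
A quick way to sanity-check this arithmetic in Python (assuming the plain TF and log-base-10 IDF used above; note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalization by default, so its scores will not match this hand calculation exactly):

import math

tf = 1 / 4                  # "machine" appears once among the 4 words of Document 1
idf = math.log10(3 / 2)     # "machine" appears in 2 of the 3 documents
print(round(tf * idf, 3))   # 0.044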


3. Practical Example in Python

The TF-IDF concept is implemented in Python using the TfidfVectorizer from the scikit-learn library. Below is a step-by-step guide on how to implement TF-IDF and use it for text classification.

Code Walkthrough:

Import Required Libraries: We need several libraries for vectorization (TF-IDF) and for applying machine learning models like Naive Bayes.

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import accuracy_score, classification_report        

Sample data (replace with your own dataset)

documents = ["This is a positive document.", "Negative sentiment detected.",
"Another positive example.", "Negative review here." ]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative        

Split the Data: To train and test the model, split the data into training and testing sets. The train_test_split function automatically splits the dataset.

X_train, X_test, y_train, y_test = train_test_split(documents, labels, 
test_size=0.2, random_state=42)        

TF-IDF Vectorization: Now, we convert the text data into numerical format using TF-IDF.

vectorizer = TfidfVectorizer() 
X_train_tfidf = vectorizer.fit_transform(X_train) 
X_test_tfidf = vectorizer.transform(X_test)        

  • fit_transform is applied to the training data, where the model learns the vocabulary and applies the TF-IDF transformation.
  • transform is applied to the test data using the same vocabulary learned from the training data.
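
To inspect what the vectorizer learned (a small sketch assuming the variables from the snippets above):

print(vectorizer.get_feature_names_out())       # vocabulary learned from the training data
print(X_train_tfidf.shape, X_test_tfidf.shape)  # (number of documents, vocabulary size) for each split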

Train the Naive Bayes Classifier: here I am using the Multinomial Naive Bayes algorithm, which is often used for text classification, but any classifier (logistic regression, SVM, MLP, etc.) could be trained on the same TF-IDF features.

classifier = MultinomialNB() 
classifier.fit(X_train_tfidf, y_train)        

Make Predictions: After training the model, we use it to predict the labels of the test data.

predictions = classifier.predict(X_test_tfidf)        

Evaluate the Model: We evaluate the performance of the model using accuracy score and classification report to see how well the classifier performed.

accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)  
print(f"Accuracy: {accuracy:.2f}") print("Classification Report:\n", report)        

Practical Example?Summary:

Here’s the full code for the practical example again:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data
documents = [
    "This is a positive document.",
    "Negative sentiment detected.",
    "Another positive example.",
    "Negative review here."
]
labels = [1, 0, 1, 0]  # 1 for positive, 0 for negative

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.2, random_state=42)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train the Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

# Make predictions
predictions = classifier.predict(X_test_tfidf)

# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)        

