Natural Language Processing with a Simple Classification Task
Overview
· What is Natural Language Processing?
· The Natural Language Processing pipeline
· A simple classification task with NLP
Text Processing
Feature Extraction
Model Building
Training and Validation
Inference
Introduction:
Ever since the cognitive revolution, communication has been an inevitable part of Homo sapiens. Humans evolved through different ages, and the digital age began to dominate once computers were invented. Nowadays, technology plays an important role in everyone's day-to-day life. Yet despite its dominant role, breaking down and interpreting human language remains a challenging job. Thanks to research in Artificial Intelligence, we have arrived at the fascinating field of Natural Language Processing.
Natural Language Processing, or NLP, is a fascinating field that enables machines to read, understand, and interpret meaning from human languages.
Natural Language Processing Pipeline
How does NLP work? How can machines understand our language?
Humans took millions of years to develop language, but computers are evolving far faster. Back to our basic question: how does this work? In essence, we convert text into numbers, let the computer do its math on those numbers, and then convert the numerical results back into a form convenient for us. It sounds simple, but it is a tedious process, and it is important to understand each stage of the pipeline one step at a time.
The NLP pipeline works in the following way.
First, text processing. Why do we need to process text at all? In simple terms, raw text taken from any source, whether Wikipedia or a social media platform, contains a lot of noise. Text extracted from the internet is never pure: it comes with misspellings, stray characters, and leftover markup from the page's code. Getting perfectly clean data directly from the internet is practically impossible. This is why we need a text processing stage, where we remove everything we consider unimportant to the model. Some of the important text processing steps are splitting sentences into words, lower-casing all the words, stemming, and lemmatization.
Once text processing is finished, we pass the result on to feature extraction, where we apply techniques that represent the words in a numerical format the computer can understand, customized for the model. Common feature extraction techniques include bag of words, TF-IDF, one-hot encoding, word embeddings, word2vec, and GloVe. For our classification task we use word embeddings initialized with GloVe vectors.
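For a concrete feel of two of these techniques, here is a minimal sketch of bag of words and TF-IDF on a toy corpus using scikit-learn; note this snippet is illustrative only and not part of this project's pipeline.

# A minimal sketch of bag of words and TF-IDF on a toy corpus
# (illustrative only; this project uses word embeddings instead).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: each column counts how often a vocabulary word occurs
bow = CountVectorizer().fit_transform(corpus)
print(bow.toarray())

# TF-IDF: counts re-weighted so that words common to every document matter less
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))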
Finally, we send the features to the model, which trains on them and later makes predictions according to the type of task. Remember, this is a cyclic process: if the model isn't working properly, you may go back to feature extraction, and in turn to text processing, until you get the results you want.
Simple Classification with a Twitter Dataset
With this introduction, let's see how to do a basic text classification task. We take a dataset containing tweets from various users, collected from the internet and labelled as hate speech or non-hate speech. We will process this data, extract features, feed it to the model for training, and see how it performs.
For the model we are building, we use one of the most advanced methods in the field of NLP: the Transformer architecture. It was introduced in the paper "Attention Is All You Need", presented by a team of researchers at Google.
We will come back to model building later. To see the complete implementation of this project, readers can view it in my GitHub repository here. First, let's look at how to tackle text processing and feature representation.
Text Processing
Text extracted from the internet comes with noise, and we need a way to get rid of it and present clean, processed text. There are useful libraries available in Python for this, such as the built-in re module for regular expressions and BeautifulSoup. The former finds and replaces text patterns and handles a wide variety of text processing tasks, while the latter removes the HTML tags that come along when data is extracted from the internet.
The following piece of code illustrates this simple processing.
#import the required libraries
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup

def review_to_words(review):
    '''Function that takes in a sentence, removes HTML tags and other symbols.
    Args: review, the sentence as a string
    Returns: cleansed text'''
    text = BeautifulSoup(review, "html.parser").get_text()  # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # keep alphanumerics only, convert to lower case
    return text
The snippet above is a short example of how we can take the noise out of our data. Is this the only solution for text processing? The answer is no; there are many other ways to process text. Another example is building a dictionary with local slang words as keys and their standard forms as values, then replacing them across the dataset with a lambda function. Techniques like stemming and lemmatization take the processing further, as sketched below. This process is not limited; there are countless ways to do it.
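As an illustration, here is a minimal sketch of stemming and lemmatization with NLTK; this library is not used in the project, so treat it purely as an example.

# A minimal sketch of stemming and lemmatization using NLTK
# (illustrative only; this project does not use NLTK).
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better"]:
    # stemming chops suffixes by rule; lemmatization maps to a dictionary form
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos='v'))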
Feature Extraction
The basic idea behind the feature extraction is to give the model with data that it can understand. The transformer model that we are going to build is based on the PyTorch, a deep learning architecture. Therefore, we will use torchtext library which is a part of PyTorch project. This library contains data processing utilities and popular datasets for NLP.
Let's define some terms before we dive into the explanation.
Corpus: a large set of structured texts; in our case, the processed dataset.
Tokenization: in simple terms, the process of breaking a sentence into a list of words.
From the corpus, we build a vocabulary dictionary, assigning an index to each word based on its number of occurrences in the corpus.
Next, we introduce padding and truncating. Padding extends shorter tweets to a fixed sequence length, most commonly by appending zeros, while truncating cuts longer tweets down to that same length; generally, the whole process is simply called padding. We finally load the data into train and test iterators, where each iterator divides the data points into batches of the size we provide.
For example, if you have 100 tweets and a batch size of 10, you end up with an iterator holding 10 batches of data, each batch containing 10 tweets. The iterator simply sends the data points batch by batch during training and testing and collects the results the same way. A small sketch of the padding step appears below. At this stage we have batches of data ready to be sent to the model; all of this is taken care of by the torchtext library.
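To make the padding and truncating step concrete, here is a minimal hand-rolled sketch; in the actual pipeline, torchtext performs this for us.

# A minimal sketch of padding/truncating to a fixed sequence length
# (torchtext handles this automatically in the actual pipeline).
def pad_or_truncate(token_ids, seq_length, pad_id=0):
    if len(token_ids) >= seq_length:
        return token_ids[:seq_length]  # truncate longer sequences
    return token_ids + [pad_id] * (seq_length - len(token_ids))  # pad shorter ones

print(pad_or_truncate([5, 12, 7], 5))           # [5, 12, 7, 0, 0]
print(pad_or_truncate([5, 12, 7, 9, 4, 2], 5))  # [5, 12, 7, 9, 4]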
The full code for feature extraction is as follows.
import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F
import random, tqdm, sys, math, gzip
from torchtext.legacy import data, datasets, vocab
import numpy as np
import spacy

#defining the function to transform the data
def build_vocab(file_path):
    '''Function to take in the preprocessed file and transform it into iterators and a word dictionary
    Args: csv file path
    Returns: iterators, length of vocab, word_dict'''
    #for reproducing the same results
    SEED = 2019
    torch.manual_seed(SEED)
    #Instantiate the fields
    TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True, batch_first=True)
    LABEL = data.LabelField(batch_first=True)
    #since the first column is the index, the tuple is left as None
    fields = [(None, None), ('tweet', TEXT), ('label', LABEL)]
    #load the file and build the torchtext dataset
    training_data = data.TabularDataset(path=file_path, format='csv', fields=fields, skip_header=True)
    #split the dataset into train and valid
    train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.seed(SEED))
    #build the vocabulary
    TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
    LABEL.build_vocab(train_data)
    #check whether cuda is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    #set the batch size
    BATCH_SIZE = 64
    #load the iterators
    train_iterator, valid_iterator = data.BucketIterator.splits(
        (train_data, valid_data),
        batch_size=BATCH_SIZE,
        sort_key=lambda x: len(x.tweet),
        sort_within_batch=True,
        device=device)
    len_text_vocab = len(TEXT.vocab)
    word_dict = TEXT.vocab.stoi
    #return the required objects
    return train_iterator, valid_iterator, len_text_vocab, word_dict
To explain the above code briefly, we first define two kinds of field objects.
Field: the field for the text column of the dataset, used to specify its pre-processing steps.
LabelField: a special case of the Field object, used only for the labels in classification tasks.
#Instantiate the fields
TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True, batch_first=True)
LABEL = data.LabelField(batch_first=True)
Now we instantiate the fields and create a list of tuples, in which each tuple pairs a column name with its field object, the text field followed by the label field shown above. The tuples are arranged in accordance with the columns of the CSV file. Since the first column of my file is an index, I specified the tuple (None, None) to ignore that column.
fields = [(None, None), ('tweet', TEXT), ('label', LABEL)]
Once we are done instantiating the fields, we load the pre-processed dataset with torchtext's TabularDataset function and the specified parameters. Then it is time to split the dataset into train and validation sets.
#load the file and build the torchtext dataset
training_data = data.TabularDataset(path=file_path, format='csv', fields=fields, skip_header=True)
#split the dataset into train and valid
train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.seed(SEED))
The next step is to build the vocabulary for the text and convert the words into integer sequences. The vocabulary contains the unique words in the entire text, and each unique word is assigned an index. The relevant parameters are listed below.
Parameters:
1. min_freq: words whose frequency is below the specified value are dropped from the vocabulary and mapped to the unknown token.
2. Two special tokens, unknown (<unk>) and padding (<pad>), are added to the vocabulary automatically.
#build the vocabulary
TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)
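Once built, the vocabulary can be inspected directly; here is a quick sketch (the printed values depend on your data).

# quick inspection of the vocabulary we just built (output depends on your data)
print(len(TEXT.vocab))                   # vocabulary size
print(TEXT.vocab.freqs.most_common(5))   # most frequent words in the corpus
print(TEXT.vocab.stoi['<pad>'])          # index of the padding token (1 by default)
print(LABEL.vocab.stoi)                  # label-to-index mapping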
Now we prepare the batches for training the model. BucketIterator forms batches in such a way that a minimum amount of padding is required. Once this is done, the function returns the train and test iterator objects, the word dictionary, and the length of the vocabulary; the last two will be used later.
#load the iterators
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.tweet),
    sort_within_batch=True,
    device=device)
len_text_vocab = len(TEXT.vocab)
word_dict = TEXT.vocab.stoi
#return the required objects
return train_iterator, valid_iterator, len_text_vocab, word_dict
With this, feature extraction is done. We have successfully tokenized the words, built the vocabulary, and loaded everything into train and test iterators.
Model Building
I assume readers have a basic understanding of deep learning and PyTorch to follow the model building. It is now time to build the architecture for binary classification; PyTorch is used to build the transformer model. The schematic diagram of the transformer architecture is as follows.
For the classification task, the decoder part is ignored, as it is mainly used for sequence-to-sequence models. We are building a sequence-to-label model, for which we only need the encoder: a simple self-attention architecture embedded in a stack of transformer blocks, as deep as we require, turning it into multi-head attention, which in turn is embedded in the classification transformer architecture.
The following explanation is the simplest version of the model. A detailed explanation can be found here.
The classification transformer model is constructed with the embedding size, the depth (number of transformer blocks), the sequence length of the incoming tensors, the vocabulary size (the reason we returned it from feature extraction), a max-pooling flag, and the dropout rate.
We introduce two embedding layers, one for token embeddings and one for positional embeddings. The positional embedding is very important here, as it encodes the position of each word in the sequence, a specialty of transformers. The network sums these two embeddings and sends the result through the transformer blocks.
Each transformer block takes in the embeddings, the sequence length, and the dropout rate from the classification transformer. It also has other components, such as a widened hidden layer and a ReLU activation. Inside each block, the embeddings are sent through a self-attention layer, with the attention heads running in parallel.
The self-attention layer takes the embeddings and converts them into key, query, and value transformations of those same embeddings. The result then passes through normalization layers and is fed forward through the hidden layer and the ReLU activation. The outputs of the attention heads are concatenated, and after the final transformer block the result is max-pooled and softmaxed to give us the prediction. The code here contains only the classification part; the self-attention and transformer block can be viewed in the repository, and a simplified sketch of the attention computation follows the classifier code below.
class CTransformer(nn.Module):
    """
    Transformer for classifying sequences
    """
    def __init__(self, emb, heads, depth, seq_length, num_tokens, num_classes, max_pool=True, dropout=0.0, wide=False):
        """
        emb: Embedding dimension
        heads: nr. of attention heads
        depth: Number of transformer blocks
        seq_length: Expected maximum sequence length
        num_tokens: Number of tokens (usually words) in the vocabulary
        num_classes: Number of classes.
        max_pool: If true, use global max pooling in the last layer. If false, use global
                  average pooling.
        """
        super().__init__()
        self.num_tokens, self.max_pool = num_tokens, max_pool
        self.token_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=num_tokens)
        self.pos_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=seq_length)
        tblocks = []
        for i in range(depth):
            tblocks.append(
                TransformerBlock(emb=emb, heads=heads, seq_length=seq_length, mask=False, dropout=dropout))
        self.tblocks = nn.Sequential(*tblocks)
        self.toprobs = nn.Linear(emb, num_classes)
        self.do = nn.Dropout(dropout)

    def forward(self, x):
        """
        x: A (batch, sequence length) integer tensor of token indices.
        return: predicted log-probability vector over the classes for each sequence in the batch.
        """
        tokens = self.token_embedding(x)
        b, t, e = tokens.size()
        # generate the position indices on the same device as the input
        positions = self.pos_embedding(torch.arange(t, device=x.device))[None, :, :].expand(b, t, e)
        x = tokens + positions
        x = self.do(x)
        x = self.tblocks(x)
        x = x.max(dim=1)[0] if self.max_pool else x.mean(dim=1)  # pool over the time dimension
        x = self.toprobs(x)
        return F.log_softmax(x, dim=1)
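Since the self-attention and transformer block code lives in the repository, here is a simplified single-head sketch of the core attention computation they rely on; it is illustrative only, not the repository's exact multi-head implementation.

# A simplified, single-head self-attention sketch (illustrative only;
# the repository's implementation is multi-headed and more involved).
import torch
import torch.nn.functional as F
from torch import nn

class SimpleSelfAttention(nn.Module):
    def __init__(self, emb):
        super().__init__()
        # linear maps producing the key, query, and value transformations
        self.tokeys = nn.Linear(emb, emb, bias=False)
        self.toqueries = nn.Linear(emb, emb, bias=False)
        self.tovalues = nn.Linear(emb, emb, bias=False)

    def forward(self, x):  # x: (batch, seq_length, emb)
        keys, queries, values = self.tokeys(x), self.toqueries(x), self.tovalues(x)
        # raw attention weights, scaled by sqrt(emb) for stable gradients
        dot = torch.bmm(queries, keys.transpose(1, 2)) / (x.size(2) ** 0.5)
        dot = F.softmax(dot, dim=2)    # normalize the weights over the sequence
        return torch.bmm(dot, values)  # weighted sum of the value vectors

out = SimpleSelfAttention(emb=128)(torch.randn(4, 10, 128))  # shape (4, 10, 128)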
Training and Validation:
Now we are ready to train our model. We load the batches of our training data, split each batch into label and tweet, send the tweet through the model, get the output (the softmaxed value), compare it to the label, and calculate the loss.
We repeat this process for the number of epochs, that is, the number of times we want to run the model over the data, and get the training accuracy at the end of each epoch. For validation, we send in the batches of data, collect the predicted and actual label batches in lists, and convert them into arrays so we can use them for the evaluation metrics. At the end of validation, we get the accuracy results. The code is as follows.
# defining a function for the train loop
def train(train_loader, test_loader, num_epoch, opt):
    '''The function that takes in the iterators, number of epochs and optimizer and returns the
    classification loss and accuracy metrics.
    Args: train and test iterators, number of epochs, optimizer
    Returns: predicted labels, actual labels'''
    seen = 0
    for e in range(num_epoch):  # for each epoch in the range specified
        print(f'\n epoch {e}')  # print the current epoch
        # initialize the metrics every epoch
        epoch_loss = 0
        total, correction = 0.0, 0.0
        # put the model in training mode
        model.train(True)
        # load the batches
        for batch in tqdm.tqdm(train_loader):
            opt.zero_grad()
            # specify the input and label
            input = batch.tweet[0]
            label = batch.label
            # send the input tensors to the model
            out = model(input)
            # get the predicted class
            output = out.argmax(dim=1)
            # calculate the loss and backpropagate
            loss = F.nll_loss(out, label)
            loss.backward()
            opt.step()
            seen += input.size(0)
            # accumulate loss and accuracy
            total += float(input.size(0))
            correction += float((label == output).sum().item())
            epoch_loss += loss.item()
        print('classification/train-loss', float(loss.item()), seen)
        accuracy = correction / total
        print(f'-- training accuracy {accuracy*100}')
        with torch.no_grad():
            model.train(False)  # evaluation mode, no gradient updates
            tot, cor = 0.0, 0.0  # metrics for calculating the accuracy
            collect_pred = []   # list collecting the predicted labels
            collect_label = []  # list collecting the actual labels
            for batch in tqdm.tqdm(test_loader):
                input = batch.tweet[0]
                label = batch.label
                out = model(input).argmax(dim=1)
                collect_pred.append(out.cpu().detach().numpy())      # append the predictions
                collect_label.append(label.cpu().detach().numpy())   # append the actual labels
                tot += float(input.size(0))                          # total number of inputs
                cor += float((label == out).sum().item())            # the correct ones
            acc = cor / tot  # accuracy
            print(f'-- test validation accuracy {acc*100}')
    torch.save(model.state_dict(), 'saved_weights2.pt')  # save the model
    print("The model is saved")
    # concatenate the per-batch arrays of predictions and labels
    pred, label = np.concatenate(collect_pred, axis=0), np.concatenate(collect_label, axis=0)
    return pred, label  # return the predictions and labels for the evaluation
Now we are ready with the model, the training and validation loop, and most importantly the batches of data. Once all of this is done, we set up the hyperparameters, the embedding size, the depth, the sequence length and so on, to send through the model, and finally pass everything into the train loop.
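Before that, a quick note on where train_it and test_it come from: they are the iterators returned by the build_vocab function defined earlier. A minimal sketch of the wiring, where 'processed_tweets.csv' is a placeholder path for your pre-processed dataset:

#recover the iterators and vocabulary objects from feature extraction
#('processed_tweets.csv' is a placeholder for your pre-processed file)
train_it, test_it, len_text_vocab, word_dict = build_vocab('processed_tweets.csv')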
#set up the hyperparameters
emb = 128
heads = 8
depth = 6
seq_length = 512
num_tokens = len_text_vocab
NUM_CLS = 2

#Build the model
model = CTransformer(emb=emb,
                     heads=heads,
                     depth=depth,
                     seq_length=seq_length,
                     num_tokens=num_tokens,
                     num_classes=NUM_CLS,
                     max_pool=True)

#push the model to cuda if available
if torch.cuda.is_available():
    model.cuda()

opt = torch.optim.Adam(lr=0.0001, params=model.parameters())

#call the train loop
pred, label = train(train_it, test_it, 20, opt)
At the end of training and validation, we reached a training accuracy of up to 88% and a validation accuracy of up to 80%. The other metrics, such as precision, recall, and F1 score, are given in detail in the GitHub repository here; they can also be computed from the predictions and labels the train loop returns, as sketched below.
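A minimal sketch of computing those metrics with scikit-learn, using the pred and label arrays returned by the train loop:

# A minimal sketch of the evaluation metrics, computed from the arrays
# returned by the train loop (scikit-learn is assumed to be installed).
from sklearn.metrics import precision_score, recall_score, f1_score

print('precision:', precision_score(label, pred))
print('recall   :', recall_score(label, pred))
print('f1 score :', f1_score(label, pred))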
Model Inference
I have defined a simple model inference to see how the model would behave in the real world, assuming Twitter were using this model and someone tweeted hateful content. Label 0 stands for non-hate and label 1 for hate.
#check whether cuda is available (the same device the iterators were built on)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#load the spaCy English model used for tokenization (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')

#define a function for prediction
def predict(model, sentence):
    '''Function that gives us the prediction for the passed sentence
    Args: model, sentence - a string
    Returns: predicted label'''
    sentence = review_to_words(sentence)  # clean the raw text
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize the sentence
    indexed = [word_dict[t] for t in tokenized]  # convert to an integer sequence
    tensor = torch.LongTensor(indexed).to(device)  # build the tensor on the right device
    tensor = tensor.unsqueeze(1).T  # reshape into (batch, number of words)
    prediction = model(tensor).argmax(dim=1)  # pick the class with the highest score
    return prediction.item()
#Let's use the model to make a prediction
x = predict(model, "how the #altright uses & insecurity to lure men into #whitesupremacy")
print(x)
1
The sentence above contains hateful content, and our model correctly predicted that it does. Anyone can view the full code on GitHub here.
Conclusion
When I started this article, I said that computers can understand our language and are learning far faster than humans. The model trained above correctly predicted whether a tweet was hateful or non-hateful about 80% of the time; it predicted the remaining 20% incorrectly. This is a cyclic process, and I must go back to text processing to apply stemming and lemmatization, replace slang words with standard English words, and balance the labels, which is very important.
A little fine-tuning of the model and some data balancing should give an efficient result in identifying hate speech. Until then, I will keep working on improving the model, which is my ultimate goal. I'll soon come up with a new article focusing on data balancing techniques.
See you, until next time!