Natural Language Processing with a Simple Classification Task
Overview
· What is Natural Language Processing?
· The Natural Language Processing pipeline
· A simple classification task with NLP
Text Processing
Feature Extraction
Model Building
Training and Validation
Inference
Introduction:
Ever since the cognitive revolution, communication has been an inevitable part of Homo sapiens. Humans evolved through different ages, and the digital age began to dominate once computers were invented. Nowadays, technology plays an important role in everyone's day-to-day life. Yet despite its dominant role, breaking down and interpreting human language remains a challenging job. Thanks to research in Artificial Intelligence, we have arrived at the fascinating field of Natural Language Processing.
Natural Language Processing, or NLP, is a fascinating field that enables machines to read, understand, and interpret meaning from human languages.
Natural Language Processing Pipeline
How does NLP work? How can machines understand our language?
Humans took millions of years to develop language, but computers are evolving far faster. Back to our basic question: how does this work? In essence, we convert text into numbers, let the computer do its math on those numbers, and then convert the numerical results back into a form convenient for us. It sounds simple, but it is a tedious process, and it is important to understand each stage of the pipeline one step at a time.
The NLP pipeline works in the following way.
First, text processing. Why do we need to process text at all? In simple terms, raw text taken from any source, whether Wikipedia or a social media platform, contains a lot of noise. Text extracted from the internet is never pure: it comes with misspellings, stray characters, and leftover markup from the page's code. Getting perfectly clean data directly from the internet is practically impossible. This is why we need a text processing stage, where we remove everything we consider unimportant to the model. Some of the important text processing steps are splitting sentences into words, lower-casing all the words, stemming, and lemmatization.
Once text processing is finished, we pass the result on to feature extraction, where we apply techniques that represent the words in a numerical format the computer can understand, customized for the model. Common feature extraction techniques include bag of words, TF-IDF, one-hot encoding, word embeddings, word2vec, and GloVe. For our classification task we use word embeddings initialized with GloVe vectors.
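For a concrete feel of two of these techniques, here is a minimal sketch of bag of words and TF-IDF on a toy corpus using scikit-learn; note this snippet is illustrative only and not part of this project's pipeline.

# A minimal sketch of bag of words and TF-IDF on a toy corpus
# (illustrative only; this project uses word embeddings instead).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of words: each column counts how often a vocabulary word occurs
bow = CountVectorizer().fit_transform(corpus)
print(bow.toarray())

# TF-IDF: counts re-weighted so that words common to every document matter less
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))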
Finally, we send the features to the model, which trains on them and later makes predictions according to the type of task. Remember, this is a cyclic process: if the model isn't working properly, you may go back to feature extraction, and in turn to text processing, until you get the results you want.
Simple Classification with a Twitter Dataset
With this introduction, let's see how to do a basic text classification task. We take a dataset containing tweets from various users, collected from the internet and labelled as hate speech or non-hate speech. We will process this data, extract features, feed it to the model for training, and see how it performs.
For the model we are building, we use one of the most advanced methods in the field of NLP: the Transformer architecture. It was introduced in the paper "Attention Is All You Need", presented by a team of researchers at Google.
We will come back to model building later. To see the complete implementation of this project, readers can view it in my GitHub repository here. First, let's look at how to tackle text processing and feature representation.
Text Processing
Text extracted from the internet comes with noise, and we need a way to get rid of it and present clean, processed text. There are useful libraries available in Python for this, such as the built-in re module for regular expressions and BeautifulSoup. The former finds and replaces text patterns and handles a wide variety of text processing tasks, while the latter removes the HTML tags that come along when data is extracted from the internet.
The following piece of code illustrates this simple processing.
#import the required libraries
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup

def review_to_words(review):
    '''Function that takes in a sentence, removes HTML tags and other symbols.
    Args: review, the sentence as a string
    Returns: cleansed text'''
    text = BeautifulSoup(review, "html.parser").get_text()  # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # keep alphanumerics only, convert to lower case
    return text
The snippet above is a short example of how we can take the noise out of our data. Is this the only solution for text processing? The answer is no; there are many other ways to process text. Another example is building a dictionary with local slang words as keys and their standard forms as values, then replacing them across the dataset with a lambda function. Techniques like stemming and lemmatization take the processing further, as sketched below. This process is not limited; there are countless ways to do it.
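As an illustration, here is a minimal sketch of stemming and lemmatization with NLTK; this library is not used in the project, so treat it purely as an example.

# A minimal sketch of stemming and lemmatization using NLTK
# (illustrative only; this project does not use NLTK).
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "studies", "better"]:
    # stemming chops suffixes by rule; lemmatization maps to a dictionary form
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos='v'))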
Feature Extraction
The basic idea behind the feature extraction is to give the model with data that it can understand. The transformer model that we are going to build is based on the PyTorch, a deep learning architecture. Therefore, we will use torchtext library which is a part of PyTorch project. This library contains data processing utilities and popular datasets for NLP.
Let's define some terms before we dive into the explanation.
Corpus: a large set of structured texts; in our case, the processed dataset.
Tokenization: in simple terms, the process of breaking a sentence into a list of words.
From the corpus, we build a vocabulary dictionary, assigning an index to each word based on its number of occurrences in the corpus.
Next, we introduce padding and truncating. Padding extends shorter tweets to a fixed sequence length, most commonly by appending zeros, while truncating cuts longer tweets down to that same length; generally, the whole process is simply called padding. We finally load the data into train and test iterators, where each iterator divides the data points into batches of the size we provide.
For example, if you have 100 tweets and a batch size of 10, you end up with an iterator holding 10 batches of data, each batch containing 10 tweets. The iterator simply sends the data points batch by batch during training and testing and collects the results the same way. A small sketch of the padding step appears below. At this stage we have batches of data ready to be sent to the model; all of this is taken care of by the torchtext library.
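To make the padding and truncating step concrete, here is a minimal hand-rolled sketch; in the actual pipeline, torchtext performs this for us.

# A minimal sketch of padding/truncating to a fixed sequence length
# (torchtext handles this automatically in the actual pipeline).
def pad_or_truncate(token_ids, seq_length, pad_id=0):
    if len(token_ids) >= seq_length:
        return token_ids[:seq_length]  # truncate longer sequences
    return token_ids + [pad_id] * (seq_length - len(token_ids))  # pad shorter ones

print(pad_or_truncate([5, 12, 7], 5))           # [5, 12, 7, 0, 0]
print(pad_or_truncate([5, 12, 7, 9, 4, 2], 5))  # [5, 12, 7, 9, 4]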
The full code for feature extraction is as follows.
import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F
import random, tqdm, sys, math, gzip
from torchtext.legacy import data, datasets, vocab
import numpy as np
import spacy

#defining the function to transform the data
def build_vocab(file_path):
    '''Function to take in the preprocessed file and transform it into iterators and a word dictionary
    Args: csv file path
    Returns: iterators, length of vocab, word_dict'''
    #for reproducing the same results
    SEED = 2019
    torch.manual_seed(SEED)
    #Instantiate the fields
    TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True, batch_first=True)
    LABEL = data.LabelField(batch_first=True)
    #since the first column is the index, the tuple is left as None
    fields = [(None, None), ('tweet', TEXT), ('label', LABEL)]
    #load the file and build the torchtext dataset
    training_data = data.TabularDataset(path=file_path, format='csv', fields=fields, skip_header=True)
    #split the dataset into train and valid
    train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.seed(SEED))
    #build the vocabulary
    TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
    LABEL.build_vocab(train_data)
    #check whether cuda is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    #set the batch size
    BATCH_SIZE = 64
    #load the iterators
    train_iterator, valid_iterator = data.BucketIterator.splits(
        (train_data, valid_data),
        batch_size=BATCH_SIZE,
        sort_key=lambda x: len(x.tweet),
        sort_within_batch=True,
        device=device)
    len_text_vocab = len(TEXT.vocab)
    word_dict = TEXT.vocab.stoi
    #return the required objects
    return train_iterator, valid_iterator, len_text_vocab, word_dict
To explain the above code briefly, we first define two kinds of field objects.
Field: the field for the text column of the dataset, used to specify its pre-processing steps.
LabelField: a special case of the Field object, used only for the labels in classification tasks.
#Instantiate the fields
TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True, batch_first=True)
LABEL = data.LabelField(batch_first=True)
Now we instantiate the fields and create a list of tuples, in which each tuple pairs a column name with its field object, the text field followed by the label field shown above. The tuples are arranged in accordance with the columns of the CSV file. Since the first column of my file is an index, I specified the tuple (None, None) to ignore that column.
fields = [(None, None), ('tweet', TEXT), ('label', LABEL)]
Once we are done instantiating the fields, we load the pre-processed dataset with torchtext's TabularDataset function and the specified parameters. Then it is time to split the dataset into train and validation sets.
#load the file and build the torchtext dataset
training_data = data.TabularDataset(path=file_path, format='csv', fields=fields, skip_header=True)
#split the dataset into train and valid
train_data, valid_data = training_data.split(split_ratio=0.7, random_state=random.seed(SEED))
The next step is to build the vocabulary for the text and convert the words into integer sequences. The vocabulary contains the unique words in the entire text, and each unique word is assigned an index. The relevant parameters are listed below.
Parameters:
1. min_freq: words whose frequency is below the specified value are dropped from the vocabulary and mapped to the unknown token.
2. Two special tokens, unknown (<unk>) and padding (<pad>), are added to the vocabulary automatically.
#build the vocabulary
TEXT.build_vocab(train_data, min_freq=3, vectors="glove.6B.100d")
LABEL.build_vocab(train_data)
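Once built, the vocabulary can be inspected directly; here is a quick sketch (the printed values depend on your data).

# quick inspection of the vocabulary we just built (output depends on your data)
print(len(TEXT.vocab))                   # vocabulary size
print(TEXT.vocab.freqs.most_common(5))   # most frequent words in the corpus
print(TEXT.vocab.stoi['<pad>'])          # index of the padding token (1 by default)
print(LABEL.vocab.stoi)                  # label-to-index mapping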
Now we prepare the batches for training the model. BucketIterator forms batches in such a way that a minimum amount of padding is required. Once this is done, the function returns the train and test iterator objects, the word dictionary, and the length of the vocabulary; the last two will be used later.
#load the iterators
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.tweet),
    sort_within_batch=True,
    device=device)
len_text_vocab = len(TEXT.vocab)
word_dict = TEXT.vocab.stoi
#return the required objects
return train_iterator, valid_iterator, len_text_vocab, word_dict
With this, feature extraction is done. We have successfully tokenized the words, built the vocabulary, and loaded everything into train and test iterators.
Model Building
I assume readers have a basic understanding of deep learning and PyTorch to follow the model building. It is now time to build the architecture for binary classification; PyTorch is used to build the transformer model. The schematic diagram of the transformer architecture is as follows.
For the classification task, the decoder part is ignored, as it is mainly used for sequence-to-sequence models. We are building a sequence-to-label model, for which we only need the encoder: a simple self-attention architecture embedded in a stack of transformer blocks, as deep as we require, turning it into multi-head attention, which in turn is embedded in the classification transformer architecture.
The following explanation is the simplest version of the model. A detailed explanation can be found here.
The classification transformer model is constructed with the embedding size, the depth (number of transformer blocks), the sequence length of the incoming tensors, the vocabulary size (the reason we returned it from feature extraction), a max-pooling flag, and the dropout rate.
We introduce two embedding layers, one for token embeddings and one for positional embeddings. The positional embedding is very important here, as it encodes the position of each word in the sequence, a specialty of transformers. The network sums these two embeddings and sends the result through the transformer blocks.
Each transformer block takes in the embeddings, the sequence length, and the dropout rate from the classification transformer. It also has other components, such as a widened hidden layer and a ReLU activation. Inside each block, the embeddings are sent through a self-attention layer, with the attention heads running in parallel.
The self-attention layer takes the embeddings and converts them into key, query, and value transformations of those same embeddings. The result then passes through normalization layers and is fed forward through the hidden layer and the ReLU activation. The outputs of the attention heads are concatenated, and after the final transformer block the result is max-pooled and softmaxed to give us the prediction. The code here contains only the classification part; the self-attention and transformer block can be viewed in the repository, and a simplified sketch of the attention computation follows the classifier code below.
class CTransformer(nn.Module):
    """
    Transformer for classifying sequences
    """
    def __init__(self, emb, heads, depth, seq_length, num_tokens, num_classes, max_pool=True, dropout=0.0, wide=False):
        """
        emb: Embedding dimension
        heads: nr. of attention heads
        depth: Number of transformer blocks
        seq_length: Expected maximum sequence length
        num_tokens: Number of tokens (usually words) in the vocabulary
        num_classes: Number of classes.
        max_pool: If true, use global max pooling in the last layer. If false, use global
                  average pooling.
        """
        super().__init__()
        self.num_tokens, self.max_pool = num_tokens, max_pool
        self.token_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=num_tokens)
        self.pos_embedding = nn.Embedding(embedding_dim=emb, num_embeddings=seq_length)
        tblocks = []
        for i in range(depth):
            tblocks.append(
                TransformerBlock(emb=emb, heads=heads, seq_length=seq_length, mask=False, dropout=dropout))
        self.tblocks = nn.Sequential(*tblocks)
        self.toprobs = nn.Linear(emb, num_classes)
        self.do = nn.Dropout(dropout)

    def forward(self, x):
        """
        x: A (batch, sequence length) integer tensor of token indices.
        return: predicted log-probability vector over the classes for each sequence in the batch.
        """
        tokens = self.token_embedding(x)
        b, t, e = tokens.size()
        # generate the position indices on the same device as the input
        positions = self.pos_embedding(torch.arange(t, device=x.device))[None, :, :].expand(b, t, e)
        x = tokens + positions
        x = self.do(x)
        x = self.tblocks(x)
        x = x.max(dim=1)[0] if self.max_pool else x.mean(dim=1)  # pool over the time dimension
        x = self.toprobs(x)
        return F.log_softmax(x, dim=1)
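Since the self-attention and transformer block code lives in the repository, here is a simplified single-head sketch of the core attention computation they rely on; it is illustrative only, not the repository's exact multi-head implementation.

# A simplified, single-head self-attention sketch (illustrative only;
# the repository's implementation is multi-headed and more involved).
import torch
import torch.nn.functional as F
from torch import nn

class SimpleSelfAttention(nn.Module):
    def __init__(self, emb):
        super().__init__()
        # linear maps producing the key, query, and value transformations
        self.tokeys = nn.Linear(emb, emb, bias=False)
        self.toqueries = nn.Linear(emb, emb, bias=False)
        self.tovalues = nn.Linear(emb, emb, bias=False)

    def forward(self, x):  # x: (batch, seq_length, emb)
        keys, queries, values = self.tokeys(x), self.toqueries(x), self.tovalues(x)
        # raw attention weights, scaled by sqrt(emb) for stable gradients
        dot = torch.bmm(queries, keys.transpose(1, 2)) / (x.size(2) ** 0.5)
        dot = F.softmax(dot, dim=2)    # normalize the weights over the sequence
        return torch.bmm(dot, values)  # weighted sum of the value vectors

out = SimpleSelfAttention(emb=128)(torch.randn(4, 10, 128))  # shape (4, 10, 128)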
Training and Validation:
Now we are ready to train our model. We load the batches of our training data, split each batch into label and tweet, send the tweet through the model, get the output (the softmaxed value), compare it to the label, and calculate the loss.
We repeat this process for the number of epochs, that is, the number of times we want to run the model over the data, and get the training accuracy at the end of each epoch. For validation, we send in the batches of data, collect the predicted and actual label batches in lists, and convert them into arrays so we can use them for the evaluation metrics. At the end of validation, we get the accuracy results. The code is as follows.
# defining a function for the train loop
def train(train_loader, test_loader, num_epoch, opt):
    '''The function that takes in the iterators, number of epochs and optimizer and returns the
    classification loss and accuracy metrics.
    Args: train and test iterators, number of epochs, optimizer
    Returns: predicted labels, actual labels'''
    seen = 0
    for e in range(num_epoch):  # for each epoch in the range specified
        print(f'\n epoch {e}')  # print the current epoch
        # initialize the metrics every epoch
        epoch_loss = 0
        total, correction = 0.0, 0.0
        # put the model in training mode
        model.train(True)
        # load the batches
        for batch in tqdm.tqdm(train_loader):
            opt.zero_grad()
            # specify the input and label
            input = batch.tweet[0]
            label = batch.label
            # send the input tensors to the model
            out = model(input)
            # get the predicted class
            output = out.argmax(dim=1)
            # calculate the loss and backpropagate
            loss = F.nll_loss(out, label)
            loss.backward()
            opt.step()
            seen += input.size(0)
            # accumulate loss and accuracy
            total += float(input.size(0))
            correction += float((label == output).sum().item())
            epoch_loss += loss.item()
        print('classification/train-loss', float(loss.item()), seen)
        accuracy = correction / total
        print(f'-- training accuracy {accuracy*100}')
        with torch.no_grad():
            model.train(False)  # evaluation mode, no gradient updates
            tot, cor = 0.0, 0.0  # metrics for calculating the accuracy
            collect_pred = []   # list collecting the predicted labels
            collect_label = []  # list collecting the actual labels
            for batch in tqdm.tqdm(test_loader):
                input = batch.tweet[0]
                label = batch.label
                out = model(input).argmax(dim=1)
                collect_pred.append(out.cpu().detach().numpy())      # append the predictions
                collect_label.append(label.cpu().detach().numpy())   # append the actual labels
                tot += float(input.size(0))                          # total number of inputs
                cor += float((label == out).sum().item())            # the correct ones
            acc = cor / tot  # accuracy
            print(f'-- test validation accuracy {acc*100}')
    torch.save(model.state_dict(), 'saved_weights2.pt')  # save the model
    print("The model is saved")
    # concatenate the per-batch arrays of predictions and labels
    pred, label = np.concatenate(collect_pred, axis=0), np.concatenate(collect_label, axis=0)
    return pred, label  # return the predictions and labels for the evaluation
Now we are ready with the model, the training and validation loop, and most importantly the batches of data. Once all of this is done, we set up the hyperparameters, the embedding size, the depth, the sequence length and so on, to send through the model, and finally pass everything into the train loop.
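Before that, a quick note on where train_it and test_it come from: they are the iterators returned by the build_vocab function defined earlier. A minimal sketch of the wiring, where 'processed_tweets.csv' is a placeholder path for your pre-processed dataset:

#recover the iterators and vocabulary objects from feature extraction
#('processed_tweets.csv' is a placeholder for your pre-processed file)
train_it, test_it, len_text_vocab, word_dict = build_vocab('processed_tweets.csv')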
#set up the hyperparameters
emb = 128
heads = 8
depth = 6
seq_length = 512
num_tokens = len_text_vocab
NUM_CLS = 2

#Build the model
model = CTransformer(emb=emb,
                     heads=heads,
                     depth=depth,
                     seq_length=seq_length,
                     num_tokens=num_tokens,
                     num_classes=NUM_CLS,
                     max_pool=True)

#push the model to cuda if available
if torch.cuda.is_available():
    model.cuda()

opt = torch.optim.Adam(lr=0.0001, params=model.parameters())

#call the train loop
pred, label = train(train_it, test_it, 20, opt)
At the end of training and validation, we reached a training accuracy of up to 88% and a validation accuracy of up to 80%. The other metrics, such as precision, recall, and F1 score, are given in detail in the GitHub repository here; they can also be computed from the predictions and labels the train loop returns, as sketched below.
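A minimal sketch of computing those metrics with scikit-learn, using the pred and label arrays returned by the train loop:

# A minimal sketch of the evaluation metrics, computed from the arrays
# returned by the train loop (scikit-learn is assumed to be installed).
from sklearn.metrics import precision_score, recall_score, f1_score

print('precision:', precision_score(label, pred))
print('recall   :', recall_score(label, pred))
print('f1 score :', f1_score(label, pred))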
Model Inference
I have defined a simple model inference to see how the model would behave in the real world, assuming Twitter were using this model and someone tweeted hateful content. Label 0 stands for non-hate and label 1 for hate.
#check whether cuda is available (the same device the iterators were built on)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#load the spaCy English model used for tokenization (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')

#define a function for prediction
def predict(model, sentence):
    '''Function that gives us the prediction for the passed sentence
    Args: model, sentence - a string
    Returns: predicted label'''
    sentence = review_to_words(sentence)  # clean the raw text
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize the sentence
    indexed = [word_dict[t] for t in tokenized]  # convert to an integer sequence
    tensor = torch.LongTensor(indexed).to(device)  # build the tensor on the right device
    tensor = tensor.unsqueeze(1).T  # reshape into (batch, number of words)
    prediction = model(tensor).argmax(dim=1)  # pick the class with the highest score
    return prediction.item()
#Let's use the model to make a prediction
x = predict(model, "how the #altright uses & insecurity to lure men into #whitesupremacy")
print(x)
1
The sentence above contains hateful content, and our model correctly predicted that it does. Anyone can view the full code on GitHub here.
Conclusion
When I started this article, I said that computers can understand our language and are learning far faster than humans. The model trained above correctly predicted whether a tweet was hateful or non-hateful about 80% of the time; it predicted the remaining 20% incorrectly. This is a cyclic process, and I must go back to text processing to apply stemming and lemmatization, replace slang words with standard English words, and balance the labels, which is very important.
A little fine-tuning of the model and some data balancing should give an efficient result in identifying hate speech. Until then, I will keep working on improving the model, which is my ultimate goal. I'll soon come up with a new article focusing on data balancing techniques.
See you, until next time!