How to crack your first Natural Language Processing (NLP) Hackathon?

In this article, I will discuss my recent venture into the realm of natural language processing: the Predict the News Category Hackathon, a data science hackathon organized by MachineHack. I will walk through how I approached the solution step by step.

What does this article cover?

  • Text wrangling and pre-processing
  • Part-of-speech (POS) tagging
  • Word lemmatization and TF-IDF vectorization
  • K-fold cross-validation
  • Principal Component Analysis (PCA)


Problem Statement

A newspaper contains various sections such as politics, sports, movies, etc. For a long time this sorting of news into sections was done manually, but in the digital age machine learning can do the work far more efficiently. In this hackathon, the task was to use NLP techniques to predict which genre or category a news article falls into.

Size of training set: 7,628 records

Size of test set: 2,748 records

FEATURES:

STORY:  A part of the main content of the article to be published as a piece of news.

SECTION: The genre/category the STORY falls in.

There are four distinct sections into which each story may fall. The sections are labelled as follows:

Politics: 0

Technology: 1

Entertainment: 2

Business: 3

This is what the raw data looks like: the STORY text alongside its SECTION label.
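As a quick start, a minimal loading-and-inspection sketch might look like the following; the file names Data_Train.xlsx and Data_Test.xlsx are assumptions about how the hackathon data is shipped, so adjust them to the files you actually receive:

```python
import pandas as pd

# File names are assumed; adjust them to the files provided by MachineHack
train = pd.read_excel('Data_Train.xlsx')
test = pd.read_excel('Data_Test.xlsx')

print(train.shape, test.shape)  # expected roughly (7628, 2) and (2748, 1)
print(train.head())             # STORY text alongside its SECTION label
```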


Text Wrangling and Pre-processing

A good rule of thumb for any data science project is to clean the data first. This removes irrelevant noise, helps prevent the model from overfitting and lets it focus on the important features. In NLP problems, noise takes the form of meaningless stopwords, punctuation marks and symbols like $, &, *, @, /, etc. We can use the re.sub function to match these patterns with a regular expression and substitute them with a space. Before we begin pre-processing, let us import all the libraries required to carry out this process.

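A minimal set of imports covering the steps in this article, assuming pandas and NLTK are installed, might look like this:

```python
import re
import pandas as pd
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used in this walk-through
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
```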

Below is the cleaning step I used to eliminate noise from the data.

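A minimal sketch of such a cleaning step, assuming the stories live in a pandas column named STORY (the helper name clean_text is my own), might look like:

```python
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Keep only letters; replace digits, punctuation and symbols with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Lower-case, split on whitespace and drop stopwords like "but", "because", "and"
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(words)

# The same function would be applied to the test stories before prediction
train['STORY'] = train['STORY'].apply(clean_text)
```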

Please note that I have also used the stopwords dictionary to remove common stop words like "but", "because", "and", etc.

Part of Speech Tagging.

Wikipedia's definition of Part-of-Speech (POS) tagging:

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In simple words, it is an efficient way of tagging each word in a sentence based on its definition and its context. For example, in the phrase "something like you" the word "like" is a preposition and carries a neutral sentiment, whereas in the sentence "I like you" the word "like" is a verb and carries a positive sentiment. POS tagging is also helpful for word lemmatization. Let us tag each word in the corpus using the following code:

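A minimal sketch of this tagging step, using NLTK's pos_tag on the tokenised stories (the column name POS_TAGGED is my own), might look like:

```python
# Tokenise each cleaned story and attach a Penn Treebank POS tag to every token
train['POS_TAGGED'] = train['STORY'].apply(lambda s: pos_tag(word_tokenize(s)))
print(train['POS_TAGGED'].head())
```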

The output is a list of (word, POS tag) tuples for each story.

A given word can have multiple interpretations. POS tagging the words of each sentence helps us find the exact interpretation, so the algorithm can understand the specific meaning conveyed by a given sentence, thereby improving its predictions. This is also known as word sense disambiguation.

Word Lemmatization.

Lemmatization is the process of reducing the inflectional forms of a word to its base form. For example, the word "like" is a base form, whereas "likes", "liked" and "liking" are inflectional forms. This requires a morphological analysis of each word, which is done using the WordNet dictionary provided by the nltk library. In the code below I have not only POS tagged each word but also mapped each tag to the input character that the WordNet lemmatizer accepts.

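A minimal sketch of that mapping and lemmatization step, using NLTK's WordNetLemmatizer (the helper names are my own), might look like:

```python
lemmatizer = WordNetLemmatizer()

def to_wordnet_tag(treebank_tag):
    # Map Penn Treebank tags (JJ, VB, RB, NN, ...) to the single-character
    # POS codes that WordNetLemmatizer.lemmatize() accepts
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_story(tagged_tokens):
    # Lemmatize each word using its POS tag, keeping the tag alongside the lemma
    return [(lemmatizer.lemmatize(word, to_wordnet_tag(tag)), tag)
            for word, tag in tagged_tokens]

train['LEMMATIZED'] = train['POS_TAGGED'].apply(lemmatize_story)
```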

The output shows words like "granted" turning into "grant" and "structured" into "structure".


It is always good practice to visualize the most common words, for example with a bar plot, to check whether there are words with very low frequency that should be removed. I used such a bar plot to inspect the most common words in my data frame.

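A sketch of how such a plot can be produced, using NLTK's FreqDist and matplotlib (an assumption; the original plot may have been made differently), might look like:

```python
import matplotlib.pyplot as plt
from nltk import FreqDist

# Count word frequencies across all lemmatized stories
all_words = [word for story in train['LEMMATIZED'] for word, tag in story]
freq = FreqDist(all_words)

# Bar plot of the 25 most common words
words, counts = zip(*freq.most_common(25))
plt.figure(figsize=(12, 4))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title('Most common words in the corpus')
plt.tight_layout()
plt.show()
```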

After lemmatization I chose to keep the tags obtained from POS tagging in order to retain the interpretation of each text. I did this using the following code:

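A minimal sketch of this step, joining each lemma with its tag back into a single string (the underscore separator and the column name CLEAN_STORY are my own choices), might look like:

```python
def join_lemma_and_tag(lemmatized_tokens):
    # e.g. [('grant', 'VBD'), ('structure', 'VBN')] -> "grant_VBD structure_VBN"
    return ' '.join(f'{word}_{tag}' for word, tag in lemmatized_tokens)

train['CLEAN_STORY'] = train['LEMMATIZED'].apply(join_lemma_and_tag)
```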



TF-IDF Vectorization

Term Frequency - Inverse Document Frequency (TF-IDF) vectorization is a very common technique for converting text into a meaningful representation of numbers. It is an occurrence-based technique: in simple words, it represents a text by how many times each word occurs in it. However, documents in a corpus come in different sizes, so a larger document will tend to have higher word counts than a smaller one. A better representation is therefore obtained by normalizing the occurrence of a word by the size of the document; this is called term frequency.

tf(w) = doc.count(w)/total words in doc

However, certain words like 'the', 'a', 'in', etc. are so common across documents that they suppress the weights of more meaningful words. To reduce this effect, the term frequency is discounted by a factor called the inverse document frequency.

idf(w) = log(total number of documents / number of documents containing word w)

Hence: tf-idf(w) = tf(w) * idf(w)

Code:

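A minimal sketch using scikit-learn's TfidfVectorizer (the max_features cap is my own choice, to keep the dense matrix manageable) might look like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the processed training stories; the same fitted vectorizer would be
# used to transform the hackathon test stories before prediction
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(train['CLEAN_STORY']).toarray()
y = train['SECTION'].values
```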

The output is a document-term matrix with one TF-IDF-weighted column for every word in the vocabulary.


K-fold Cross-validation

Cross-validation is a way of assessing the predictive performance of a model on out-of-sample data, i.e. data it was not trained on.

This method splits the training data into k equal-sized sub-samples. The model is then trained k times, each time on (k-1) of the sub-samples while the remaining one is held out for validation, with a different sub-sample held out on each run. This yields k scores for each model. The advantage is that every observation is used for both training and validation, and each sub-sample is used for validation exactly once.

Let us divide the data into training and test data first.

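A minimal sketch of that split, using scikit-learn's train_test_split (the 80/20 ratio and random_state are my own choices), might look like:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled data for validation
# (distinct from the hackathon's unlabelled test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```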

Now we import the candidate models from the sklearn library and write code to obtain a k-fold cross-validation result for each model.

code:

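A minimal sketch, assuming a handful of common classifiers (the exact models used in the original notebook may differ), might look like:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Candidate models to compare with k-fold cross-validation
models = [
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Linear SVC', LinearSVC()),
    ('Random Forest', RandomForestClassifier(n_estimators=100)),
]
```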

The next step is to write a 'for' loop to get the results in order:

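A minimal sketch of that loop, reporting 5-fold cross-validated accuracy for each model (the fold count is my own choice), might look like:

```python
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models:
    # Mean accuracy across the 5 folds of the training split
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    print(f'{name}: mean accuracy = {scores.mean():.4f} (+/- {scores.std():.4f})')
```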


The results are promising; however, there is one final touch before feeding the data to the machine learning models, and that is to reduce the dimensionality of the data. This is an important step: not only do we keep the most informative features, but the models also become more efficient with the reduced data size. In the next step we will use one of the many feature extraction techniques, PCA, and see how it helps improve accuracy.

Principal Component Analysis (PCA)

This technique reduces the dimensionality of the data by removing highly correlated variables while retaining as much of the variation present in the dataset as possible. To achieve this, the variables are transformed into a new set of variables called principal components (PCs). The PCs are orthogonal and are ordered so that the amount of the original variation they retain decreases as we move down the order. The PCs are the eigenvectors of the covariance matrix, which is why they are orthogonal.

In the code below I chose to keep the top 1,500 components of the dataset.

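A minimal sketch of that reduction with scikit-learn's PCA (this assumes the TF-IDF features have already been converted to a dense array, as above) might look like:

```python
from sklearn.decomposition import PCA

# Project the TF-IDF features onto the top 1500 principal components
pca = PCA(n_components=1500)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
```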

Let's see what results we obtain after applying PCA.

Code:

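A minimal sketch, re-running the same cross-validation loop on the PCA-reduced features, might look like:

```python
for name, model in models:
    # Same 5-fold setup as before, now on the 1500 principal components
    scores = cross_val_score(model, X_train_pca, y_train, cv=kfold, scoring='accuracy')
    print(f'{name} (with PCA): mean accuracy = {scores.mean():.4f}')
```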



Conclusion

Based on the results above, we can choose which model is best for us. It is also clear that PCA managed to improve the results to some extent. Possible areas of improvement would be spelling correction and the removal of conjunctions and prepositions during the data cleaning process. Another method that could be useful for better prediction would be topic modelling using Latent Dirichlet Allocation (LDA).

Note: To see how I tuned the models in the next step using GridSearchCV for better performance, you can access the full code on my GitHub profile.


