Basic Text Analysis: Tokenizers & Word Frequency

We have already been using the tokenization functions of the NLTK library to split sentences into words. Sentences differ because they are composed of different words, so it becomes essential to analyze the composition of each sentence and compare the frequencies of the words occurring in it. This is not the only use case, though: the differing composition of sentences also makes it easier for a machine to distinguish between texts and to find similar ones.

Tokenizers

Let’s look at two types of tokenizers here: WhitespaceTokenizer and WordPunctTokenizer. We will be using the NLTK library for this purpose. These two tokenizers differ only slightly in the way they split sentences.

1.      WhitespaceTokenizer:

Step 1: Import required libraries

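The original code screenshot is not available; a minimal sketch of the imports used throughout this walkthrough (FreqDist and matplotlib are needed for the later steps):

    # Tokenizers and the frequency-distribution class from NLTK
    from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer
    from nltk.probability import FreqDist
    import matplotlib.pyplot as plt  # used later to plot the distribution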

Step 2: Initialize the tokenizer; it will split the text on whitespace.

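A sketch of this step; the sample sentence is my own illustrative choice, not the one from the original screenshot:

    # Illustrative sample text (assumed, not the original example)
    text = "Natural Language Processing is fun, isn't it? Yes, NLP is fun."

    # Initialize the tokenizer; it splits the text on whitespace only
    ws_tokenizer = WhitespaceTokenizer()
    ws_tokens = ws_tokenizer.tokenize(text)
    print(ws_tokens)
    # ['Natural', 'Language', 'Processing', 'is', 'fun,', "isn't", 'it?',
    #  'Yes,', 'NLP', 'is', 'fun.']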

Step 3: Build the vocabulary of the text. This simply means collapsing the repeated occurrences of a word into a single entry.

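Continuing the sketch from the previous step:

    # The vocabulary is simply the set of unique tokens
    ws_vocabulary = set(ws_tokens)
    print(ws_vocabulary)  # set order is arbitrary; 'is' now appears only once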

2.      WordPunctTokenizer: Repeat the above steps for this tokenizer as well; it produces different results, as shown below.

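A sketch of the same steps with WordPunctTokenizer, on the same illustrative sentence:

    # WordPunctTokenizer splits on whitespace AND breaks punctuation off
    wp_tokenizer = WordPunctTokenizer()
    wp_tokens = wp_tokenizer.tokenize(text)
    print(wp_tokens)
    # ['Natural', 'Language', 'Processing', 'is', 'fun', ',', 'isn', "'", 't',
    #  'it', '?', 'Yes', ',', 'NLP', 'is', 'fun', '.']
    print(set(wp_tokens))  # the corresponding vocabulary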

It is pretty evident from the results that the former tokenizer keeps punctuation attached to the adjacent word (a comma or question mark stays glued to the word before it), whereas the latter splits punctuation off into separate tokens, so it appears in the vocabulary on its own. Which one to use depends on the model one wants to build.

Frequency distribution

A frequency distribution records the number of times each word is repeated in the text. It can be applied once tokenization is done, i.e., once the text is available as a list of words.

Step 4: Convert the list of tokens into a frequency distribution, as shown below; the resulting counts are printed alongside.

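A sketch, applied to the WordPunct tokens from above:

    # Count how many times each token occurs
    fdist = FreqDist(wp_tokens)
    print(fdist.most_common(5))
    # [('is', 2), ('fun', 2), (',', 2), ('Natural', 1), ('Language', 1)]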

Step 5: Plot it for the top 15 words

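A sketch; with the short illustrative sentence there are fewer than 15 distinct tokens, so all of them appear in the plot:

    # Plot the 15 most frequent tokens
    fdist.plot(15, cumulative=False)
    plt.show()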

The distribution can also be used to look up the frequency of a particular word.

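For instance, with the illustrative sentence used above:

    # Index the distribution with a token to get its count
    print(fdist['is'])   # 2
    print(fdist['NLP'])  # 1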

The words that are least useful for telling sentences apart tend to have the highest frequencies. Such stop words therefore play a minimal role, if any, in calculating the similarity or difference between two texts.

Link to code: https://github.com/jyotiyadav99111/30daysofNLP/tree/main/day6

I hope you liked it. Stay tuned for more!
