Basic Text Analysis: Tokenizers & Word Frequency
We have already been using NLTK's tokenization functions to split sentences into sets of words. Sentences differ because they are composed of different words, so it becomes essential to analyze the composition of each sentence and compare the frequencies of the words occurring in it. This is not the only use case, though: differences in composition also make it easier for a machine to distinguish between texts and to search for similar ones.
Tokenizers
Let’s look at two types of tokenizers here: WhitespaceTokenizer and WordPunctTokenizer, both from the NLTK library. The two differ only slightly in the way they tokenize sentences.
1. WhitespaceTokenizer:
Step 1: Import required libraries
Step 2: Initialize the tokenizer and tokenize the text on whitespace
Step 3: Build a vocabulary out of the text, i.e., collapse repeated occurrences of the same word into a single entry
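Here is a minimal sketch of these three steps; the sample sentence is my own illustration, not from the original post:

```python
from nltk.tokenize import WhitespaceTokenizer

text = "Hello there! Natural Language Processing, or NLP, is fun."

# Step 2: tokenize on whitespace only; punctuation stays attached to words
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# ['Hello', 'there!', 'Natural', 'Language', 'Processing,', 'or', 'NLP,', 'is', 'fun.']

# Step 3: build the vocabulary by collapsing repeated words into unique entries
vocabulary = set(tokens)
print(vocabulary)
```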
2. WordPunctTokenizer: Repeat the above steps for this tokenizer; it produces different results.
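A sketch with the same sample sentence, swapping in WordPunctTokenizer:

```python
from nltk.tokenize import WordPunctTokenizer

text = "Hello there! Natural Language Processing, or NLP, is fun."

# WordPunctTokenizer splits on whitespace AND punctuation,
# so punctuation marks become tokens of their own
tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# ['Hello', 'there', '!', 'Natural', 'Language', 'Processing', ',', 'or', 'NLP', ',', 'is', 'fun', '.']

vocabulary = set(tokens)
```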
It is evident from the results that the former tokenizer treats punctuation as part of the adjacent word, while the latter splits punctuation out into tokens of its own. Which one to use depends on the model you want to build.
Frequency distribution
A frequency distribution records the number of times each word occurs in a text. It can be applied after tokenization, once the words are available as a list of tokens.
Step 4: Convert the tokenized text into a frequency distribution using NLTK's FreqDist function
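A sketch of this step, assuming the tokens list from the WordPunctTokenizer example above:

```python
from nltk.probability import FreqDist

# Count how many times each token occurs
fdist = FreqDist(tokens)
print(fdist.most_common(5))
# e.g. [(',', 2), ('Hello', 1), ('there', 1), ('!', 1), ('Natural', 1)]
```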
Step 5: Plot it for the top 15 words
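A sketch of the plotting step; FreqDist.plot draws with matplotlib, so it must be installed:

```python
import matplotlib.pyplot as plt

# Plot the 15 most frequent tokens as a line chart of counts
fdist.plot(15, cumulative=False)
plt.show()
```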
It can be used to get the frequency corresponding to a particular word as well.
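For example, a FreqDist can be indexed like a dictionary (using the fdist from the sketch above):

```python
# Look up the count of a single token; unseen tokens return 0
print(fdist['NLP'])     # 1 for the sample sentence
print(fdist['banana'])  # 0
```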
The words that contribute least to distinguishing sentences tend to have the highest frequencies. Therefore, stop words play a minimal to nonexistent role in calculating the similarity or difference between two texts.
Link to code: https://github.com/jyotiyadav99111/30daysofNLP/tree/main/day6
I hope you liked it. Stay tuned for more!