Basic Text Analysis: Tokenizers & Word Frequency
We have already been using NLTK's tokenization functions to split sentences into sets of words. Sentences differ because they are composed of different words, so it becomes essential to analyze the composition of each sentence and compare the frequencies of the words occurring in it. This is not the only use case, though: differences in composition also make it easier for a machine to distinguish between texts and to search for similar ones.
Tokenizers
Let’s look at two types of tokenizers here: WhitespaceTokenizer and WordPunctTokenizer, both from the NLTK library. The two differ only slightly in the way they tokenize sentences.
1. WhitespaceTokenizer:
Step 1: Import required libraries
Step 2: Initialize the tokenizer and tokenize the text on whitespace
Step 3: Build a vocabulary out of the text, i.e., collapse repeated occurrences of the same word into a single entry
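Here is a minimal sketch of these three steps; the sample sentence is my own illustration, not from the original post:

```python
from nltk.tokenize import WhitespaceTokenizer

text = "Hello there! Natural Language Processing, or NLP, is fun."

# Step 2: tokenize on whitespace only; punctuation stays attached to words
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# ['Hello', 'there!', 'Natural', 'Language', 'Processing,', 'or', 'NLP,', 'is', 'fun.']

# Step 3: build the vocabulary by collapsing repeated words into unique entries
vocabulary = set(tokens)
print(vocabulary)
```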
2. WordPunctTokenizer: Repeat the above steps for this tokenizer; it produces different results.
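A sketch with the same sample sentence, swapping in WordPunctTokenizer:

```python
from nltk.tokenize import WordPunctTokenizer

text = "Hello there! Natural Language Processing, or NLP, is fun."

# WordPunctTokenizer splits on whitespace AND punctuation,
# so punctuation marks become tokens of their own
tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# ['Hello', 'there', '!', 'Natural', 'Language', 'Processing', ',', 'or', 'NLP', ',', 'is', 'fun', '.']

vocabulary = set(tokens)
```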
It is evident from the results that the former tokenizer treats punctuation as part of the adjacent word, while the latter splits punctuation out into tokens of its own. Which one to use depends on the model you want to build.
Frequency distribution
A frequency distribution records the number of times each word occurs in a text. It can be applied after tokenization, once the words are available as a list of tokens.
Step 4: Convert the tokenized text into a frequency distribution using NLTK's FreqDist function
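A sketch of this step, assuming the tokens list from the WordPunctTokenizer example above:

```python
from nltk.probability import FreqDist

# Count how many times each token occurs
fdist = FreqDist(tokens)
print(fdist.most_common(5))
# e.g. [(',', 2), ('Hello', 1), ('there', 1), ('!', 1), ('Natural', 1)]
```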
Step 5: Plot it for the top 15 words
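A sketch of the plotting step; FreqDist.plot draws with matplotlib, so it must be installed:

```python
import matplotlib.pyplot as plt

# Plot the 15 most frequent tokens as a line chart of counts
fdist.plot(15, cumulative=False)
plt.show()
```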
It can be used to get the frequency corresponding to a particular word as well.
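For example, a FreqDist can be indexed like a dictionary (using the fdist from the sketch above):

```python
# Look up the count of a single token; unseen tokens return 0
print(fdist['NLP'])     # 1 for the sample sentence
print(fdist['banana'])  # 0
```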
The words that contribute least to distinguishing sentences tend to have the highest frequencies. Therefore, stop words play a minimal to nonexistent role in calculating the similarity or difference between two texts.
Link to code: https://github.com/jyotiyadav99111/30daysofNLP/tree/main/day6
I hope you liked it. Stay tuned for more!