N-grams
N-gram is one of the most commonly used terms in NLP. The term may sound difficult, but it is actually very easy to understand: an n-gram is simply a sequence of words that co-occur in a text, and the N in “N-gram” tells you how many consecutive words to consider. N=1 is called a “unigram”, n=2 a “bigram”, n=3 a “trigram”, and beyond that it's a 4-gram, 5-gram, etc.
Here is an example of unigrams, bigrams and trigrams. The unigrams are just the individual words, the same as tokenizing the text. The bigrams are the pairs of two co-occurring words in the sentence, and the trigrams are the triples. For the sentence “Thanks for your patience”, the unigrams are “Thanks”, “for”, “your”, “patience”; the bigrams are (“Thanks”, “for”), (“for”, “your”), (“your”, “patience”); and the trigrams are (“Thanks”, “for”, “your”) and (“for”, “your”, “patience”).
Application
While you are writing an email, Gmail suggests the next word. N-grams are one way to build such a feature. Let’s look at an example.
Consider bigrams which, as described above, are pairs of co-occurring words. Based on the bigrams seen in the training data, the model predicts the next word from the last word that has been written.
Probability aspect of n-grams
Consider the following three sentences:
1. Thanks for your patience.
2. I liked your watch.
3. Resume for your reference.
Based on the above training data, I want to suggest the next word once the user writes “your” in a sentence. If we apply a bigram model here, the probability of suggesting “patience” is 1/3, since “patience” follows “your” one time out of the three total occurrences of the word “your”. The same holds for “watch” and “reference”.
Whereas, if we use a trigram model, windows of three words are considered during training. So, at prediction time, the model suggests a word once the person writes “for your”. This reduces the ambiguity: the updated probability of predicting “patience” is 1/2, as “for your” occurs in a total of 2 trigrams. The model will now make the right prediction in half of such cases.
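To make this concrete, here is a small Python sketch (an illustration of the idea, not Gmail’s actual system) that counts bigrams and trigrams in the three training sentences and recovers the probabilities above:

```python
from collections import Counter

# Toy training corpus: the three sentences above.
corpus = [
    "Thanks for your patience",
    "I liked your watch",
    "Resume for your reference",
]

bigram_counts = Counter()
trigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

def p_bigram(prev, word):
    """P(word | prev): how often `word` follows `prev`."""
    total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return bigram_counts[(prev, word)] / total

def p_trigram(prev2, prev1, word):
    """P(word | prev2 prev1): how often `word` follows the pair."""
    total = sum(c for (w1, w2, _), c in trigram_counts.items()
                if (w1, w2) == (prev2, prev1))
    return trigram_counts[(prev2, prev1, word)] / total

print(p_bigram("your", "patience"))           # 1/3
print(p_trigram("for", "your", "patience"))   # 1/2
```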
N-gram code examples
1. Unigram:
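A minimal Python sketch, assuming simple whitespace tokenization (real code would typically use a proper tokenizer that handles punctuation and casing):

```python
# Unigrams: just the individual tokens of the text.
text = "Thanks for your patience"
unigrams = text.split()
print(unigrams)  # ['Thanks', 'for', 'your', 'patience']
```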
2. Bigram:
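A sketch along the same lines: pair each token with the token that follows it.

```python
# Bigrams: every pair of consecutive tokens.
text = "Thanks for your patience"
tokens = text.split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('Thanks', 'for'), ('for', 'your'), ('your', 'patience')]
```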
3. Trigram:
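And the same idea extended to windows of three tokens:

```python
# Trigrams: every run of three consecutive tokens.
text = "Thanks for your patience"
tokens = text.split()
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams)  # [('Thanks', 'for', 'your'), ('for', 'your', 'patience')]
```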
I hope you liked it. Stay tuned for more!