N-grams
N-gram is one of the most commonly used terms in NLP. The term may sound difficult, but it is actually very easy to understand: an n-gram is simply a sequence of words that co-occur in a text, and the N in “N-gram” tells you how many consecutive words to consider. N=1 is called a “unigram”, n=2 a “bigram”, n=3 a “trigram”, and beyond that it's a 4-gram, 5-gram, etc.
Here is an example of unigrams, bigrams and trigrams. The unigrams are just the individual words, the same as tokenizing the text. The bigrams are the pairs of two co-occurring words in the sentence, and the trigrams are the triples. For the sentence “Thanks for your patience”, the unigrams are “Thanks”, “for”, “your”, “patience”; the bigrams are (“Thanks”, “for”), (“for”, “your”), (“your”, “patience”); and the trigrams are (“Thanks”, “for”, “your”) and (“for”, “your”, “patience”).
Application
While you are writing an email, Gmail suggests the next word. N-grams are one way to build such a feature. Let’s look at an example.
Consider bigrams which, as described above, are pairs of co-occurring words. Based on the bigrams seen in the training data, the model predicts the next word from the last word that has been written.
Probability aspect of n-grams
Consider the following three sentences:
1. Thanks for your patience.
2. I liked your watch.
3. Resume for your reference.
Based on the above training data, I want to suggest the next word once the user writes “your” in a sentence. If we apply a bigram model here, the probability of suggesting “patience” is 1/3, since “patience” follows “your” one time out of the three total occurrences of the word “your”. The same holds for “watch” and “reference”.
Whereas, if we use a trigram model, windows of three words are considered during training. So, at prediction time, the model suggests a word once the person writes “for your”. This reduces the ambiguity: the updated probability of predicting “patience” is 1/2, as “for your” occurs in a total of 2 trigrams. The model will now make the right prediction in half of such cases.
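To make this concrete, here is a small Python sketch (an illustration of the idea, not Gmail’s actual system) that counts bigrams and trigrams in the three training sentences and recovers the probabilities above:

```python
from collections import Counter

# Toy training corpus: the three sentences above.
corpus = [
    "Thanks for your patience",
    "I liked your watch",
    "Resume for your reference",
]

bigram_counts = Counter()
trigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    bigram_counts.update(zip(tokens, tokens[1:]))
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

def p_bigram(prev, word):
    """P(word | prev): how often `word` follows `prev`."""
    total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return bigram_counts[(prev, word)] / total

def p_trigram(prev2, prev1, word):
    """P(word | prev2 prev1): how often `word` follows the pair."""
    total = sum(c for (w1, w2, _), c in trigram_counts.items()
                if (w1, w2) == (prev2, prev1))
    return trigram_counts[(prev2, prev1, word)] / total

print(p_bigram("your", "patience"))           # 1/3
print(p_trigram("for", "your", "patience"))   # 1/2
```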
N-gram code examples
1. Unigram:
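A minimal Python sketch, assuming simple whitespace tokenization (real code would typically use a proper tokenizer that handles punctuation and casing):

```python
# Unigrams: just the individual tokens of the text.
text = "Thanks for your patience"
unigrams = text.split()
print(unigrams)  # ['Thanks', 'for', 'your', 'patience']
```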
2. Bigram:
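A sketch along the same lines: pair each token with the token that follows it.

```python
# Bigrams: every pair of consecutive tokens.
text = "Thanks for your patience"
tokens = text.split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('Thanks', 'for'), ('for', 'your'), ('your', 'patience')]
```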
3. Trigram:
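And the same idea extended to windows of three tokens:

```python
# Trigrams: every run of three consecutive tokens.
text = "Thanks for your patience"
tokens = text.split()
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
print(trigrams)  # [('Thanks', 'for', 'your'), ('for', 'your', 'patience')]
```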
I hope you liked it. Stay tuned for more!