Natural Language Processing - Begin the learning naturally!
Natural language processing is all about making computers understand human language and, in turn, generate human language. Of late, many applications have been built on NLP, including chatbots, language translation, image captioning, text generation, and a lot more. The journey of NLP started in the mid-20th century with syntactic structures and has evolved all the way to DALL-E, released in 2021 by OpenAI. DALL-E can understand linguistic input and draw accurate pictures from it.
To understand any model implementation, we need to have the basics right, starting with how inputs are processed and fed to the model. We will now walk through the very basics of NLP: tokenization, stemming, lemmatization, and the use of stop words. I am not trying to provide you with textbook definitions; the focus is purely on understanding and application. The blog series will continue with Bag of Words, TF-IDF, Word2Vec, spam-ham classifiers, NLP through LSTMs, attention-based models, and transformers! Let's keep up the enthusiasm for learning and dive into the basics now.
Tokenization
Tokenization is splitting a paragraph, a sentence, or even an entire document into smaller units such as individual sentences, words, or terms. These smaller units are called tokens. Consider the sentence 'I love my cat'. Performing tokenization on this sentence should give a list of words: ['I', 'love', 'my', 'cat']. Tokenization is broadly classified into two types: sentence tokenization and word tokenization.
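As a quick sanity check, here is a minimal sketch of that example with NLTK's word_tokenize (assuming NLTK is installed and its tokenizer models have been downloaded):
import nltk
# nltk.download('punkt')  # One-time download of the tokenizer models

tokens = nltk.word_tokenize('I love my cat')
print(tokens)  # ['I', 'love', 'my', 'cat']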
1. Sentence Tokenization:
Performing sentence tokenization gives a list of sentences in the given paragraph or document. For example, the paragraph "I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories." on sentence tokenization gives the following list of sentences:
import nltk
nltk.download('punkt')  # Download the tokenizer models used by sent_tokenize / word_tokenize

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""
# If you are wondering why three quotes - Python's triple quotes allow a string to span multiple lines,
# including newline characters, tab spaces, or any other special characters.

sentences = nltk.sent_tokenize(paragraph)
print(sentences)
Output:
['I am honored to be with you today at your commencement from one of the finest universities in the world.', 'I never graduated from college.', 'Truth be told, this is the closest I’ve ever gotten to a college graduation.', 'Today I want to tell you three stories from my life.', 'That’s it.', 'No big deal.', 'Just three stories.']
2. Word Tokenization:
Performing word tokenization gives a list of words in the given paragraph or document. It works the same way as sentence tokenization, except that instead of sentences, we get a list of the words in the text.
Consider the same paragraph as input, and let's see the output upon performing word tokenization:
import nltk
# nltk.download('punkt')  # Already downloaded in the previous step

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

words = nltk.word_tokenize(paragraph)
print(words)
Output:
['I', 'am', 'honored', 'to', 'be', 'with', 'you', 'today', 'at', 'your', 'commencement', 'from', 'one', 'of', 'the', 'finest', 'universities', 'in', 'the', 'world', '.', 'I', 'never', 'graduated', 'from', 'college', '.', 'Truth', 'be', 'told', ',', 'this', 'is', 'the', 'closest', 'I', '’', 've', 'ever', 'gotten', 'to', 'a', 'college', 'graduation', '.', 'Today', 'I', 'want', 'to', 'tell', 'you', 'three', 'stories', 'from', 'my', 'life', '.', 'That', '’', 's', 'it', '.', 'No', 'big', 'deal', '.', 'Just', 'three', 'stories', '.']
A few points to note:
- We've got every word from the paragraph, and this includes punctuation and special characters as well.
- As humans, we'd need the complete sentence to get its meaning; however, a model can get the context of a sentence from its keywords alone. Prepositions, conjunctions, and articles can be taken out of the sentence, and it still gives us the context.
- Plurals, verb forms, and adverbs can be reduced to root words; the root words can convey the context without the need for their extensions.
We can address these points with the help of a few techniques:
a. Regular expressions:
With the help of regular expressions, we can drop numbers, special characters, punctuation, or any other pattern we can express as a regular expression (a runnable sketch follows the list below).
Ex: sentences = re.sub('[^a-zA-Z]', ' ', sentences)
- The first argument is the regular expression; the pattern [^a-zA-Z] matches every character that is not a letter.
- The second argument is the replacement: every matched character is replaced with a space ' '.
- The third argument is the data (the string) the regular expression should be applied to.
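Putting the three arguments together, here is a minimal runnable sketch; the sample sentence is just an illustration:
import re

sentence = "That's it. No big deal!"
# Every character that is not a letter (apostrophe, period, exclamation mark) becomes a space
cleaned = re.sub('[^a-zA-Z]', ' ', sentence)
print(cleaned)  # 'That s it  No big deal '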
b. Stop words:
The NLTK library has a list of stop words for each language it supports. By looping over the words (or sentences) in our data and checking each against this stop word set, we can drop the stop words.
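Here is a minimal sketch of stop word removal with NLTK; note that the stop word list is lowercase, so this sketch lowercases each token before the lookup:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # One-time download of the stop word lists

stop_words = set(stopwords.words('english'))
words = nltk.word_tokenize('I never graduated from college')
filtered = [word for word in words if word.lower() not in stop_words]
print(filtered)  # ['never', 'graduated', 'college']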
c. Stemming and Lemmatization:
This is the process of reducing words to their word stems. Stemming chops a word down to a root that may or may not be a meaningful word on its own, whereas lemmatization produces a root word (the lemma) that is actually part of the language.
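To see the difference on a few words from our paragraph, here is a small side-by-side sketch (the WordNet corpus needs a one-time download):
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # One-time download, needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['commencement', 'universities', 'college', 'stories']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))

# commencement -> commenc | commencement
# universities -> univers | university
# college -> colleg | college
# stories -> stori | story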
Implementation combining regular expressions, stop words, and stemming (lemmatization follows below):
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# nltk.download('stopwords')  # One-time download of the stop word lists

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Stemming: clean each sentence, drop the stop words, and reduce the remaining words to their stems
for i in range(len(sentences)):
    sentences[i] = re.sub('[^a-zA-Z]', ' ', sentences[i])
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)
The sentences we have now no longer contain any stop words, and each remaining word is reduced to its stem.
['I honor today commenc one finest univers world .', 'I never graduat colleg .', 'truth told , closest I ’ ever gotten colleg graduat .', 'today I want tell three stori life .', 'that ’ .', 'No big deal .', 'just three stori .']
If we observe closely, there are words that do not carry a complete meaning, e.g., commenc → commencement; univers → universities; colleg → college; stori → stories.
These are root words obtained through stemming. As mentioned, the major problem with stemming is that the intermediate representation it produces may not be a meaningful word. Let us check the lemmatization route!
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# nltk.download('wordnet')  # One-time download, needed by WordNetLemmatizer

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Lemmatization: clean each sentence, drop the stop words, and reduce the remaining words to their lemmas
for i in range(len(sentences)):
    sentences[i] = re.sub('[^a-zA-Z]', ' ', sentences[i])
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)
The output here will have root words that are meaningful.
['I honored today commencement one finest university world .', 'I never graduated college .', 'Truth told , closest I ’ ever gotten college graduation .', 'Today I want tell three story life .', 'That ’ .', 'No big deal .', 'Just three story .']
Hope you got some meaningful insights from this blog! You can access the code snippets on GitHub. I will walk you through another interesting topic tomorrow!