Natural Language Processing - Begin the learning naturally!
Natural language processing is all about making computers understand human language and, in turn, generate human language. Of late, many applications have been built on NLP, including chatbots, language translation, image captioning, text generation, and a lot more. The journey of NLP started in the mid-20th century with syntactic structures and has evolved all the way to DALL-E, released in 2021 by OpenAI. DALL-E can understand linguistic input and draw accurate pictures from it.
To understand any model implementation, we need to have the basics right, starting with how inputs are processed and fed to the model. We will now walk through the very basics of NLP: tokenization, stemming, lemmatization, and the use of stop words. I am not trying to provide you with textbook definitions; the focus is purely on understanding and application. The blog series will continue with Bag of Words, TF-IDF, Word2Vec, spam-ham classifiers, NLP through LSTMs, attention-based models, and transformers! Let's keep up the enthusiasm for learning and dive into the basics now.
Tokenization
Tokenization is splitting a paragraph, a sentence, or even an entire document into smaller units such as individual sentences, words, or terms. These smaller units are called tokens. Consider the sentence 'I love my cat'. Performing tokenization on this sentence should give a list of words: ['I', 'love', 'my', 'cat']. Tokenization is broadly classified into two types: sentence tokenization and word tokenization.
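As a quick sanity check, here is a minimal sketch of that example with NLTK's word_tokenize (assuming NLTK is installed and its tokenizer models have been downloaded):
import nltk
# nltk.download('punkt')  # One-time download of the tokenizer models

tokens = nltk.word_tokenize('I love my cat')
print(tokens)  # ['I', 'love', 'my', 'cat']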
1. Sentence Tokenization:
Performing sentence tokenization gives a list of sentences in the given paragraph or document. For example, the paragraph "I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories." on sentence tokenization gives the following list of sentences:
import nltk
nltk.download('punkt')  # Download the tokenizer models used by sent_tokenize / word_tokenize

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""
# If you are wondering why three quotes - Python's triple quotes allow a string to span multiple lines,
# including newline characters, tab spaces, or any other special characters.

sentences = nltk.sent_tokenize(paragraph)
print(sentences)
Output:
['I am honored to be with you today at your commencement from one of the finest universities in the world.', 'I never graduated from college.', 'Truth be told, this is the closest I’ve ever gotten to a college graduation.', 'Today I want to tell you three stories from my life.', 'That’s it.', 'No big deal.', 'Just three stories.']
2. Word Tokenization:
Performing word tokenization gives a list of words in the given paragraph or document. It works the same way as sentence tokenization, except that instead of sentences, we get a list of the words in the text.
Consider the same paragraph as input, and let's see the output upon performing word tokenization:
import nltk
# nltk.download('punkt')  # Already downloaded in the previous step

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

words = nltk.word_tokenize(paragraph)
print(words)
Output:
['I', 'am', 'honored', 'to', 'be', 'with', 'you', 'today', 'at', 'your', 'commencement', 'from', 'one', 'of', 'the', 'finest', 'universities', 'in', 'the', 'world', '.', 'I', 'never', 'graduated', 'from', 'college', '.', 'Truth', 'be', 'told', ',', 'this', 'is', 'the', 'closest', 'I', '’', 've', 'ever', 'gotten', 'to', 'a', 'college', 'graduation', '.', 'Today', 'I', 'want', 'to', 'tell', 'you', 'three', 'stories', 'from', 'my', 'life', '.', 'That', '’', 's', 'it', '.', 'No', 'big', 'deal', '.', 'Just', 'three', 'stories', '.']
A few points to note:
- We've got every word from the paragraph, and this includes punctuation and special characters as well.
- As humans, we'd need the complete sentence to get its meaning; however, a model can get the context of a sentence from its keywords alone. Prepositions, conjunctions, and articles can be taken out of the sentence, and it still gives us the context.
- Plurals, verb forms, and adverbs can be reduced to root words; the root words can convey the context without the need for their extensions.
We can address these points with the help of a few techniques:
a. Regular expressions:
With the help of regular expressions, we can drop numbers, special characters, punctuation, or any other pattern we can express as a regular expression (a runnable sketch follows the list below).
Ex: sentences = re.sub('[^a-zA-Z]', ' ', sentences)
- The first argument is the regular expression; the pattern [^a-zA-Z] matches every character that is not a letter.
- The second argument is the replacement: every matched character is replaced with a space ' '.
- The third argument is the data (the string) the regular expression should be applied to.
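Putting the three arguments together, here is a minimal runnable sketch; the sample sentence is just an illustration:
import re

sentence = "That's it. No big deal!"
# Every character that is not a letter (apostrophe, period, exclamation mark) becomes a space
cleaned = re.sub('[^a-zA-Z]', ' ', sentence)
print(cleaned)  # 'That s it  No big deal '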
b. Stop words:
The NLTK library has a list of stop words for each language it supports. By looping over the words (or sentences) in our data and checking each against this stop word set, we can drop the stop words.
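Here is a minimal sketch of stop word removal with NLTK; note that the stop word list is lowercase, so this sketch lowercases each token before the lookup:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # One-time download of the stop word lists

stop_words = set(stopwords.words('english'))
words = nltk.word_tokenize('I never graduated from college')
filtered = [word for word in words if word.lower() not in stop_words]
print(filtered)  # ['never', 'graduated', 'college']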
c. Stemming and Lemmatization:
This is the process of reducing words to their word stems. Stemming chops a word down to a root that may or may not be a meaningful word on its own, whereas lemmatization produces a root word (the lemma) that is actually part of the language.
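To see the difference on a few words from our paragraph, here is a small side-by-side sketch (the WordNet corpus needs a one-time download):
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # One-time download, needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['commencement', 'universities', 'college', 'stories']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))

# commencement -> commenc | commencement
# universities -> univers | university
# college -> colleg | college
# stories -> stori | story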
Implementation combining regular expressions, stop words, and stemming (lemmatization follows below):
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# nltk.download('stopwords')  # One-time download of the stop word lists

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Stemming: clean each sentence, drop the stop words, and reduce the remaining words to their stems
for i in range(len(sentences)):
    sentences[i] = re.sub('[^a-zA-Z]', ' ', sentences[i])
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)
The sentences we have now no longer contain any stop words, and each remaining word is reduced to its stem.
['I honor today commenc one finest univers world .', 'I never graduat colleg .', 'truth told , closest I ’ ever gotten colleg graduat .', 'today I want tell three stori life .', 'that ’ .', 'No big deal .', 'just three stori .']
If we observe closely, there are words that do not carry a complete meaning, e.g., commenc → commencement; univers → universities; colleg → college; stori → stories.
These are root words obtained through stemming. As mentioned, the major problem with stemming is that the intermediate representation it produces may not be a meaningful word. Let us check the lemmatization route!
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# nltk.download('wordnet')  # One-time download, needed by WordNetLemmatizer

paragraph = """I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories."""

sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Lemmatization: clean each sentence, drop the stop words, and reduce the remaining words to their lemmas
for i in range(len(sentences)):
    sentences[i] = re.sub('[^a-zA-Z]', ' ', sentences[i])
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)

print(sentences)
The output here will have root words that are meaningful.
['I honored today commencement one finest university world .', 'I never graduated college .', 'Truth told , closest I ’ ever gotten college graduation .', 'Today I want tell three story life .', 'That ’ .', 'No big deal .', 'Just three story .']
Hope you got some meaningful insights from this blog! You can access the code snippets on GitHub. I will walk you through another interesting topic tomorrow!