Natural Language Processing (NLP)
Prashil Wanjari
Business analyst | Lean Six Sigma | AWS Cloud Practitioner | PSPO - WIP | Digital Transformation | Business Transformation | CRM Implementation
As per Wikipedia, it is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.
In simple words, it is about giving computers the ability to understand text and spoken words in much the same way human beings can.
There are three important aspects of NLP:
1. Tokenization
2. Stemming
3. Lemmatization
1. Tokenization: It is the process of breaking a piece of text down into smaller units called tokens, such as sentences or words.
Code to tokenize a paragraph into sentences and words (NLTK is a leading platform for building Python programs to work with human language data):
import nltk
nltk.download('punkt')  # tokenizer models required by sent_tokenize and word_tokenize
sentences = nltk.sent_tokenize(para)  # list of sentences in the paragraph
words = nltk.word_tokenize(para)  # list of word and punctuation tokens
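For example, on a short two-sentence paragraph (the sample text below is just an illustration):
para = "NLP is fun. Computers can read text."
print(nltk.sent_tokenize(para))  # ['NLP is fun.', 'Computers can read text.']
print(nltk.word_tokenize(para))  # ['NLP', 'is', 'fun', '.', 'Computers', 'can', 'read', 'text', '.']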
2. Stemming: It is the process of reducing words to their root form, called the stem. Stemmed words sometimes do not carry any meaning.
Example: "Finally", "Final", and "Finalized" are all stemmed to "final", while a word like "studies" is stemmed to "studi", which is not a real word.
Code to stem words:
from nltk.stem import PorterStemmer  # PorterStemmer from the NLTK library is used to stem words
ps = PorterStemmer()
stemmed_words = [ps.stem(w) for w in words]  # stem each token produced by word_tokenize
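A quick check of the stemmer's output (a minimal sketch):
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print([ps.stem(w) for w in ["finally", "final", "finalized", "studies"]])  # ['final', 'final', 'final', 'studi']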
3. Lemmatization: It is the process of reducing words to their base dictionary form, called the lemma. Unlike stems, lemmatized words do carry meaning.
Example: "studies" is lemmatized to "study", which is a real word (unlike the stem "studi").
Code to lemmatize words:
from nltk.stem import WordNetLemmatizer  # WordNetLemmatizer from the NLTK library is used to lemmatize words
nltk.download('wordnet')  # WordNet data required by the lemmatizer
wnl = WordNetLemmatizer()
lemmatized_words = [wnl.lemmatize(w) for w in words]  # lemmatize each token produced by word_tokenize
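A quick check of the lemmatizer's output (a minimal sketch; lemmatize treats words as nouns by default, so pos="v" is passed to handle a verb form):
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize("studies"))             # 'study'
print(wnl.lemmatize("finalized", pos="v"))  # 'finalize'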
Here is an example of a problem statement I worked on using NLP.
Problem Statement: Use machine learning to create a model that identifies spam and ham (non-spam) emails.
Solution (a code sketch of the full pipeline follows the steps below):
1. Importing the required functions from the NLTK library (PorterStemmer, stopwords)
2. Importing pandas to read the dataset file
3. Assigning a variable to the PorterStemmer function
4. Looping over the messages to perform data cleaning (removing stopwords and stemming)
5. Importing CountVectorizer from sklearn.feature_extraction.text to perform BoW (Bag of Words)
6. Splitting the data into Train and Test datasets
7. Importing MultinomialNB from sklearn.naive_bayes to train the classifier and predict y_predict
8. Importing confusion_matrix to compare y_test and y_predict
9. Finding the accuracy_score by importing it from sklearn.metrics
Code - Git_Hub_Link
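Below is a minimal sketch of the pipeline described in the steps above. The file name spam.csv, the tab-separated format, the column names label and message, and parameters such as max_features and test_size are assumptions for illustration; the actual values are in the code linked above.

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

nltk.download('stopwords')

# Step 2: read the dataset (assumed tab-separated with columns 'label' and 'message')
df = pd.read_csv('spam.csv', sep='\t', names=['label', 'message'])

# Steps 1, 3, 4: data cleaning - keep letters only, lowercase, remove stopwords, stem
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []
for msg in df['message']:
    tokens = re.sub('[^a-zA-Z]', ' ', msg).lower().split()
    tokens = [ps.stem(w) for w in tokens if w not in stop_words]
    corpus.append(' '.join(tokens))

# Step 5: Bag of Words representation
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()
y = (df['label'] == 'spam').astype(int)  # 1 = spam, 0 = ham

# Step 6: train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 7: Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Steps 8 and 9: evaluation
print(confusion_matrix(y_test, y_predict))
print(accuracy_score(y_test, y_predict))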