Natural Language Processing (NLP)

According to Wikipedia, NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

In simple terms, it gives computers the ability to understand text and spoken words in much the same way human beings can.

Three basic text-preprocessing techniques in NLP are:

1. Tokenization

2. Stemming

3. Lemmatization 


1. Tokenization: It is the process of breaking a piece of text into smaller units called tokens, such as sentences or words.

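Code to tokenize a paragraph into sentences and words:

import nltk # NLTK is a leading platform for building Python programs to work with human language data.

nltk.download("punkt") # tokenizer models, needed once; newer NLTK releases may ask for "punkt_tab" instead

sentences = nltk.sent_tokenize(para) # para is a string holding the paragraph

words = nltk.word_tokenize(para)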
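
To make the idea concrete without installing NLTK or its data, here is a minimal sketch that imitates sentence and word tokenization with regular expressions. It is a rough stand-in, not NLTK's algorithm: the helper names split_sentences and split_words and the sample paragraph are made up for illustration, and the sentence rule will mis-split abbreviations such as "Dr. Smith".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample paragraph, used only for this illustration.
para = "NLP is fun. It helps computers read text!"

sentences = split_sentences(para)
words = split_words(para)

print(sentences)  # ['NLP is fun.', 'It helps computers read text!']
print(words)      # ['NLP', 'is', 'fun', '.', 'It', 'helps', 'computers', 'read', 'text', '!']
```

Like the NLTK word tokenizer, this sketch keeps punctuation marks as separate tokens, which is why "." and "!" show up in the word list.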


2. Stemming: It is the process of reducing words to a base form (the stem) by stripping suffixes. A stem is sometimes not a meaningful word.

Example: “finally”, “final”, and “finalized” are all stemmed to “final” by the Porter stemmer, while “studies” becomes “studi”, which is not a real word.

Code to stem words:

from nltk.stem import PorterStemmer # From the NLTK library, PorterStemmer is used to stem words

ps = PorterStemmer()

stems = [ps.stem(word) for word in words] # stem() takes one word at a time; words is a list of tokens
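
Why stems can end up as non-words is easier to see with a toy suffix stripper. The sketch below is not the Porter algorithm, which applies several ordered rule steps with extra conditions. The SUFFIXES list and the toy_stem function are hypothetical and only show the principle of chopping endings without consulting a dictionary.

```python
# Hypothetical rule list for illustration, longest suffixes first.
SUFFIXES = ("ization", "ized", "ing", "ly", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

stems = [toy_stem(w) for w in ["Finally", "Final", "Finalized", "Studies"]]
print(stems)  # ['final', 'final', 'final', 'studi']
```

The last result, "studi", is not an English word: the rule removed "es" blindly, which is exactly the trade-off stemming makes for speed and simplicity.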


3. Lemmatization: It also reduces words to a base form, but it relies on a vocabulary, so the result (the lemma) is a meaningful dictionary word.

Example: “studies” is lemmatized to “study”, and “finalized”, treated as a verb, to “finalize”. “finally” and “final” are already dictionary forms and stay unchanged.

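Code to lemmatize words:

from nltk.stem import WordNetLemmatizer # From the NLTK library, WordNetLemmatizer is used to lemmatize words

wnl = WordNetLemmatizer() # requires the WordNet data: nltk.download("wordnet")

lemmas = [wnl.lemmatize(word) for word in words] # the method name is lowercase; it takes one word at a time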
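
The difference from stemming is that a lemmatizer looks words up instead of chopping them. Here is a minimal sketch of that lookup idea, assuming a tiny hand-made dictionary. LEMMAS and toy_lemmatize are hypothetical names, and the real WordNetLemmatizer uses the WordNet database plus morphological rules rather than a hard-coded table. Its lemmatize method also takes a part-of-speech argument that defaults to noun, which the sketch mirrors.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("finalized", "v"): "finalize",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("Studies"))             # study
print(toy_lemmatize("finalized", pos="v"))  # finalize
print(toy_lemmatize("finalized"))           # finalized (no noun entry, so unchanged)
print(toy_lemmatize("better", pos="a"))     # good
```

The third call shows why the part of speech matters: without it, a verb form may simply come back unchanged.
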

Here is an example of a problem statement I worked on using NLP:

Problem Statement: Use machine learning to create a model that classifies emails as spam or ham (legitimate mail)

Solution: 

1. Importing PorterStemmer (from nltk.stem) and stopwords (from nltk.corpus) from the nltk library

2. Importing pandas to read the file

3. Creating a PorterStemmer instance and assigning it to a variable

4. Looping over the messages to clean the text (stop-word removal and stemming)

5. Importing CountVectorizer from sklearn.feature_extraction.text to build the Bag of Words (BoW) representation

6. Splitting the data into Train and Test datasets

7. Importing MultinomialNB from sklearn.naive_bayes to train the model and predict y_predict

8. Importing confusion_matrix from sklearn.metrics to compare y_test and y_predict

9. Computing the accuracy with accuracy_score from sklearn.metrics

Code - Git_Hub_Link
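
To show what steps 5 to 9 do under the hood, here is a minimal pure-Python sketch of a bag-of-words model with a multinomial Naive Bayes classifier. It is not the code behind the link above: it swaps CountVectorizer and MultinomialNB for hand-written stand-ins, skips the stop-word removal and stemming of the cleaning step, and uses a fixed train/test split. The DATA and TEST messages and all function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data standing in for the real mail file: (label, message).
DATA = [
    ("spam", "win a free prize now"),
    ("spam", "free cash offer claim now"),
    ("spam", "claim your free prize"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "see you at the meeting"),
    ("ham", "lunch at noon today"),
]
TEST = [("spam", "free prize offer"), ("ham", "meeting today at noon")]

def bag_of_words(message):
    """Count how often each lowercase word occurs in a message."""
    return Counter(message.lower().split())

def train(rows):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = {}
    class_counts = Counter()
    for label, message in rows:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(bag_of_words(message))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(message, model):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word, n in bag_of_words(message).items():
            if word in vocab:  # words never seen in training are skipped
                p = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(DATA)
y_test = [label for label, _ in TEST]
y_predict = [predict(message, model) for _, message in TEST]

# Confusion matrix as {(actual, predicted): count}, then accuracy.
confusion = Counter(zip(y_test, y_predict))
accuracy = sum(a == p for a, p in zip(y_test, y_predict)) / len(y_test)

print(y_predict)  # ['spam', 'ham']
print(accuracy)   # 1.0
```

With only six training messages the perfect accuracy says nothing about real-world performance. The point is the flow: count words per class, score a new message under each class, and compare the predictions with the true labels.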
