ML-based sentiment analysis of movie reviews
Abhishek Kori
Product Manager | Wholesale Lending, Commercial Real estate | Skills: Product, Project Management, UI/UX, Programming, Data Analysis, Data Governance, Visualization, Communication | PSPO?
Today let's learn about sentiment analysis of movie reviews. We will train a Naive Bayes classifier, a binary classifier provided by the NLTK Python library. After training, the classifier will classify a movie review as positive or negative. The dataset is provided by Cornell University. For this tutorial you need to know the basics of Python and working with external libraries.
Step 1: Install NLTK library
$ pip install nltk
The above command is a straightforward installation of NLTK, just as you would do for any other library.
Step 2: Download the dataset: Download the sentence polarity dataset v1.0, which has 5331 positive and 5331 negative reviews, from https://www.cs.cornell.edu/people/pabo/movie-review-data/. After downloading, extract the folder and place it in the same directory as your Python file.
Open your favourite text editor. Let's start building!
Step 3: Reading the movie reviews
import nltk

posFilePath = './rt-polaritydata/rt-polaritydata/rt-polarity.pos'
negFilePath = './rt-polaritydata/rt-polaritydata/rt-polarity.neg'

# The dataset files are latin-1 encoded, so we pass the encoding explicitly.
with open(posFilePath, encoding="latin-1") as f:
    posData = f.readlines()
with open(negFilePath, encoding="latin-1") as f:
    negData = f.readlines()
We read the positive and negative reviews from the files and store the contents in local variables.
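If you don't have the dataset handy, here is a minimal, self-contained sketch of the same reading pattern using a temporary stand-in file (the contents are made up; the real files come from the Cornell dataset):

```python
import os
import tempfile

# Write a tiny stand-in file the same way the real dataset files are read.
tmp = tempfile.NamedTemporaryFile(mode="w", encoding="latin-1",
                                  suffix=".pos", delete=False)
tmp.write("a great film\nsuperb acting\n")
tmp.close()

with open(tmp.name, encoding="latin-1") as f:
    posData = f.readlines()  # one review per line, newline included

os.remove(tmp.name)
print(posData)  # -> ['a great film\n', 'superb acting\n']
```

Note that readlines() keeps the trailing newline on each review; that's harmless here because split() later ignores surrounding whitespace.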
Step 4: Data set preparation
We will split the data into a test set and a training set: the first 2500 reviews of each class go to training and the rest to testing.
testSpltIndx = 2500

trainNegRev = negData[:testSpltIndx]
trainPosRev = posData[:testSpltIndx]
# Slice from testSpltIndx (not testSpltIndx+1) so no review is skipped.
testNegRev = negData[testSpltIndx:]
testPosRev = posData[testSpltIndx:]
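To see how the slicing carves up the data, here is a toy sketch with a small list standing in for the ~5331 review lines (the names are illustrative):

```python
# Toy stand-in for the review lines read from a dataset file.
reviews = [f"review {i}" for i in range(10)]

splitIndex = 6
trainReviews = reviews[:splitIndex]  # first 6 reviews for training
testReviews = reviews[splitIndex:]   # remaining 4 reviews for testing

print(len(trainReviews), len(testReviews))  # -> 6 4
```

The two slices are complementary, so every review lands in exactly one of the sets.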
Step 5: Preparing vocabulary
The vocabulary is the set of unique words present in our positive and negative movie reviews.
def getVocab():
    posWordLst = [word for line in trainPosRev for word in line.split()]
    negWordLst = [word for line in trainNegRev for word in line.split()]
    allWordsList = posWordLst + negWordLst
    vocab = list(set(allWordsList))  # set() removes duplicate words
    return vocab
In the above method we read the positive and negative training datasets. We split each sentence into words and add them to a list, then combine the two lists into one. The combined list contains repeated words; to remove them we use set(), which keeps each element only once, and convert the set back to a list. The final list contains all the words in our training movie reviews with no repetitions.
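The same dedup-by-set logic on a couple of toy reviews (made-up data; sorted here only to make the output deterministic, since sets are unordered):

```python
# Two toy training sets standing in for trainPosRev / trainNegRev.
toyPos = ["a great film", "great acting"]
toyNeg = ["a dull film"]

words = [w for line in toyPos + toyNeg for w in line.split()]
vocab = sorted(set(words))  # duplicates like "a", "great", "film" collapse

print(vocab)  # -> ['a', 'acting', 'dull', 'film', 'great']
```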
Step 6: Prepare training data
The Naive Bayes classifier accepts training data in a specific format, so we have to convert our dataset accordingly.
def getTrainingData():
    negTaggedRevLst = [{'review': review.split(), 'label': 'negative'} for review in trainNegRev]
    posTaggedRevLst = [{'review': review.split(), 'label': 'positive'} for review in trainPosRev]
    fullyTaggedTrainData = negTaggedRevLst + posTaggedRevLst
    trainData = [(reviewObj['review'], reviewObj['label']) for reviewObj in fullyTaggedTrainData]
    return trainData
We read the positive and negative training reviews and create an object with the keys 'review' and 'label' for each one. The value of 'review' is the list of words in a review, and the value of 'label' is 'positive' or 'negative'. We combine both lists, then turn each object into a (review, label) tuple and collect the tuples into the final training list.
The final format of the list should be something like this
[(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative'),.....]
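Running the same transformation on a couple of toy reviews (made-up examples) produces exactly that structure:

```python
toyNeg = ["simplistic , silly and tedious ."]
toyPos = ["a beautiful film ."]

negTagged = [{'review': r.split(), 'label': 'negative'} for r in toyNeg]
posTagged = [{'review': r.split(), 'label': 'positive'} for r in toyPos]

trainData = [(d['review'], d['label']) for d in negTagged + posTagged]
print(trainData[0])
# -> (['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative')
```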
Step 7: Extract features
Here the features are a dictionary with each word in our vocabulary as a key, and True or False as the value depending on whether the word is present in the review.
vocabulary = getVocab()

def extractFeatures(review):
    review_words = set(review)
    features = {}
    for word in vocabulary:
        features[word] = (word in review_words)
    return features
In the above method we pass in a review. We loop through all the words in our vocabulary and check whether each one is present in the review. For each review this generates a unique dictionary of features, e.g.:
{'up': False, '77-minute': False, '1940s': False, 'washington': False, 'preference': False, 'riveted': False, 'jingles': False, 'santa': False, 'cheats': False, 'escaped': False, 'whimsy': False, 'warmly': False, 'cheat': False, 'unexplained': False, 'niche': False, 'philbin': False, 'fracasso': False, .....
We won't call this method directly; instead we'll pass it to the NLTK API.
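Since the real vocabulary has thousands of words, here is the same logic with a tiny hand-picked vocabulary so the full feature dictionary fits on one line (toy data, not the real dataset):

```python
vocabulary = ['good', 'bad', 'boring']  # toy vocabulary

def extractFeatures(review):
    review_words = set(review)
    # One True/False entry per vocabulary word.
    return {word: (word in review_words) for word in vocabulary}

print(extractFeatures("what a bad movie".split()))
# -> {'good': False, 'bad': True, 'boring': False}
```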
With that, our data preprocessing is done. Now the exciting part! :D
Step 8: Train and get the Trained classifier
def getTrainedNaiveBayesClassifier(extract_features, trainingData):
    trainingFeatures = nltk.classify.apply_features(extract_features, trainingData)
    trainedNBClassifier = nltk.NaiveBayesClassifier.train(trainingFeatures)
    return trainedNBClassifier

# Train the classifier once so the methods below can use it.
trainedNBClassifier = getTrainedNaiveBayesClassifier(extractFeatures, getTrainingData())
In the above method we pass in our training dataset and the extractFeatures method to get the training features. We then pass the training features to the train method provided by the NLTK API. The result is a trained Naive Bayes classifier.
Step 9: Build the Naive Bayes sentiment calculator
Now that we have the trained NB classifier, let's build a method which accepts a review and outputs whether it's positive or negative.
def nbSentimentCalc(review):
    problemInstance = review.split()
    problemFeatures = extractFeatures(problemInstance)
    return trainedNBClassifier.classify(problemFeatures)
In the above method we pass in a review as a string. We split the review into a list of words and pass this list to the extractFeatures method, which returns the features (a dictionary mapping each vocabulary word to True or False). Finally, the trained classifier's classify method returns the predicted label.
Step 10: Lets test it
It's the moment of truth. Let's pass in some reviews and see what it outputs :D
> nbSentimentCalc("what a bad acting")
'negative'
> nbSentimentCalc("Awesome, must watch!")
'positive'
Step 11: Diagnosis (Optional)
If you want to test how well this classifier performs, you can compute some simple accuracy statistics on the test data.
def getTestReviewSentiments(nbSentimentCalc):
    testNegRes = [nbSentimentCalc(review) for review in testNegRev]
    testPosRes = [nbSentimentCalc(review) for review in testPosRev]
    labelToNum = {'positive': 1, 'negative': -1}
    numericNegRes = [labelToNum[x] for x in testNegRes]
    numericPosRes = [labelToNum[x] for x in testPosRes]
    return {'results-on-positive': numericPosRes, 'results-on-negative': numericNegRes}
The above method passes each review from the test dataset to the trained classifier and collects the outputs in a list. The labels are mapped to numbers (+1 for positive, -1 for negative), and both lists are returned in a single object keyed by the true class of the reviews.
def runDiagnostics(reviewResult):
    posReviewRes = reviewResult['results-on-positive']
    negReviewRes = reviewResult['results-on-negative']
    numTruePositive = sum(x > 0 for x in posReviewRes)
    numTrueNegative = sum(x < 0 for x in negReviewRes)
    pctTruePos = float(numTruePositive) / len(posReviewRes)
    pctTrueNeg = float(numTrueNegative) / len(negReviewRes)
    totalAccurate = numTruePositive + numTrueNegative
    total = len(posReviewRes) + len(negReviewRes)
    print("Accuracy on +ve reviews = " + "%.2f" % (pctTruePos * 100) + "%")
    print("Accuracy on -ve reviews = " + "%.2f" % (pctTrueNeg * 100) + "%")
    print("Overall accuracy = " + "%.2f" % (float(totalAccurate) / total * 100) + "%")
The above method reads the test results, counts how many positive reviews were correctly classified as positive and how many negative reviews as negative, then divides each count by the number of reviews in that class. Finally it prints the per-class and overall accuracy percentages.
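To make the arithmetic concrete, here is the same accuracy computation on tiny made-up results (illustrative numbers, not real classifier output):

```python
# +1 = classified positive, -1 = classified negative.
posReviewRes = [1, 1, -1, 1]    # results on reviews that are truly positive
negReviewRes = [-1, -1, 1, -1]  # results on reviews that are truly negative

numTruePositive = sum(x > 0 for x in posReviewRes)  # 3 of 4 correct
numTrueNegative = sum(x < 0 for x in negReviewRes)  # 3 of 4 correct

pctTruePos = numTruePositive / len(posReviewRes)
pctTrueNeg = numTrueNegative / len(negReviewRes)
overall = (numTruePositive + numTrueNegative) / (len(posReviewRes) + len(negReviewRes))

print(pctTruePos, pctTrueNeg, overall)  # -> 0.75 0.75 0.75
```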
Done! <3
The whole source code is available on Google Colab and GitHub :)
I learnt this from a tutorial on Pluralsight. Thanks to Vitthal Srinivasan!
Let me know what you guys think! Your comments and feedback are welcome :)