ML-based sentiment analysis of movie reviews
Abhishek Kori
Product Manager | Wholesale Lending, Commercial Real estate | Skills: Product, Project Management, UI/UX, Programming, Data Analysis, Data Governance, Visualization, Communication | PSPO?
Today let's learn about sentiment analysis of movie reviews. We will train a Naive Bayes classifier, a binary classifier provided by the NLTK Python library. After training, the classifier will classify a movie review as positive or negative. The dataset is provided by Cornell University. For this tutorial you need to know the basics of Python and working with external libraries.
Step 1: Install NLTK library
$ pip install nltk
The above command is a straightforward installation of NLTK, just as you would do for any other library.
Step 2: Download the dataset: Download the sentence polarity dataset v1.0, which has 5331 positive and 5331 negative reviews, from https://www.cs.cornell.edu/people/pabo/movie-review-data/. After downloading, extract the folder and place it in the same directory as your Python file.
Open your favourite text editor. Let's start building!
Step 3: Reading the movie reviews
import nltk

posFilePath = './rt-polaritydata/rt-polaritydata/rt-polarity.pos'
negFilePath = './rt-polaritydata/rt-polaritydata/rt-polarity.neg'

# The dataset files are latin-1 encoded, so we pass the encoding explicitly.
with open(posFilePath, encoding="latin-1") as f:
    posData = f.readlines()
with open(negFilePath, encoding="latin-1") as f:
    negData = f.readlines()
We read the positive and negative reviews from the files and store the contents in local variables.
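If you don't have the dataset handy, here is a minimal, self-contained sketch of the same reading pattern using a temporary stand-in file (the contents are made up; the real files come from the Cornell dataset):

```python
import os
import tempfile

# Write a tiny stand-in file the same way the real dataset files are read.
tmp = tempfile.NamedTemporaryFile(mode="w", encoding="latin-1",
                                  suffix=".pos", delete=False)
tmp.write("a great film\nsuperb acting\n")
tmp.close()

with open(tmp.name, encoding="latin-1") as f:
    posData = f.readlines()  # one review per line, newline included

os.remove(tmp.name)
print(posData)  # -> ['a great film\n', 'superb acting\n']
```

Note that readlines() keeps the trailing newline on each review; that's harmless here because split() later ignores surrounding whitespace.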
Step 4: Data set preparation
We will split the data into a test set and a training set: the first 2500 reviews of each class go to training and the rest to testing.
testSpltIndx = 2500

trainNegRev = negData[:testSpltIndx]
trainPosRev = posData[:testSpltIndx]
# Slice from testSpltIndx (not testSpltIndx+1) so no review is skipped.
testNegRev = negData[testSpltIndx:]
testPosRev = posData[testSpltIndx:]
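To see how the slicing carves up the data, here is a toy sketch with a small list standing in for the ~5331 review lines (the names are illustrative):

```python
# Toy stand-in for the review lines read from a dataset file.
reviews = [f"review {i}" for i in range(10)]

splitIndex = 6
trainReviews = reviews[:splitIndex]  # first 6 reviews for training
testReviews = reviews[splitIndex:]   # remaining 4 reviews for testing

print(len(trainReviews), len(testReviews))  # -> 6 4
```

The two slices are complementary, so every review lands in exactly one of the sets.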
Step 5: Preparing vocabulary
The vocabulary is the set of unique words present in our positive and negative movie reviews.
def getVocab():
    posWordLst = [word for line in trainPosRev for word in line.split()]
    negWordLst = [word for line in trainNegRev for word in line.split()]
    allWordsList = posWordLst + negWordLst
    vocab = list(set(allWordsList))  # set() removes duplicate words
    return vocab
In the above method we read the positive and negative training datasets. We split each sentence into words and add them to a list, then combine the two lists into one. The combined list contains repeated words; to remove them we use set(), which keeps each element only once, and convert the set back to a list. The final list contains all the words in our training movie reviews with no repetitions.
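The same dedup-by-set logic on a couple of toy reviews (made-up data; sorted here only to make the output deterministic, since sets are unordered):

```python
# Two toy training sets standing in for trainPosRev / trainNegRev.
toyPos = ["a great film", "great acting"]
toyNeg = ["a dull film"]

words = [w for line in toyPos + toyNeg for w in line.split()]
vocab = sorted(set(words))  # duplicates like "a", "great", "film" collapse

print(vocab)  # -> ['a', 'acting', 'dull', 'film', 'great']
```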
Step 6: Prepare training data
The Naive Bayes classifier accepts training data in a specific format, so we have to convert our dataset accordingly.
def getTrainingData():
    negTaggedRevLst = [{'review': review.split(), 'label': 'negative'} for review in trainNegRev]
    posTaggedRevLst = [{'review': review.split(), 'label': 'positive'} for review in trainPosRev]
    fullyTaggedTrainData = negTaggedRevLst + posTaggedRevLst
    trainData = [(reviewObj['review'], reviewObj['label']) for reviewObj in fullyTaggedTrainData]
    return trainData
We read the positive and negative training reviews and create an object with the keys 'review' and 'label' for each one. The value of 'review' is the list of words in a review, and the value of 'label' is 'positive' or 'negative'. We combine both lists, then turn each object into a (review, label) tuple and collect the tuples into the final training list.
The final format of the list should be something like this
[(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative'),.....]
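Running the same transformation on a couple of toy reviews (made-up examples) produces exactly that structure:

```python
toyNeg = ["simplistic , silly and tedious ."]
toyPos = ["a beautiful film ."]

negTagged = [{'review': r.split(), 'label': 'negative'} for r in toyNeg]
posTagged = [{'review': r.split(), 'label': 'positive'} for r in toyPos]

trainData = [(d['review'], d['label']) for d in negTagged + posTagged]
print(trainData[0])
# -> (['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative')
```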
Step 7: Extract features
Here the features are a dictionary with each word in our vocabulary as a key, and True or False as the value depending on whether the word is present in the review.
vocabulary = getVocab()

def extractFeatures(review):
    review_words = set(review)
    features = {}
    for word in vocabulary:
        features[word] = (word in review_words)
    return features
In the above method we pass in a review. We loop through all the words in our vocabulary and check whether each one is present in the review. For each review this generates a unique dictionary of features, e.g.:
{'up': False, '77-minute': False, '1940s': False, 'washington': False, 'preference': False, 'riveted': False, 'jingles': False, 'santa': False, 'cheats': False, 'escaped': False, 'whimsy': False, 'warmly': False, 'cheat': False, 'unexplained': False, 'niche': False, 'philbin': False, 'fracasso': False, .....
We won't call this method directly; instead we'll pass it to the NLTK API.
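Since the real vocabulary has thousands of words, here is the same logic with a tiny hand-picked vocabulary so the full feature dictionary fits on one line (toy data, not the real dataset):

```python
vocabulary = ['good', 'bad', 'boring']  # toy vocabulary

def extractFeatures(review):
    review_words = set(review)
    # One True/False entry per vocabulary word.
    return {word: (word in review_words) for word in vocabulary}

print(extractFeatures("what a bad movie".split()))
# -> {'good': False, 'bad': True, 'boring': False}
```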
With that, our data preprocessing is done. Now the exciting part! :D
Step 8: Train and get the Trained classifier
def getTrainedNaiveBayesClassifier(extract_features, trainingData):
    trainingFeatures = nltk.classify.apply_features(extract_features, trainingData)
    trainedNBClassifier = nltk.NaiveBayesClassifier.train(trainingFeatures)
    return trainedNBClassifier

# Train the classifier once so the methods below can use it.
trainedNBClassifier = getTrainedNaiveBayesClassifier(extractFeatures, getTrainingData())
In the above method we pass in our training dataset and the extractFeatures method to get the training features. We then pass the training features to the train method provided by the NLTK API. The result is a trained Naive Bayes classifier.
Step 9: Build the Naive Bayes sentiment calculator
Now that we have the trained NB classifier, let's build a method which accepts a review and outputs whether it's positive or negative.
def nbSentimentCalc(review):
    problemInstance = review.split()
    problemFeatures = extractFeatures(problemInstance)
    return trainedNBClassifier.classify(problemFeatures)
In the above method we pass in a review as a string. We split the review into a list of words and pass this list to the extractFeatures method, which returns the features (a dictionary mapping each vocabulary word to True or False). Finally, the trained classifier's classify method returns the predicted label.
Step 10: Lets test it
It's the moment of truth. Let's pass in some reviews and see what it outputs :D
> nbSentimentCalc("what a bad acting")
'negative'
> nbSentimentCalc("Awesome, must watch!")
'positive'
Step 11: Diagnosis (Optional)
If you want to test how well this classifier performs, you can compute some simple accuracy statistics on the test data.
def getTestReviewSentiments(nbSentimentCalc):
    testNegRes = [nbSentimentCalc(review) for review in testNegRev]
    testPosRes = [nbSentimentCalc(review) for review in testPosRev]
    labelToNum = {'positive': 1, 'negative': -1}
    numericNegRes = [labelToNum[x] for x in testNegRes]
    numericPosRes = [labelToNum[x] for x in testPosRes]
    return {'results-on-positive': numericPosRes, 'results-on-negative': numericNegRes}
The above method passes each review from the test dataset to the trained classifier and collects the outputs in a list. The labels are mapped to numbers (+1 for positive, -1 for negative), and both lists are returned in a single object keyed by the true class of the reviews.
def runDiagnostics(reviewResult):
    posReviewRes = reviewResult['results-on-positive']
    negReviewRes = reviewResult['results-on-negative']
    numTruePositive = sum(x > 0 for x in posReviewRes)
    numTrueNegative = sum(x < 0 for x in negReviewRes)
    pctTruePos = float(numTruePositive) / len(posReviewRes)
    pctTrueNeg = float(numTrueNegative) / len(negReviewRes)
    totalAccurate = numTruePositive + numTrueNegative
    total = len(posReviewRes) + len(negReviewRes)
    print("Accuracy on +ve reviews = " + "%.2f" % (pctTruePos * 100) + "%")
    print("Accuracy on -ve reviews = " + "%.2f" % (pctTrueNeg * 100) + "%")
    print("Overall accuracy = " + "%.2f" % (float(totalAccurate) / total * 100) + "%")
The above method reads the test results, counts how many positive reviews were correctly classified as positive and how many negative reviews as negative, then divides each count by the number of reviews in that class. Finally it prints the per-class and overall accuracy percentages.
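To make the arithmetic concrete, here is the same accuracy computation on tiny made-up results (illustrative numbers, not real classifier output):

```python
# +1 = classified positive, -1 = classified negative.
posReviewRes = [1, 1, -1, 1]    # results on reviews that are truly positive
negReviewRes = [-1, -1, 1, -1]  # results on reviews that are truly negative

numTruePositive = sum(x > 0 for x in posReviewRes)  # 3 of 4 correct
numTrueNegative = sum(x < 0 for x in negReviewRes)  # 3 of 4 correct

pctTruePos = numTruePositive / len(posReviewRes)
pctTrueNeg = numTrueNegative / len(negReviewRes)
overall = (numTruePositive + numTrueNegative) / (len(posReviewRes) + len(negReviewRes))

print(pctTruePos, pctTrueNeg, overall)  # -> 0.75 0.75 0.75
```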
Done! <3
The whole source code is available on Google Colab and GitHub :)
I learnt this from a tutorial on Pluralsight. Thanks to Vitthal Srinivasan!
Let me know what you guys think! Your comments and feedback are welcome :)