ML based sentiment analysis of movie reviews

Today let's learn about sentiment analysis of movie reviews. We will train a Naive Bayes classifier, a binary classifier provided by the NLTK Python library. After training, the classifier will label a movie review as positive or negative. The dataset is provided by Cornell University. For this tutorial you need to know the basics of Python and how to work with external libraries.

Step 1: Install NLTK library

$ pip install nltk

The above command is a straightforward installation of NLTK, just as you would install any other library.

Step 2: Download the dataset: Download the sentence polarity dataset v1.0, which has 5331 positive and 5331 negative reviews, from https://www.cs.cornell.edu/people/pabo/movie-review-data/. After downloading, extract the archive and keep the folder in the same location as your Python file.

Open your favourite text editor. Let's start building!

Step 3: Reading the movie reviews

import nltk
posFilePath='./rt-polaritydata/rt-polaritydata/rt-polarity.pos'
negFilePath='./rt-polaritydata/rt-polaritydata/rt-polarity.neg'

with open(posFilePath, encoding="latin-1") as f:
    posData=f.readlines()
    
with open(negFilePath,encoding="latin-1") as f:
    negData=f.readlines()

We read the negative and positive reviews from the files and store their contents in local variables.
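As a side note, readlines() keeps the trailing newline on each line; that is fine here because str.split() (used in later steps) discards surrounding whitespace. A quick self-contained check using an in-memory file (the two review lines are invented for illustration):

```python
import io

# simulate a review file with two lines
f = io.StringIO("a charming film .\nfunny and heartfelt .\n")
lines = f.readlines()

print(lines[0])          # readlines() keeps the trailing '\n'
print(lines[0].split())  # split() discards it along with the other whitespace
```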

Step 4: Data set preparation

We will split the data into a test set and a training set:

testSpltIndx=2500
testNegRev=negData[testSpltIndx:]
testPosRev=posData[testSpltIndx:]

trainNegRev=negData[:testSpltIndx]
trainPosRev=posData[:testSpltIndx]
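Slicing with `data[:idx]` for training and `data[idx:]` for testing partitions the list with no overlap and no gap; here is a quick sanity check on a toy list (invented data, standing in for the review lists):

```python
data = list(range(10))   # stand-in for the list of reviews
splitIdx = 6

train = data[:splitIdx]  # elements 0..5
test = data[splitIdx:]   # elements 6..9

print(len(train), len(test))                 # sizes add up to len(data)
assert len(train) + len(test) == len(data)   # nothing lost
assert set(train) | set(test) == set(data)   # nothing duplicated or dropped
```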

Step 5: Preparing vocabulary

The vocabulary is the set of all words present in our positive and negative movie reviews.

def getVocab():
    posWordLst = [word for line in trainPosRev for word in line.split()]
    negWordLst = [word for line in trainNegRev for word in line.split()]
    allWordsList = posWordLst+negWordLst
    vocab = list(set(allWordsList))
    
    return vocab

In the above method we read the positive and negative training datasets. We split each sentence into words and add them to a list, then combine both lists into one. This list contains repeated words; to remove the duplicates we use set(), which keeps only one copy of each element. The set is then converted back into a list. The final list contains every word in our training movie reviews, with no repeats.
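The deduplication step can be seen on a toy example (the three reviews are invented for illustration):

```python
trainPosRev = ["a charming , funny film", "funny and heartfelt"]
trainNegRev = ["a dull , tedious mess"]

posWordLst = [word for line in trainPosRev for word in line.split()]
negWordLst = [word for line in trainNegRev for word in line.split()]
allWordsList = posWordLst + negWordLst

vocab = list(set(allWordsList))
print(len(allWordsList))  # 13 words in total
print(len(vocab))         # 10 after removing repeats ('a', ',' and 'funny' appear twice)
```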

Step 6: Prepare training data

The Naive Bayes classifier accepts training data in a specific format, so we have to convert our dataset into that format.

def getTrainingData():
    negTaggedRevLst = [{'review':review.split(),'label':'negative'} for review in trainNegRev]
    posTaggedRevLst = [{'review':review.split(),'label':'positive'} for review in trainPosRev]
    fullyTaggedTrainData = negTaggedRevLst+posTaggedRevLst

    trainData =[(reviewObj['review'],reviewObj['label']) for reviewObj in fullyTaggedTrainData]
    return trainData

We read the positive and negative training reviews and create an object with keys 'review' and 'label'. The value of 'review' is the list of words in a review, and the value of 'label' is 'positive' or 'negative'. We combine both lists, then convert each object into a (review, label) tuple and collect these tuples into the final training list.

The final format of the list should look something like this:

[(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative'),.....]
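The same transformation on a toy example (the positive review is invented for illustration):

```python
trainNegRev = ["simplistic , silly and tedious ."]
trainPosRev = ["a charming , funny film ."]

# tag each review with its label, as in getTrainingData()
negTaggedRevLst = [{'review': review.split(), 'label': 'negative'} for review in trainNegRev]
posTaggedRevLst = [{'review': review.split(), 'label': 'positive'} for review in trainPosRev]
fullyTaggedTrainData = negTaggedRevLst + posTaggedRevLst

# convert each tagged object to a (word list, label) tuple
trainData = [(reviewObj['review'], reviewObj['label']) for reviewObj in fullyTaggedTrainData]
print(trainData[0])
# (['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'negative')
```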

Step 7: Extract features

Here the features are a dictionary with each word in our vocabulary as a key and True or False as the value, depending on whether the word is present in a review.

vocabulary = getVocab()


def extractFeatures(review):
    review_words=set(review)
    features={}
    for word in vocabulary:
        features[word]=(word in review_words)
    
    return features

In the above method we pass in a review. We loop through all the words in our vocabulary and check whether each word is present in the review. For each review we generate a unique dictionary of features. E.g.:

{'up': False, '77-minute': False, '1940s': False, 'washington': False, 'preference': False, 'riveted': False, 'jingles': False, 'santa': False, 'cheats': False, 'escaped': False, 'whimsy': False, 'warmly': False, 'cheat': False, 'unexplained': False, 'niche': False, 'philbin': False, 'fracasso': False, .....
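With a tiny, hand-picked vocabulary the output is easier to see (the three vocabulary words are made up for this sketch):

```python
vocabulary = ['silly', 'charming', 'tedious']

def extractFeatures(review):
    review_words = set(review)
    features = {}
    for word in vocabulary:
        # True if the vocabulary word occurs in this review, else False
        features[word] = (word in review_words)
    return features

print(extractFeatures(['what', 'a', 'silly', 'film']))
# {'silly': True, 'charming': False, 'tedious': False}
```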

We will not call this method directly; instead we will pass it to the NLTK API.

Our data preprocessing is now done. Now for the exciting part! :D

Step 8: Train and get the trained classifier

def getTrainedNaiveBayesClassifier(extract_features,trainingData):
    trainingFeatures = nltk.classify.apply_features(extract_features,trainingData)
    trainedNBClassifier = nltk.NaiveBayesClassifier.train(trainingFeatures)
    return trainedNBClassifier  

In the above method we pass in our training dataset and the extractFeatures method to build the training features. We then pass these features to the train method provided by the NLTK API. The result is a trained Naive Bayes classifier.
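nltk.classify.apply_features is essentially a lazy version of mapping the feature extractor over the (review, label) pairs. The eager equivalent, sketched with a toy vocabulary and toy training pairs (both invented for illustration), looks like this:

```python
vocabulary = ['silly', 'charming']

def extractFeatures(review):
    review_words = set(review)
    return {word: (word in review_words) for word in vocabulary}

trainingData = [(['a', 'silly', 'film'], 'negative'),
                (['a', 'charming', 'film'], 'positive')]

# eager equivalent of nltk.classify.apply_features(extractFeatures, trainingData)
trainingFeatures = [(extractFeatures(review), label) for review, label in trainingData]
print(trainingFeatures[0])
# ({'silly': True, 'charming': False}, 'negative')
```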


Step 9: Build the Naive Bayes sentiment calculator

Now that we have the trained NB classifier, let's build a method which accepts a review and outputs whether it is positive or negative.

trainingData = getTrainingData()
trainedNBClassifier = getTrainedNaiveBayesClassifier(extractFeatures, trainingData)

def nbSentimentCalc(review):
    problemInstance = review.split()
    problemFeatures = extractFeatures(problemInstance)
    return trainedNBClassifier.classify(problemFeatures)

In the above method we pass in a review as a string. We split it into a list of words and pass the list to the extractFeatures method, which returns the feature dictionary (each vocabulary word mapped to True or False). Finally, we pass the features to the classifier's classify method, which returns the label.

Step 10: Let's test it

It's the moment of truth. Let's pass in some reviews and see what the classifier outputs :D

> nbSentimentCalc("what a bad acting")
'negative'

> nbSentimentCalc("Awesome, must watch!")
'positive'

Step 11: Diagnosis (Optional)

If you want to test how well this classifier works for your problem, you can run some simple statistics comparing its predictions against the test data.

def getTestReviewSentiment(nbSentimentCalc):
    testNegRes = [nbSentimentCalc(review) for review in testNegRev]
    testPosRes = [nbSentimentCalc(review) for review in testPosRev]
    
    labelToNum = {'positive':1,'negative':-1}
    numericNegRes = [labelToNum[x] for x in testNegRes]
    numericPosRes = [labelToNum[x] for x in testPosRes]
    return {'results-on-positive':numericPosRes,'results-on-negative':numericNegRes}

The above method passes each review from the test dataset to the trained classifier and collects the outputs in a list. Both result lists are then mapped to numeric labels (+1 for positive, -1 for negative) and returned together in one object.

def runDiagnostics(reviewResult):
    posReviewRes = reviewResult['results-on-positive']
    negReviewRes = reviewResult['results-on-negative']
    
    numTruePositive = sum(x>0 for x in posReviewRes)
    numTrueNegative = sum(x<0 for x in negReviewRes)
    
    pctTruePos = float(numTruePositive)/len(posReviewRes)
    pctTrueNeg = float(numTrueNegative)/len(negReviewRes)
    totalAccurate = numTruePositive+numTrueNegative
    total = len(posReviewRes)+len(negReviewRes)
    print("Accuracy on +ve reviews = "+"%.2f"%(pctTruePos*100)+"%")
    print("Accuracy on -ve reviews = "+"%.2f"%(pctTrueNeg*100)+"%")
    print("Overall accuracy = "+"%.2f"%(float(totalAccurate)/total*100)+"%")

The above method takes the test results and counts how many positive reviews were correctly classified as positive and how many negative reviews as negative. Each count is divided by the size of its test set to get the per-class accuracy, and the two counts together, divided by the total number of test reviews, give the overall accuracy.
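The arithmetic can be checked on hypothetical results (the +1/-1 lists below are invented, not real classifier output):

```python
posReviewRes = [1, 1, -1, 1]    # hypothetical results on 4 positive test reviews
negReviewRes = [-1, -1, 1, -1]  # hypothetical results on 4 negative test reviews

# fraction of each class the classifier got right
pctTruePos = float(sum(x > 0 for x in posReviewRes)) / len(posReviewRes)
pctTrueNeg = float(sum(x < 0 for x in negReviewRes)) / len(negReviewRes)

print("%.2f%%" % (pctTruePos * 100))  # 75.00%
print("%.2f%%" % (pctTrueNeg * 100))  # 75.00%
```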

Done! <3

The whole source code is available on Google Colab and GitHub :)

I learnt this from a tutorial on Pluralsight. Thanks to Vitthal Srinivasan!

Let me know what you guys think! Your comments and feedback are welcome :)
