Predicting Yelp Rating from Review Text Using Sentiment Analysis
Nagham Jaoudeh
Senior Data Product Developer & Analyst at International Rescue Committee
Abstract
Sentiment analysis and prediction of review ratings on the Yelp reviews dataset were carried out using various machine learning algorithms. Jupyter Notebook was used with many models for better analysis and identification of links between reviews and ratings. The dataset is a CSV file consisting of 10 columns and almost 10,000 records dating between 2005 and 2013. The variables consist of the business ID, the date, the review type, the review text, the stars, and the vote tags that note whether the review was considered cool, useful, or funny. A correlation analysis was done in order to track any link between the voting columns or between the number of stars and the review itself. Multinomial Naive Bayes, Random Forest, Decision Tree, Support Vector Machine, Gradient Boosting Classifier, K Nearest Neighbor, XGBoost Classifier, and multilayer perceptron classifier were all used in the modeling phase. A comparison of the F score of each algorithm was done in order to choose the best model to predict a random positive review, a random average review, and a random negative review.
Introduction
Natural language processing (NLP) serves numerous use cases when dealing with text or other unstructured data. Imagine someone working for Google News who wants to group news articles by topic. Or imagine an employee at a law firm who has to find documents relevant to a particular case. It would be very tiring and time-consuming to manually sift through thousands of articles. This is where NLP comes in handy. In this research, a sentiment analysis model will be built in order to predict whether a user will like a local business or not, based on their review on Yelp (Luca, 2011).
Yelp is an American multinational corporation founded in 2004 that aims to help people locate local businesses based on social networking functionality and reviews. The main purpose of Yelp is to provide a platform for customers to write reviews, pairing a star rating with an open-ended comment. Yelp data is reliable, up-to-date, and has wide coverage of all kinds of businesses. Millions of people use Yelp, and empirical data demonstrated that Yelp restaurant reviews affected consumers' food choice decision-making; a one-star increase led to a 5 to 9% increase in revenue of independent restaurants (Luca, 2011). With the rapid growth of visitors and users, there’s great potential for the Yelp restaurant reviews dataset as a valuable insight repository. An increasing number of customers rely on Yelp for food hunting. Therefore, reviews on Yelp have become an important index for the food industry. In recent years a growing body of research has focused on Yelp. The cited papers include research on the relationship between reviews, reputation, and revenue (Luca, 2011), the Groupon effect (Byers, Mitzenmacher & Zervas, 2012), and an exploration of why people use Yelp (Hicks et al, 2012). Since reviews make up the greatest component of Yelp, investigations into them via machine learning techniques were expected to yield interesting discoveries. For instance, a fake review filter was developed (Mukherjee, Venkataraman & Liu, 2013) and used to test the efficiency of Yelp’s algorithm for detecting abnormal, spam-like reviews. This paper also applies natural language processing (NLP) to Yelp data, but it focuses on sentiment analysis, conducted with a high-efficiency support vector machine (SVM) model. Sentiment analysis, also known as opinion mining, is the process of determining whether a text unit is positive or negative.
It can have a wide range of applications, such as automatically detecting feedback towards products, news, and characters, or improving customer relationship models (Hicks et al, 2012).
The overall objective of this project is to predict customer satisfaction through sentiment analysis using many variables. The most important variables in the dataset are the text written by the customer as a review and the number of stars. These two fields will be used in the training dataset to train the model, whose output is then used to predict the satisfaction level of each customer. Various approaches have been used to evaluate the sentiment underlying words, expressions, or documents. Some of the most common machine learning algorithms used in NLP include Naive Bayes (NB), Maximum Entropy (ME), Support Vector Machine (SVM) (Joachims, 1998), and unsupervised learning (Turney, 2002). Before the recent rapid development of neural network-based methods (Santo & Gatti, 2014), linear SVMs often gave the best performance in NLP (Mullen & Collier, 2004).
Literature Review
This research was done in order to predict the review ratings on a Yelp dataset using various machine learning algorithms and sentiment analysis techniques. In the article "Understanding the information and communication technology needs of the e-humanist," Tomas and Obrian (2006) discussed the needs of humanists with respect to information and communication technology (ICT), using artificial intelligence and machine learning algorithms to better understand human communication and the evaluation of products and services in the market. In another research paper, the authors (Bejarano, Jindal, 2010) discussed users’ influence in the Yelp recommender system. Recommender systems collect information about users and businesses and how they are related. This relation is given in terms of reviews and votes on reviews. User reviews gather opinions, rating scores, and review influence. The latter component is crucial for determining which users are more relevant in a recommender system, that is, the users whose reviews are more popular than the average user’s reviews. Naive Bayes is one of the machine learning algorithms most commonly used for predicting reviews.
In their journal article, Sánchez-Franco, M. J., Navarro-García, A., & Rondán-Cataluña, F. J. studied a Naive Bayes strategy for classifying customer satisfaction, based on online reviews of hospitality services. They assessed whether terms related to guest experience can be used to identify ways to enhance hospitality services. A study was conducted to empirically identify relevant features to classify customer satisfaction based on 47,172 reviews of 33 Las Vegas hotels registered with Yelp. The resulting model can help hotel managers understand guests' satisfaction. It can help managers process vast amounts of review data by using a supervised machine learning approach. The Naive Bayes algorithm classifies reviews of hotels with high precision and recall and with a low computational cost. These results are more reliable and accurate than prior statistical results based on limited sample data and provide insights into how hotels can improve their services based on, for example, staff experience, professionalism, tangible and experiential factors, and gambling-based attractions (Sánchez-Franco, Navarro-García, & Rondán-Cataluña, 2019).
Discovering foodborne illness in online restaurant reviews is another study, which developed a system for the discovery of foodborne illness mentioned in online Yelp restaurant reviews using text classification (Effland et al., 2018). The system is used by the New York City Department of Health and Mental Hygiene (DOHMH) to monitor Yelp for foodborne illness complaints. The materials and methods consisted of building classifiers for 2 tasks: (1) determining if a review indicated a person experiencing foodborne illness and (2) determining if a review indicated multiple people experiencing foodborne illness. The authors first developed a prototype classifier in 2012 for both tasks using a small, labeled dataset. Over years of system deployment, DOHMH epidemiologists labeled over 13,500 (Jobs, 2014) reviews selected by this classifier. They used these biased data and a sample of complementary reviews in a principled bias-adjusted training scheme to develop significantly improved classifiers. Finally, they performed an error analysis of the best resulting classifiers. They found that logistic regression trained with bias-adjusted augmented data performed best. The error analysis revealed that the inability of the models to account for long phrases caused the most errors. Their bias-adjusted training scheme illustrates how to improve a classification system iteratively by exploiting available biased labeled data. In conclusion, the system has been instrumental in the identification of 10 outbreaks and 8523 complaints of foodborne illness associated with New York City restaurants since July 2012. The evaluation has identified strong classifiers for both tasks, whose deployment will allow DOHMH epidemiologists to more effectively monitor Yelp for foodborne illness.
Approach & Methodology
For this project, Multinomial Naive Bayes, Random Forest, Decision Tree, Support Vector Machine, Gradient Boosting Classifier, K Nearest Neighbor, XGBoost Classifier, and multilayer perceptron classifier will be used in order to predict the Yelp rating from the review text. The full data set is rather large, and what Yelp does is build a big ensemble model: some models are built using text data, others using tabular data, and these models are then ensembled to build the most robust model possible. But the point of this project is to explore some NLP advances to see how well ratings can be predicted from the review text alone (Luca, 2011). The data is a random sample of 10,000 reviews. This is about 15% of the total number of restaurant reviews, but it should be plenty to build reasonably accurate models that don't take hours to train. The data will be extracted from Kaggle. Jupyter Notebook with Python will be utilized to conduct exploratory analysis, as well as to create a predictive model. The baseline model will be a Naive Bayes linear model. Generally, a linear model is appropriate and has the advantage of being fast to train. A Naive Bayes linear classifier is a solid baseline for NLP problems (Prince, 2007). The exploratory data analysis, descriptive statistical analysis, and data visualization analysis will be performed in order to look at the variables identified through the literature research so that patterns and possible relationships can be identified and turned into actionable insight for Yelp.
Data Collection & Wrangling
For this research, there is one dataset, named “yelp”. This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. The dataset contains information about businesses across 11 metropolitan areas in four countries. The fields are listed below with their types and descriptions:
Field Name     Field Type     Description
Business_ID    varchar        Business ID to identify the business
Date           date/time      date
Review_id      varchar        review category
Stars          int            number of stars to determine customer satisfaction
Text           str            description of the review
Type           str            review type
User_ID        varchar        to identify the user
Cool           str            vote
Useful         str            vote
Funny          str            vote
Once the data was collected, it was loaded into Jupyter Notebook to begin cleaning and formatting. Using the head() method, the column headers and data format can be observed. During the initial exploration, 10,000 reviews were randomly selected. This is about 15% of the total number of restaurant reviews, but it should be plenty of data to build reasonably accurate models that don't take hours to train. Note: during the tuning of these models, even smaller samples of this data were taken so that models could be iterated on and trained in a matter of seconds.
Descriptive Statistics and Exploratory Data Analysis - Jupyter Notebook using Python
First, in Python, the user should always start by importing the needed libraries. In this project the needed libraries are: pandas, NumPy, matplotlib.pyplot, seaborn, nltk (with nltk.download('stopwords')), string, and math, along with CountVectorizer, train_test_split, cross_val_score, classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, and GridSearchCV from Scikit-learn. After importing the needed libraries, the user should load the data and see the details of the imported file (see Figure 1).
The next step is to create a new column in the dataset for the number of words in the review column. This is done in order to see if there is any correlation between the number of words in the review and the number of “stars”. Seaborn’s FacetGrid makes it possible to create a grid of histograms placed side by side and to see if there is any relationship between the newly created text length feature and the star rating. Overall, the distribution of text length looks similar across all five ratings. However, the number of text reviews is skewed heavily towards the 4-star and 5-star ratings, which may cause some issues later on in the process. From a box plot of text length by rating, it looks like the 1-star and 2-star ratings have much longer text, but there are many outliers (which can be seen as points above the boxes). Because of this, text length may not be such a useful feature to consider after all (Figure 3).
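The word-count feature described above can be sketched as follows. This is a minimal illustration on a tiny hand-made data frame, not the notebook's actual code; the rows and the column names `text`, `stars`, and `text_length` are assumptions made for the example.

```python
import pandas as pd

# Tiny hand-made stand-in for the Yelp data (hypothetical rows).
yelp = pd.DataFrame({
    "stars": [5, 1, 4, 5],
    "text": [
        "Great food and friendly staff",
        "Terrible service, cold food, and a very long wait",
        "Nice patio",
        "Loved it",
    ],
})

# New column: number of whitespace-separated words in each review.
yelp["text_length"] = yelp["text"].str.split().str.len()

# Average review length per star rating.
mean_length_by_stars = yelp.groupby("stars")["text_length"].mean()
print(mean_length_by_stars)
```

With a column like this in place, a side-by-side grid of histograms could be drawn with something like `sns.FacetGrid(yelp, col="stars").map(plt.hist, "text_length")`, assuming seaborn is imported as `sns` and matplotlib.pyplot as `plt`.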
The next step is to compute the mean value of the vote columns for each star rating and to look for correlations between the voting columns. The data is grouped by the star rating to check if there’s a correlation between features such as cool, useful, and funny. The .corr() method from Pandas can be used to find any correlations in the data frame (Figure 4 and Figure 5).
Looking at the heat map, funny is strongly correlated with useful, and useful seems strongly correlated with text length. There is also a negative correlation between cool and the other three features.
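The group-then-correlate step can be sketched as below. The data frame is a small made-up example, so the correlation values it prints are illustrative only and are not the values behind the figures; the column names are assumed to match the dataset fields listed earlier.

```python
import pandas as pd

# Hypothetical vote counts for eight reviews.
yelp = pd.DataFrame({
    "stars":  [1, 1, 2, 3, 4, 4, 5, 5],
    "cool":   [0, 1, 0, 1, 2, 1, 3, 2],
    "useful": [3, 2, 2, 1, 1, 2, 1, 0],
    "funny":  [2, 1, 1, 0, 1, 0, 0, 1],
})

# Mean of each vote column per star rating.
stars_mean = yelp.groupby("stars")[["cool", "useful", "funny"]].mean()

# Pairwise Pearson correlations between the per-rating means.
corr = stars_mean.corr()
print(corr.round(2))
```

A matrix like `corr` is what a heat map (for example `sns.heatmap(corr, annot=True)`) would display.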
Now the data should be divided into two parts: the text and the stars. After that, the text should be cleaned and stop words removed. The main issue with this data is that it is all in plain-text format. The classification algorithm needs some sort of feature vector in order to perform the classification task. The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word in a text is represented by one number. To enable Scikit-learn algorithms to work on the text, each review must be converted into a vector. Scikit-learn’s CountVectorizer can be used to convert the text collection into a matrix of token counts. The result is a 2-D matrix in which each row is a review and each column is a unique word. CountVectorizer should be imported and an instance fit to the review text (stored in X), passing in the text_process function as the analyzer. To illustrate how the vectorizer works, a random review can be checked and its bag-of-words counts obtained as a vector. Here’s the twenty-fifth review as plain-text:
review_25 = X[24]
review_25
Output: “I love this place! I have been coming here for ages. My favorites: Elsa's Chicken sandwich, any of their burgers, dragon chicken wings, china's little chicken sandwich, and the hot pepper chicken sandwich. The atmosphere is always fun and the art they display is very abstract but totally cool!”
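The vectorization step described above can be sketched as follows. The paper does not show its text_process function, so the version here is a hypothetical stand-in: it strips punctuation, lowercases, and drops words from a small hard-coded stop word set (an assumption used in place of the NLTK stop word list). The two sample reviews are also made up.

```python
import string
from sklearn.feature_extraction.text import CountVectorizer

# Small stand-in stop word set (assumption; the paper uses NLTK's list).
STOP_WORDS = {"i", "this", "the", "is", "and", "a", "it", "was", "have", "here"}

def text_process(text):
    """Strip punctuation, lowercase, and drop stop words; return a token list."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.lower().split() if w not in STOP_WORDS]

reviews = [
    "I love this place! The chicken sandwich is great.",
    "The burgers were cold and the service was slow.",
]

# Fit the vocabulary, then turn each review into a row of token counts.
bow_transformer = CountVectorizer(analyzer=text_process).fit(reviews)
bow = bow_transformer.transform(reviews)

print(bow.shape)                            # (number of reviews, number of unique tokens)
print(sorted(bow_transformer.vocabulary_))  # the learned vocabulary
```

Each row of `bow` corresponds to one review and each column to one vocabulary word, which is the same layout as the sparse matrix whose shape is reported in the next section.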
Data Transformation
After the vectorization process is done, there comes the transformation.
X = bow_transformer.transform(X)
The shape of the new X can now be checked.
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Density: {}'.format(density))
Output:
Shape of Sparse Matrix: (4086, 26435)
Amount of Non-Zero occurrences: 222391
Density: 0.2058920276658241
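As a quick check of the arithmetic, the density figure can be reproduced from the two reported numbers alone. The variable names below are chosen for the example.

```python
rows, cols = 4086, 26435      # shape of the sparse matrix reported above
non_zero = 222391             # number of non-zero entries reported above

total_cells = rows * cols
density = 100.0 * non_zero / total_cells

print(total_cells)
print(round(density, 4))
```

In other words, only about 0.2% of the cells hold a count, which is why the bag-of-words matrix is stored in sparse form.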
Training data and test data
Now that the review text in X has been processed, it’s time to split X and y into a training and a test set using train_test_split from Scikit-learn. We will use 30% of the dataset for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Training the model
Multinomial Naive Bayes is a specialised version of Naive Bayes designed more for text documents. The first step is to build a Multinomial Naive Bayes model and fit it to the training set (X_train and y_train).
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)
Testing and evaluating the model
The model has now been trained! It’s time to see how well it predicts the ratings of previously unseen reviews (reviews from the test set). The first step is to store the predictions in a separate array called preds.
preds = nb.predict(X_test)
Next, the predictions should be evaluated against the actual ratings (stored in y_test) using confusion_matrix and classification_report from Scikit-learn.
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, preds))
print('\n')
print(classification_report(y_test, preds))
Output:
[[157  71]
 [ 24 974]]
             precision    recall  f1-score   support
          1       0.87      0.69      0.77       228
          5       0.93      0.98      0.95       998
avg / total       0.92      0.92      0.92      1226
The report covers only the 1-star and 5-star reviews, and on these the model has achieved about 92% accuracy. This means that the model can predict whether a user liked a local business or not, based on what they typed.
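To make the link between the confusion matrix and the report explicit, the figures can be recomputed by hand. The sketch below assumes the Scikit-learn convention that rows are actual classes and columns are predicted classes, which is consistent with the support column (228 and 998).

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
labels = [1, 5]
cm = [[157, 71],
      [24, 974]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

report = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = cm[0][i] + cm[1][i]   # column sum: everything predicted as this class
    actual = cm[i][0] + cm[i][1]      # row sum: the support of this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)

print(report)
print(round(accuracy, 2))
```

The low recall for class 1 (71 of 228 one-star reviews are predicted as five-star) already hints at the bias towards positive reviews discussed later.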
Model
Multiple machine learning algorithms will now be used in order to see which gives the best performance.
The details and coding of these algorithms can be found in Appendix 2 of this document.
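A comparison of this kind could be organized along the following lines. This is only a sketch under stated assumptions, not the code from Appendix 2: the corpus is a tiny made-up set of reviews, all hyperparameters are defaults or arbitrary choices, accuracy is used as the score for simplicity, and XGBoost is left out to keep the example dependent on Scikit-learn alone.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Toy corpus (hypothetical reviews), repeated to get enough samples.
texts = ["great food and lovely staff", "awful service and cold food",
         "loved the burgers", "terrible wait and rude waiter"] * 10
stars = [5, 1, 5, 1] * 10

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, stars, test_size=0.3, random_state=101)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
    "K Nearest Neighbor": KNeighborsClassifier(),
    "Multilayer Perceptron": MLPClassifier(max_iter=500, random_state=0),
}

# Fit every model on the same split and record its test score.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2%}")
```

Keeping the split and the feature matrix fixed across models means the scores differ only because of the algorithm, which is what makes the ranking meaningful.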
Random forest
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean/average prediction of the individual trees. In this example, random forest was applied and the results are in Appendix 2.
Decision tree
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements. In this example, decision tree was applied and the results are in Appendix 2.
Support vector machine
In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. The results are in Appendix 2.
Gradient boosting classifier
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The result on this data is given below.
In the GBC code, the parameter evaluation step is commented out because it takes a long time to execute. It selected the following values:
* Learning Rate = 0.1
* Max Depth = 5
* Max Features = 0.5
Hence, these values were used directly from version 10 of the notebook onwards for faster execution. To see the parameter search running, either run version 9 or uncomment that part.
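A parameter search of the kind mentioned above might look like the sketch below. It is an assumption-laden illustration rather than the notebook's code: the data is synthetic, the grid values other than 0.1, 5, and 0.5 are invented, and the `n_estimators`, `cv`, and `scoring` settings are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; the paper uses the vectorized reviews).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Candidate values for the three parameters named in the text.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "max_features": [0.5, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)

print(search.best_params_)
```

Every combination in the grid is refit once per fold, so the cost grows multiplicatively with the grid size; this is why hard-coding the selected values makes later runs much faster.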
K nearest neighbor
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric method proposed by Thomas Cover used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
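The idea of classifying by the k closest training examples can be shown in a few lines of plain Python. This is a toy sketch with Euclidean distance and two-dimensional points invented for the example; the function name `knn_predict` is hypothetical, and in the project itself a library classifier would be applied to the bag-of-words vectors instead.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels))
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small clusters labeled with star ratings 1 and 5.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [1, 1, 1, 5, 5, 5]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```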
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on a major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
Multilayer perceptron classifier
A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to refer to any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons.
The modeling above yields the following scores:
* Multilayer Perceptron = 77.57%
* Multinomial Naive Bayes = 76.94%
* Gradient Boosting Classifier = 73.87%
* XGBoost Classifier = 70.81%
* Random Forest Classifier = 67.57%
* Decision Tree = 65.5%
* K Neighbor Classifier = 61.35%
* Support Vector Machine = 59.1%
Since the multilayer perceptron classifier has the best score, it is the algorithm that will be used to predict a random positive review, a random average review, and a random negative review.
From the above, it is clear that predictions are biased towards positive reviews. The dataset has more positive reviews than negative reviews. The data needs to be normalized (balanced) so that each rating has an equal number of reviews, thereby removing the bias and giving better results.
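One simple way to obtain an equal number of reviews per rating is random undersampling, sketched below. This is an assumption about how the balancing could be done, not a step the paper carried out; the helper name `undersample` and the sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(texts, labels, seed=0):
    """Randomly keep the same number of examples for every label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    # The smallest class decides how many examples each class keeps.
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, target):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced

# Ten hypothetical reviews, skewed towards 5 stars.
texts = [f"review {i}" for i in range(10)]
labels = [5, 5, 5, 5, 5, 5, 4, 4, 1, 1]

balanced = undersample(texts, labels)
print(Counter(label for _, label in balanced))
```

Undersampling discards data from the larger classes, so it trades some training signal for balance; oversampling the smaller classes or using class weights are alternatives with the opposite trade-off.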
Conclusion
In this paper, the review rating prediction problem was tackled for restaurant reviews on Yelp. It was treated as a 5-class classification problem, and various feature extraction and supervised learning methods were examined. Experimentation and performance evaluation through cross-validation yield one system, perceptron classification on the set of top 10,000 features obtained from unigrams and bigrams, that exhibits better predictive power than the others. The system can be used to generate star ratings on review websites where users can write freeform text reviews without giving a star rating. Though the methods tested in this paper are extensive, they are by no means exhaustive. In fact, there are many avenues for improvements and future work. Normalizing the data is very important. Four common normalization techniques may be useful:
* scaling to a range
* clipping
* log scaling
* z-score
These techniques can be used for future work and improvements.
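The four techniques listed above can be illustrated on a single numeric feature. The values below are invented (think of a vote count with one extreme outlier), and the clipping bounds of 1 and 10 are arbitrary choices for the example.

```python
import numpy as np

# Hypothetical numeric feature with one large outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Scaling to a range: map the values linearly onto [0, 1].
scaled = (values - values.min()) / (values.max() - values.min())

# Clipping: cap values outside a chosen interval.
clipped = np.clip(values, 1.0, 10.0)

# Log scaling: compress large values; log1p computes log(1 + x).
logged = np.log1p(values)

# Z-score: subtract the mean and divide by the standard deviation.
z_scores = (values - values.mean()) / values.std()

for name, result in [("scaled", scaled), ("clipped", clipped),
                     ("logged", logged), ("z-score", z_scores)]:
    print(name, np.round(result, 3))
```

Note that these transformations apply to numeric feature values; they do not by themselves change how many reviews each rating has.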
As an overall conclusion, using different machine learning algorithms for classification and prediction enriches the research and makes it more accurate. Choosing among these algorithms is crucial in order to make a final decision.
References
Antoniadis, A., Lambert-Lacroix, S., & Poggi, J.-M. (2021). Random forests for global sensitivity analysis: A selective review. Reliability Engineering & System Safety, 206, 107312. https://doi.org/10.1016/j.ress.2020.107312
Dhanalakshmi, R., Sri Devi, T., Varadarajan, V., Kommers, P., Piuri, V., & Subramaniyaswamy, V. (2020). Adaptive cognitive intelligence in analyzing employee feedback using LSTM. Journal of Intelligent & Fuzzy Systems, 39(6), 8069–8078. https://doi.org/10.3233/JIFS-189129
Effland, T., Lawson, A., Gravano, L., Hsu, D., Balter, S., Devinney, K., Reddy, V., & Waechter, H. (2018). Discovering foodborne illness in online restaurant reviews. Journal of the American Medical Informatics Association, 25(12), 1586–1592. https://doi.org/10.1093/jamia/ocx093
Kim, J., Trueblood, A. B., Kum, H.-C., & Shipp, E. M. (2020). Crash narrative classification: Identifying agricultural crashes using machine learning with curated keywords. Traffic Injury Prevention, 1–5. https://doi.org/10.1080/15389588.2020.1836365
Sánchez-Franco, M. J., Navarro-García, A., & Rondán-Cataluña, F. J. (2019). A naive Bayes strategy for classifying customer satisfaction: A study based on online reviews of hospitality services. Journal of Business Research, 101, 499–506. https://doi.org/10.1016/j.jbusres.2018.12.051
Suykens, J. A. K., & Vandewalle, J. (1999). Training Multilayer Perceptron Classifiers Based on a Modified Support Vector Method. IEEE Transactions on Neural Networks, 10(4), 907. https://doi.org/10.1109/72.774254
Tian, G., Lu, L., & McIntosh, C. (2021). What factors affect consumers’ dining sentiments and their ratings: Evidence from restaurant online review data. Food Quality & Preference, 88, 104060. https://doi.org/10.1016/j.foodqual.2020.104060
Ver Hoef, J. M., & Temesgen, H. (2013). A comparison of the spatial linear model to Nearest Neighbor (k-NN) methods for forestry applications. PloS One, 8(3), e59129. https://doi.org/10.1371/journal.pone.0059129