What I learned analyzing over 3,000 Amazon Alexa reviews using Natural Language Processing
As I approach the last semester of my master’s degree, I wanted to gain a deeper knowledge of machine learning and data science methodologies, as I believe they will be valuable for my professional career. I decided to study Natural Language Processing during my independent study course because it translates easily to the real world, yields useful insights, and produces meaningful results for business problems.
Note: this is a summarized version of my project. You can find the entire project report, dataset, and instructions on how to use it on my GitHub page following the link github.com/murilogustineli
What is Natural Language Processing?
Natural Language Processing (NLP) is a mixture of linguistics, computer science, information engineering, and artificial intelligence concerned with the interaction between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data [1]. There is essentially an infinite number of ways to arrange words in a sentence. It is practically impossible to give computers a dictionary of all possible sentences to help them understand what humans are talking about. Companies know that by better analyzing their data, they can improve their operations, thereby saving money and keeping their employees satisfied [2].
Methodology
For this project, I use a dataset of Amazon Alexa reviews. It consists of over 3,000 Amazon customer reviews (input text) along with the star rating, date of review, product variant, and feedback for various Amazon Alexa products such as the Echo, Echo Dot, and Fire TV Stick. I use this dataset to analyze Amazon’s Alexa products, discover insights in consumer reviews, and train machine learning models for sentiment analysis that distinguish positive from negative reviews. The reviews were extracted directly from Amazon’s website.
To analyze the dataset and deliver results, I use the Python programming language. Python is an open-source, interpreted, high-level language with good support for object-oriented programming, and it is widely used by data scientists for data science projects and applications. I chose Python over other programming languages because it provides extensive functionality for mathematics, statistics, and scientific computing, along with strong libraries for data science applications.
Results
I use NumPy and Pandas for basic numerical and statistical operations, Matplotlib and Seaborn for visualizations, and Plotly for more advanced visualizations. The complete project report uses additional libraries.
1. Distribution of Ratings for Alexa
The pie chart shows that most ratings for Alexa are positive: 72.6% of customers gave Alexa a 5-star rating and 14.4% gave it a 4-star rating, so 87% of customers rated Alexa 4 stars or higher. A further 4.83% gave a 3-star rating. At the low end, 3.05% of customers gave only 2 stars and 5.11% gave only 1 star, for a total of 8.16% of customers who were not satisfied with Alexa. Overall, the ratings are very positive, with almost 90% of customers satisfied with the product.
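The percentages above are simple shares of each star rating. Here is a minimal sketch of that calculation. The full project uses Pandas; this version sticks to Python's standard library, and `rating_distribution` and the toy `sample_ratings` list are hypothetical names and data for illustration, not the real dataset.

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {star: percentage of reviews} for a list of star ratings."""
    counts = Counter(ratings)
    total = len(ratings)
    return {star: 100 * n / total for star, n in sorted(counts.items())}


# Hypothetical toy sample, NOT the real Alexa dataset.
sample_ratings = [5, 5, 5, 5, 5, 5, 4, 4, 3, 1]

dist = rating_distribution(sample_ratings)
for star, pct in dist.items():
    print(f"{star}-star: {pct:.1f}%")

# Share of reviews with at least 4 stars (the "87%" figure in the text
# is this same sum computed on the real data).
at_least_four = dist.get(4, 0) + dist.get(5, 0)
print(f"4 stars or above: {at_least_four:.1f}%")
```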
2. Distribution of feedback for Alexa
This pie chart represents the distribution of feedback for Amazon’s Alexa. 91.8% of customers gave positive feedback (3 stars or above), and only about 8.2% gave negative feedback (2 stars or below). This confirms that Alexa receives very positive feedback from the majority of its customers, and only a small percentage did not like the product.
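The positive/negative split follows directly from the star ratings. The sketch below assumes the feedback label is derived with the 3-star cut-off described above. `feedback_label` is a hypothetical helper, and the ratings are made-up values.

```python
def feedback_label(rating):
    """Map a star rating to 1 (positive) or 0 (negative).

    Assumption: 3 stars or above counts as positive, as described in the text.
    """
    return 1 if rating >= 3 else 0


# Hypothetical toy ratings, NOT the real dataset.
sample_ratings = [5, 4, 3, 2, 1, 5, 5, 2]

labels = [feedback_label(r) for r in sample_ratings]
positive_share = 100 * sum(labels) / len(labels)

print(labels)
print(f"positive feedback: {positive_share:.1f}%")
```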
3. Length vs Rating
The bivariate plot shows the relationship between review length and rating, that is, how long customer reviews are for each star rating. Review lengths overlap considerably across ratings, but there is still a clear difference between low-rated and high-rated reviews: low-rated reviews tend to be longer. Most customers who gave Alexa 5 stars wrote shorter reviews than customers who gave 1 or 2 stars. This may be because unsatisfied customers feel the need to explain why they did not like the product, while satisfied customers do not feel the same urgency to write long reviews.
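One way to check this pattern numerically is to average the review length for each rating. This is a minimal standard-library sketch. `mean_length_by_rating` is a hypothetical helper, the reviews are invented examples, and length is measured in characters, which is an assumption about how length is defined.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_rating(reviews):
    """reviews: iterable of (rating, text) pairs.

    Returns {rating: mean review length in characters}.
    """
    lengths = defaultdict(list)
    for rating, text in reviews:
        lengths[rating].append(len(text))
    return {rating: mean(vals) for rating, vals in sorted(lengths.items())}


# Invented example reviews, NOT taken from the real dataset.
sample_reviews = [
    (5, "Love it!"),
    (5, "Works great"),
    (1, "Stopped working after two weeks and support was no help."),
    (2, "Sound is fine but it keeps disconnecting from wifi."),
]

by_rating = mean_length_by_rating(sample_reviews)
for rating, avg in by_rating.items():
    print(f"{rating}-star: {avg:.1f} characters on average")
```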
4. Most Frequent Occurring Words
The bar plot shows the most frequent words across all of the customer reviews analyzed. The graph gives a good idea of how customers think and feel about Amazon Alexa.
The words “love” and “great” are two of the most frequent words across all of the reviews, which suggests that most customers had very positive feelings towards Alexa. This is expected, since 91.8% of the reviews had a positive rating. Other frequent words that suggest Alexa is doing well are “amazing”, “like”, “easy”, “works”, and “good”.
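Word frequencies like these come from tokenizing each review, dropping common stop words, and counting what remains. Here is a minimal sketch of that idea. The tiny `STOP_WORDS` set, the `top_words` helper, and the sample reviews are all hypothetical, and a real analysis would use a much fuller stop-word list.

```python
import re
from collections import Counter

# Deliberately tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "it", "i", "and", "is", "to", "my", "this"}


def top_words(texts, n=3):
    """Return the n most common non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Invented example reviews, NOT taken from the real dataset.
sample = [
    "I love it",
    "Love the sound, great product",
    "Great price and I love my Echo",
]

print(top_words(sample, n=2))
```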
5. Vocabulary from Reviews
This word cloud visualization shows the most frequently used words in the customer reviews. The bigger the word, the more often customers wrote it. As seen in the previous result, “love”, “great”, and “like” are very frequent words in reviews from Alexa customers. This reinforces the customers’ positive feedback towards Alexa.
6. Natural Language Processing
There were many steps involved in getting the following results. You can find the in-depth explanation in the project report on my GitHub page.
Random Forest
The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Basically, each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. The low correlation between models is the key. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction [3].
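The "prediction by committee" idea can be illustrated with a small simulation. This is not a random forest implementation: it only models the voting step, under the idealized assumption that every tree is right independently with the same probability. Real trees are only partly uncorrelated, so the gain in practice is smaller. The function name and all parameter values are hypothetical.

```python
import random


def simulate_committee(n_trees=101, tree_accuracy=0.6, n_samples=2000, seed=0):
    """Compare one weak voter with a majority vote over many independent voters.

    Returns (accuracy of a single tree, accuracy of the majority vote).
    """
    rng = random.Random(seed)
    single_correct = 0
    committee_correct = 0
    for _ in range(n_samples):
        # Idealized assumption: each tree is independently correct
        # with probability tree_accuracy.
        votes = [rng.random() < tree_accuracy for _ in range(n_trees)]
        single_correct += votes[0]
        # The committee is correct when more than half the trees are.
        committee_correct += sum(votes) > n_trees / 2
    return single_correct / n_samples, committee_correct / n_samples


single_acc, committee_acc = simulate_committee()
print(f"single tree accuracy:   {single_acc:.3f}")
print(f"majority vote accuracy: {committee_acc:.3f}")
```

With individually weak but independent voters, the majority vote comes out far more accurate than any single voter, which is the effect the paragraph above describes.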
Classification Report
In the field of machine learning and classification, a confusion matrix is a table that is often used to describe the performance of a classification model (classifier) on a set of test data (X_test, y_test) for which the true values are known. A confusion matrix is basically a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by class. This is the key to the confusion matrix. The confusion matrix shows the ways in which the classification model is confused when it makes predictions. It gives insight not only into the errors being made by a classifier but, more importantly, into the types of errors being made [4].
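To make the table concrete, here is a minimal standard-library sketch that counts (true class, predicted class) pairs for a binary problem. The `confusion_matrix` function below is a hypothetical helper written for this example, and the labels and predictions are made up rather than taken from the project.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Build a confusion matrix as a nested list.

    Rows are true classes and columns are predicted classes,
    both in the order given by `labels`.
    """
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Made-up labels: 1 = positive review, 0 = negative review.
y_test = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

matrix = confusion_matrix(y_test, y_pred)
print(matrix)  # [[1, 2], [0, 5]]
```

In this toy example the classifier never misses a positive review (bottom-left cell is 0) but labels two negative reviews as positive (top-right cell is 2), which is exactly the kind of error breakdown the matrix is meant to reveal.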
In summary, the Random Forest classifier with 100 trees predicted positive and negative customer reviews with an overall accuracy of 94.28%. Precision, the share of reviews predicted to be positive that were actually positive, is 94%. Recall, the share of all positive reviews that the model found, is 100% (a perfect score). The F1-score, the harmonic mean of precision and recall, is 97%.
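These metrics all derive from four counts: true positives, false positives, false negatives, and true negatives. Here is a minimal sketch of the formulas. The counts are hypothetical, and only the last lines use figures from this article, to check that a precision of 94% and a recall of 100% are consistent with an F1-score of 97%.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true positives that the model found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Hypothetical counts for illustration, NOT the project's real results.
tp, fp, fn, tn = 5, 2, 0, 1

p = precision(tp, fp)
r = recall(tp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f} accuracy={accuracy:.2f}")

# Consistency check on the rounded figures reported in the article.
reported_f1 = f1_score(0.94, 1.00)
print(f"F1 from reported precision and recall: {reported_f1:.2f}")  # 0.97
```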
The graph below is a visual representation of the results from the classification report.
Evaluating ML Models: Precision, Recall, F1-Score and Accuracy
Conclusion
In conclusion, the results show that the vast majority of reviews written by Amazon’s Alexa customers were highly positive. Overall, 87% of customers gave Alexa a rating of at least 4 stars. When it comes to positive and negative feedback, 91.8% of customers gave positive feedback and only about 8.2% gave negative feedback. This shows that Alexa customers are very pleased with their purchase. Only a small percentage had some kind of complaint about Alexa or did not like the product.
The NLP model was very effective at distinguishing positive from negative reviews. With 94.28% overall accuracy, I conclude that the random forest classifier is very effective and works well for this kind of linguistic analysis.
Self-reflection
This independent study course was fun yet challenging. I believe the skills I learned throughout the course will be extremely helpful, as I am interested in pursuing a career in data science and business intelligence engineering. I feel more prepared to tackle real-life problems, make valuable data-driven business decisions, identify better business opportunities, and spot inefficient business processes.
Thank you very much if you made it this far. If you have any questions regarding the project feel free to send me a message. Once again, this is a summarized version of the entire final report. If you would like to see the entire project, go to my GitHub page at https://github.com/murilogustineli
Resources
1. Natural Language Processing (NLP)
https://en.wikipedia.org/wiki/Natural_language_processing
2. The unexpected benefits of data analytics
https://www.cio.com/article/3249905/the-unexpected-benefits-of-data-analytics.html
3. Understanding Random Forest
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
4. Confusion Matrix in Machine Learning
https://www.geeksforgeeks.org/confusion-matrix-machine-learning/