
What I learned analyzing over 3,000 Amazon Alexa reviews using Natural Language Processing

Murilo Gustineli

Senior AI Software Engineer @ Intel, CS and ML @ Georgia Tech

Published: May 1, 2020

As I approach the last semester of my master’s degree, I wanted to gain a deeper knowledge of machine learning and data science methodologies, as I believe it will be valuable for my professional career. I decided to study Natural Language Processing during my independent study course because it translates readily to the real world, yields useful insights, and produces meaningful results for business problems.

Note: this is a summarized version of my project. You can find the entire project report, dataset, and instructions on how to use it on my GitHub page following the link github.com/murilogustineli

What is Natural Language Processing?

Natural Language Processing (NLP) is a mixture of linguistics, computer science, information engineering, and artificial intelligence concerned with the interaction between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data [1]. There is essentially an infinite number of ways to arrange words in a sentence. It is practically impossible to give computers a dictionary of all possible sentences to help them understand what humans are talking about. Companies know that by better analyzing their data, they can improve their operations, thereby saving money and keeping their employees satisfied [2].

Methodology

For this project, I will be using a dataset of Amazon Alexa reviews. It consists of over 3,000 Amazon customer reviews (input text) together with the star rating, review date, product variant, and feedback label for various Amazon Alexa products such as the Echo, Echo Dot, and Fire TV Stick. I will use this dataset to explore what customers say about Alexa products and to train machine learning models for sentiment analysis, that is, to tell positive reviews from negative ones. The reviews were extracted directly from Amazon’s website.

In order to analyze and deliver results from the dataset, I will be using the Python programming language. Python is an open-source, interpreted, high-level language with good support for object-oriented programming, and it is one of the languages most widely used by data scientists. I am choosing Python over other programming languages because it offers extensive functionality for mathematics, statistics, and scientific computing, along with mature libraries for data science applications.

Results

I will be using NumPy and Pandas for basic numerical and statistical operations, Matplotlib and Seaborn for visualizations, and Plotly for more advanced visualizations. More libraries are used in the complete project report.

1. Distribution of Ratings for Alexa

The pie chart shows that most of the ratings for Alexa are positive. 72.6% of customers gave Alexa a 5-star rating and 14.4% gave it a 4-star rating, so 87% of customers gave Alexa at least a good rating. A further 4.83% gave a 3-star rating. 3.05% of customers gave only a 2-star rating and 5.11% gave only a 1-star rating, for a total of 8.16% of customers who were not satisfied with Alexa. Overall, the ratings are very positive, with almost 90% of customers satisfied with the product.
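
The percentages behind a chart like this come from counting how many reviews fall at each star rating. Here is a minimal sketch in plain Python, assuming the ratings are available as a list of integers; the function name `rating_shares` and the toy ratings are made up for illustration and are not taken from the project.

```python
from collections import Counter


def rating_shares(ratings):
    """Return the percentage of reviews at each star rating."""
    counts = Counter(ratings)
    total = len(ratings)
    return {star: 100 * n / total for star, n in sorted(counts.items())}


# Hypothetical toy ratings, not the real dataset.
toy_ratings = [5, 5, 5, 5, 5, 5, 5, 4, 3, 1]
shares = rating_shares(toy_ratings)

for star, share in shares.items():
    print(f"{star}-star: {share:.1f}%")
```

With Pandas, the same shares can be obtained from a ratings column with `value_counts(normalize=True)`.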

2. Distribution of feedback for Alexa

This pie chart represents the distribution of feedback for Amazon’s Alexa. 91.8% of customers gave positive feedback (3 stars or above), and only about 8.2% gave negative feedback (2 stars or below). This confirms that Alexa receives very positive feedback from the majority of its customers, and only a small percentage did not like the product.
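
The feedback split follows directly from the rating shares reported earlier. The sketch below applies the rule stated in the text (3 stars or above counts as positive) to those percentages; the 1/0 encoding and the function name `feedback_label` are assumptions made for illustration.

```python
def feedback_label(rating):
    """Return 1 (positive) for 3 stars or above, 0 (negative) otherwise."""
    return 1 if rating >= 3 else 0


# Rating shares reported in the article, in percent of reviews.
reported = {5: 72.6, 4: 14.4, 3: 4.83, 2: 3.05, 1: 5.11}

positive = sum(s for star, s in reported.items() if feedback_label(star) == 1)
negative = sum(s for star, s in reported.items() if feedback_label(star) == 0)

print(f"positive: {positive:.2f}%  negative: {negative:.2f}%")
```

The two sums come to roughly 91.8% and 8.2%, which matches the feedback chart; the small gap to 100% is due to rounding in the reported shares.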

3. Length vs Rating

The bivariate plot shows the relationship between review length and rating, that is, how long customer reviews are at each rating. Review lengths overlap considerably across ratings, but there is still a clear difference between low-rating and high-rating reviews: according to the graph, low-rating reviews tend to be longer. Most customers who gave Alexa 5 stars wrote shorter reviews than customers who gave 1 or 2 stars. This may be because unsatisfied customers feel the need to explain why they did not like the product, while satisfied customers do not feel the same urgency to write long reviews.
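
A comparison like this can be reproduced by adding a length column and averaging it per rating. The following is a minimal Pandas sketch under stated assumptions: the column names `rating` and `review` and the five toy reviews are hypothetical, and length is measured in characters, which may differ from what the project used.

```python
import pandas as pd

# Hypothetical toy reviews; column names are assumptions for illustration.
reviews = pd.DataFrame({
    "rating": [5, 5, 4, 2, 1],
    "review": [
        "Love it!",
        "Great sound.",
        "Works well and was easy to set up.",
        "It stopped responding after a week and support could not help.",
        "Returned it because it kept disconnecting from wifi every few hours.",
    ],
})

# Review length in characters.
reviews["length"] = reviews["review"].str.len()

# Average review length for each star rating.
mean_length = reviews.groupby("rating")["length"].mean()
print(mean_length)
```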

4. Most Frequent Occurring Words

The bar plot represents the most frequent words across all of the customer reviews analyzed. By looking at the graph, we can get a good idea of how customers think and feel about Amazon Alexa.

The words “love” and “great” are two of the most frequent words across all of the reviews, which suggests that most customers had very positive feelings towards Alexa. This is expected, since 91.8% of the reviews carry positive feedback. Other frequent words that suggest Alexa is doing well are “amazing”, “like”, “easy”, “works”, and “good”.
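
Word frequencies of this kind boil down to tokenizing each review, dropping very common filler words, and counting what remains. Here is a minimal sketch in plain Python; the stop-word list, the tokenization rule, and the toy reviews are all assumptions for illustration, and the project may have used a library vectorizer instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would likely use a fuller one.
STOP_WORDS = {"i", "my", "it", "the", "a", "and", "to", "is", "this", "so"}


def top_words(texts, n=3):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical reviews, not taken from the dataset.
toy_reviews = [
    "I love my Echo",
    "Love it, great sound",
    "Great product and easy to set up",
    "I love this",
]
print(top_words(toy_reviews, n=2))
```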

5. Vocabulary from Reviews

This word cloud shows the most frequently used words in the customer reviews. The bigger the word, the more often customers wrote it. As seen in the previous result, “love”, “great”, and “like” are very frequent words in Alexa reviews. This reinforces the customers’ positive feedback towards Alexa.

6. Natural Language Processing

There were many steps involved in getting the following results. You can find the in-depth explanation in the project report on my GitHub page.

Random Forest

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. Basically, each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction. A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. The low correlation between models is the key. The reason for this wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction [3].
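
To make the idea concrete, here is a minimal sketch of a random forest trained on review text with scikit-learn. It assumes bag-of-words features from `CountVectorizer` and uses a handful of made-up reviews and labels; the feature extraction and preprocessing in the actual project may differ, so treat this as an illustration rather than the project's pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews with feedback labels (1 = positive, 0 = negative).
texts = [
    "love it works great",
    "great sound easy setup",
    "love the echo",
    "amazing product works great",
    "stopped working terrible",
    "terrible sound returned it",
    "does not work waste of money",
    "waste of money returned",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words features: one column per word, values are word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 100 trees, each fit on a bootstrap sample and considering a random subset
# of features at every split. scikit-learn combines the trees by averaging
# their predicted class probabilities.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, labels)

new_reviews = ["love it great product", "terrible waste of money"]
predictions = model.predict(vectorizer.transform(new_reviews))
print(predictions)
```

Fixing `random_state` only makes the run repeatable; it has no bearing on the quality of the model.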

Classification Report

In the field of machine learning and classification, a confusion matrix is a table that is often used to describe the performance of a classification model (classifier) on a set of test data (X_test, y_test) for which the true values are known. A confusion matrix is basically a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. This is the key to the confusion matrix. The confusion matrix shows the ways in which the classification model is confused when it makes predictions. It gives insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made [4].
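
For a binary problem such as positive versus negative reviews, the confusion matrix has just four cells. The sketch below counts them by hand in plain Python; the function name `confusion_counts` and the labels are hypothetical, and a library routine such as scikit-learn's `confusion_matrix` would normally be used instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

In this toy case every error is a false positive: the classifier never misses a positive review but labels some negative ones as positive.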

In conclusion, the Random Forest classifier with 100 trees worked well for predicting positive and negative customer reviews, with an overall accuracy of 94.28%. Precision, the share of reviews predicted to be positive that actually are positive, is 94%. Recall, the share of all positive reviews that the model found, is 100% (a perfect score). The F1-score, the harmonic mean of precision and recall, is 97%.
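
These metrics are simple ratios of confusion-matrix counts, and the F1-score can be checked from the reported precision and recall. A minimal sketch, with function names chosen for illustration only:

```python
def precision_from_counts(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall_from_counts(tp, fn):
    """Share of true positives that the model found."""
    return tp / (tp + fn)


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Check the reported F1-score from the reported precision (94%) and recall (100%).
reported_f1 = f1_from(0.94, 1.00)
print(f"F1 = {reported_f1:.3f}")
```

With precision 0.94 and recall 1.00 this gives about 0.969, which rounds to the reported 97%.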

The graph below is a visual representation of the results from the classification report.

Evaluating ML Models: Precision, Recall, F1-Score and Accuracy

Conclusion

In conclusion, the results show that the vast majority of reviews written by Amazon’s Alexa customers were highly positive. Overall, 87% of customers gave Alexa at least a 4-star rating. In terms of positive and negative feedback, 91.8% of customers gave positive feedback and only about 8.2% gave negative feedback. This shows that Alexa customers are very pleased with their purchase. Only a small percentage had some kind of complaint about Alexa or did not like the product.

The NLP model was very effective at distinguishing positive from negative reviews. With 94.28% overall accuracy, I conclude that the random forest classifier works well for this kind of sentiment classification.

Self-reflection

This independent study course was fun yet challenging. I believe the skills I learned throughout the course will be extremely helpful, as I am interested in pursuing a career in data science and business intelligence engineering. I feel more prepared to tackle real-life problems, support data-driven business decisions, identify better business opportunities, and spot inefficient business processes.

Thank you very much if you made it this far. If you have any questions regarding the project feel free to send me a message. Once again, this is a summarized version of the entire final report. If you would like to see the entire project, go to my GitHub page at https://github.com/murilogustineli

Resources

1. Natural Language Processing (NLP)

https://en.wikipedia.org/wiki/Natural_language_processing

2. The unexpected benefits of data analytics

https://www.cio.com/article/3249905/the-unexpected-benefits-of-data-analytics.html

3. Understanding Random Forest

https://towardsdatascience.com/understanding-random-forest-58381e0602d2

4. Confusion Matrix in Machine Learning

https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
