登录查看更多内容

Making Sense of Millions of Amazon Reviews Using SQL, Spark and Python - Big Data Project

Soundarya (SB) Balasubramani

I talk about immigration, entrepreneurship, & (non-linear) growth | Writing "1000 Days of Love" | Founder @ unshackled.club | 2x Author of "Unshackled" & "Admitted" | O-1A, NIW, & UK Global Talent Visa Recipient

发布日期: 2018年12月18日

This is a story of three simple students (Srivatsan Ramesh, Arpita Shah and yours truly) from a non-CS background who worked on Amazon reviews to make better sense of them.

We worked on this as part of our Big Data Analytics course where we had to work on a project involving, as you can guess, ‘big data’. Tasked with choosing a topic, we wanted to work with a dataset that was clean (thus, circumventing the atrocious hours we would otherwise spend cleaning), understandable and had a lot of potential for novelty.

We picked the Amazon Product Review Dataset created by Julian McAuley from UCSD. Now, a good question to ask would be, why this topic?

Motivation

Here’s why: We noticed that when you purchase a product, the only variables that you can look at are rating and reviews. However, this is how a typical review looks like:

Sure, you can probably read one or two of these — but thousands? There is a line where you cross from human involvement to automation, and this was that line.

We wanted to build an algorithm that brought out the features of a product by gleaning all the reviews and gave it a more appropriate and semantically intelligent rating.

We will divide the rest of the article into parts:

1. Loading the Dataset

2. Performing Analytics to Prove Motivation

3. Obtaining the Semantic Orientation and Overall Rating

4. Extracting Features (Pros and Cons) from Reviews

1. Loading the Dataset

To work with big data is a boon and a bane — yes, you get to have a corpus of data to work with. But, working with a corpus of data requires fast machines. Fortunately, we got credits to use the Google Cloud Platform virtual machine (Ubuntu 16.04 LTS with 50 GB disk) and Hadoop (and in turn Spark, Python, Hive and HBase).

We picked the Cell Phone and Accessories category inside the dataset as it was ‘big’ enough but manageable with 3,447,249 rows and 9 columns. We also picked it as it sounded like a category where people would be prone to describing product features in the reviews.

We tried various methods to load and work with the dataset, and finally settled on loading it as a JSON using SQL Context as shown below.

2. Performing Analytics to Prove Motivation

Our reason for performing analytics was to drive home the point that the verbosity of the reviews and unclear pattern between the reviews and ratings lead to a situation where people could not glean the information they needed. To do the same, we tried understanding the distribution of reviews, their length and number of people who found it helpful.

Here below are some of the charts that we created for the same:

As you can observe, more than 75% of reviews have not been found helpful by people, and among those who were found helpful, the review text has an average length of more than 150 words. While most of the reviews were given 4 or 5 stars, there is no clear pattern between the rating and the length of the reviews.

So, longer reviews are more helpful, but it does not mean people liked the product more? Well, we need a better way to untangle this mess.

3. Obtaining the Semantic Orientation and Overall Rating

Below is a flow diagram of the process we followed to get to the output.

As you can see, after the ETL (Extraction, Transformation and Loading), we cleaned the review text to remove all ‘noise’ (in this case, that translates to punctuation, junk values and upper case letters). After this, we removed the objective part of the sentence using Textblob package.

Why do this? Think about it. Would you rather read ‘This iPhone has 64 gb of storage’ (which you already know) or ‘This iPhone has high storage for low price’? Subjectivity is what we wanted from the sentence. Textblob does a good job of telling you how subjective a sentence is, using which you can eliminate those that fall below a certain threshold (we picked 0.3).

Now, coming to the part that required a grammar refresher, we need to perform Parts-of-Speech tagging (done via NLTK library) to the review and then only retain the words that followed a certain pattern. Why do this? It is said that when people describe an object’s feature, it is mostly a combination of ‘adjective-noun’, ‘adverb-adjective-noun’ and ‘adverb-verb-noun’ (there are more combinations as well).

Hence, we used the following table as a guide and extracted the combinations.

And finally, the words were fed into a semantic orientation analyzer to find out the sentiment behind it. We used the VADER algorithm created by MIT. This gives a score ranging from -1 to 1 which we normalized and then multiplied with the original rating. The final score was summed up over all the reviews of a product and then a mean was taken.

Now this score gives a more holistic representation of what the person feels.

4. Extracting Features (Pros and Cons) from Reviews

The objective of feature extraction is to extract the features mentioned by the customers in reviews and understand its polarity with respect to the context to identify it as a pro/con for the product.

Since we are trying to understand the features, they are classified as nouns in parts of speech. We need to extract such nouns from the reviews and also search for the closest adjectives to these nouns. This can be done with N-gram modelling where we partition a given sentence into groups of 2 or 3. In our scenario, we had implemented a tri-gram model because as window-size of N gram models increase, the context meaning is diminished.

Let’s take a look at the code to understand this.

From the POS tagging and subjective-objective classification, we get the frequency of nouns for a particular product to understand the distribution of features that were mentioned in the reviews. We also derived a list of common words that were mostly related to gadgets and eliminated nouns that weren’t related to the common words. We performed a tri-gram modelling on this which is described in the function below.

After the tri-gram model, we took a list of adjectives that were mapped to a given noun. This was done using the function above where we take the tri-grams containing the given noun.

Using the function, we compute the polarity of all the tri-grams containing a given product feature and take an average of these scores to get the overall polarity associated with a particular feature to understand the performance of that product feature. If it’s close to 1, it’s a ‘pro’ — a feature that most people liked. If it’s close to -1, vice versa. By computing these scores for various product features across multiple reviews, we can get an overall opinion of these features using statistical modelling rather than reading millions of reviews.

*****************************************************************

The end result is a website — developed by Srivatsan using Flask and SQL — where we created a dashboard with the product details and overall rating, and a pro/con list of the features. This concluded the project, but we hope to work more on this in the future. Hope you found this useful and enjoyable (?), and we would be sharing the video of website and our Github link soon!

*****************************************************************

If you found this to be useful, do Follow me for more articles. I love hearing your thoughts via comments! ??I write on social issues, products, the technology sector and my graduate school experience in the US. Here is my personal blog.

The best way to get in touch with me is via Instagram and Facebook. I share some interesting content there. If you're a curious soul looking to learn everyday, here's a Slack Group that I created for you to join. Happy learning!

Sai Akhilesh Chunduri

Senior MuleSoft Developer | API Development (back-end) | H1B

5 年

Well written article Soundarya. I appreciate your work. But is there any specific reason on why you have chosen to load the data set from UCSD website and not from Kaggle? What was the reason for doing that?

1 次回应

Ram Gudimetla

Founder, Entrepreneur

6 年

Cool stuff. Compare your data with AWS comprehend.

1 次回应

Vamsi Kora

Tech Investor, Advisor | ex-Founder/CEO of Gathi Analytics | Committed to help Tech Entrepreneurs build impactful companies

6 年

Awesome study,.Very clean and crisp approach. Love the fact that you shared approach and solution in detail. Congratulations. Try airbnb next.? Similar solution can be leveraged and more impactful there. Cheers?

1 次回应

Hemanthkumar E

Ad tech at Navi| Ex Dream11|Ex-Mastercard

6 年

Nice

1 次回应

查看更多评论

要查看或添加评论，请登录

Soundarya (SB) Balasubramani的更多文章

I got my O-1 visa as a founder after 2.5 years, $25,000 and a denial.

2023年11月10日

I got my O-1 visa as a founder after 2.5 years, $25,000 and a denial.

On September 12th, 2023, I got my O-1A approved as the solo founder and 100% owner of an educational venture that…

45 条评论
A year ago, everything changed today. And here's the story behind it.

2023年6月20日

A year ago, everything changed today. And here's the story behind it.

Today, Rajesh and I went for a long walk around the park in front of his house. Somewhere along the walk, I told him…

19 条评论
Unshackled #15: Layoff resources, 60-day clock, & job search on B-1/B-2

2023年4月21日

Unshackled #15: Layoff resources, 60-day clock, & job search on B-1/B-2

Hi folks, Happy Friday. ? Just in the last 3.

4 条评论
Unshackled #14: Free O-1 Chapter, masterclass on EB-1, & faster visa interviews!

2023年4月13日

Unshackled #14: Free O-1 Chapter, masterclass on EB-1, & faster visa interviews!

Hi folks, Happy Thursday. ? As promised last week, here’s the fully designed O-1 chapter from Unshackled.

4 条评论
MoAB #15: Unshackled is public now

2022年8月11日

MoAB #15: Unshackled is public now

This is my weekly edition of the "Making Of A Book" series. If you've ever been curious about how a book is created and…

2 条评论
MoAB #14: I turned 26. What does it feel like?

2022年8月5日

MoAB #14: I turned 26. What does it feel like?

This is my weekly edition of the "Making Of A Book" series. If you've ever been curious about how a book is created and…

1 条评论
Turning 26: A Letter To The Future

2022年8月2日

Turning 26: A Letter To The Future

When I turned 25 last year, I wrote a letter to my 15-year-old self sharing the 10 lessons I would teach her if I could…

17 条评论
MoAB #13: Amusing ourselves to death

2022年7月26日

MoAB #13: Amusing ourselves to death

This is my weekly edition of the "Making Of A Book" series. If you've ever been curious about how a book is created and…
MoAB #11: Solitude or social connection?

2022年7月13日

MoAB #11: Solitude or social connection?

This is my weekly edition of the "Making Of A Book" series. If you've ever been curious about how a book is created and…

2 条评论
Making Of A Book #10: Why hustle culture is addictive

2022年7月5日

Making Of A Book #10: Why hustle culture is addictive

This is my weekly edition of the "Making Of A Book" series. If you've ever been curious about how a book is created and…

See all articles

Making Sense of Millions of Amazon Reviews Using SQL, Spark and Python - Big Data Project

Soundarya (SB) Balasubramani

I talk about immigration, entrepreneurship, & (non-linear) growth | Writing "1000 Days of Love" | Founder @ unshackled.club | 2x Author of "Unshackled" & "Admitted" | O-1A, NIW, & UK Global Talent Visa Recipient

Motivation

1. Loading the Dataset

2. Performing Analytics to Prove Motivation

3. Obtaining the Semantic Orientation and Overall Rating

4. Extracting Features (Pros and Cons) from Reviews

Soundarya (SB) Balasubramani的更多文章

社区洞察

其他会员也浏览了

Data Science for beginners

Mastering the PySpark Developer Interview: Key Questions, Answers, and LinkedIn's Role

BigData Analytics with PySpark

Dask vs. Spark: Which Big Data Tool Should Data Scientists Choose?

Automation in Data Analytics: How Python and SQL Are Changing the Game

Data Warehousing with Python: A Step-by-Step Guide to Mastery

R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results

Best Ways to Use Pandas with PySpark

Accessing Columns in PySpark: A Comprehensive Guide

Python in Data Engineering: Powering Databricks, Snowflake, dbt, and Airflow for Big Data Pipelines

Motivation

1. Loading the Dataset

2. Performing Analytics to Prove Motivation

3. Obtaining the Semantic Orientation and Overall Rating

4. Extracting Features (Pros and Cons) from Reviews

Soundarya (SB) Balasubramani的更多文章

I got my O-1 visa as a founder after 2.5 years, $25,000 and a denial.

A year ago, everything changed today. And here's the story behind it.

Unshackled #15: Layoff resources, 60-day clock, & job search on B-1/B-2

Unshackled #14: Free O-1 Chapter, masterclass on EB-1, & faster visa interviews!

MoAB #15: Unshackled is public now

MoAB #14: I turned 26. What does it feel like?

Turning 26: A Letter To The Future

MoAB #13: Amusing ourselves to death

MoAB #11: Solitude or social connection?

Making Of A Book #10: Why hustle culture is addictive

社区洞察

其他会员也浏览了

Data Science for beginners

Mastering the PySpark Developer Interview: Key Questions, Answers, and LinkedIn's Role

BigData Analytics with PySpark

Dask vs. Spark: Which Big Data Tool Should Data Scientists Choose?

Automation in Data Analytics: How Python and SQL Are Changing the Game

Data Warehousing with Python: A Step-by-Step Guide to Mastery

R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results

Best Ways to Use Pandas with PySpark

Accessing Columns in PySpark: A Comprehensive Guide

Python in Data Engineering: Powering Databricks, Snowflake, dbt, and Airflow for Big Data Pipelines