How to scrape Amazon reviews in Node

If you need to collect the reviews of many different products, Node is a good fit: its asynchronous I/O lets you request many Amazon pages at once without blocking.

In this article, we will scrape Amazon reviews and comments together using only two Node.js libraries.

The first thing we need is a list of Amazon URLs; for this example we will use amazon.com URLs. We have collected a sample of around 1,000 product ASINs, which you can download from here.

Amazon products list download

Loading Amazon URLs in Node

Let’s create a file start.js which will contain our Node code.

Let’s also install our two requirements:

npm i cheerio
npm i proxycrawl

Our project folder should now contain at least start.js and the downloaded amazon-products.txt file, alongside whatever npm created during the install.

Now it’s time to start coding. Let’s write our code in the start.js file, starting by loading the amazon-products.txt file into an array. We can do that with the following piece of code:

const fs = require('fs');

// Read the list and keep one trimmed, non-empty entry per line
const file = fs.readFileSync('amazon-products.txt', 'utf8');
const urls = file.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);

console.log(urls);
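The introduction mentions both URLs and ASINs, so the downloaded file may hold either. As a minimal sketch, assuming that a line which is not already a URL is a bare ASIN, and that its reviews can be reached under the https://www.amazon.com/product-reviews/ path, the list could be normalized as below. The helper name toReviewUrls and the sample ASINs are made up for illustration.

```javascript
// Hypothetical helper: turn the raw file contents into a clean list of URLs.
// Assumption: a line that does not start with http(s):// is a bare ASIN.
function toReviewUrls(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) =>
      /^https?:\/\//i.test(line)
        ? line
        : `https://www.amazon.com/product-reviews/${line}`
    );
}

// Placeholder input mixing a bare ASIN, a blank line and a full URL
const sample = 'B000000001\r\n\nhttps://www.amazon.com/dp/B000000002\n';
console.log(toReviewUrls(sample));
```

If your file already contains full URLs only, the ASIN branch is simply never taken.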

Now that we have the URLs in an array, we can start crawling them. We will use the ProxyCrawl node library that we installed before.

Crawling Amazon with ProxyCrawl

We need to initialize the library and create a worker with our token. For Amazon, we should use the normal token; make sure to replace YOUR_TOKEN with the actual token from your account.

We have to add the following two lines to our project:

const { ProxyCrawlAPI } = require('proxycrawl');
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

The resulting code looks like this:

const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');

// Read the list and keep one trimmed, non-empty entry per line
const file = fs.readFileSync('amazon-products.txt', 'utf8');
const urls = file.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

Now it is time to crawl the URLs. We will send 10 requests each second, which should suffice for our test; if you need more, make sure to contact ProxyCrawl.

Let’s build our code to send 10 API requests each second:

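const requestsPerSecond = 10;
let currentIndex = 0;
const timer = setInterval(() => {
  // Send up to 10 requests, but never read past the end of the list
  for (let i = 0; i < requestsPerSecond && currentIndex < urls.length; i++) {
    api.get(urls[currentIndex]);
    currentIndex++;
  }
  // Stop the timer once every URL has been requested
  if (currentIndex >= urls.length) clearInterval(timer);
}, 1000);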
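An alternative to a fixed timer is to walk the list in batches and pause between them, so the loop ends on its own when the list is exhausted. The sketch below assumes a fetchPage function that returns a promise (for example a thin wrapper around api.get). The names chunk, runBatches and fetchPage are illustrative and not part of the ProxyCrawl library.

```javascript
// Split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Start one batch, pause, then start the next one.
// `fetchPage` is a stand-in for whatever performs the request.
async function runBatches(urls, perBatch, pauseMs, fetchPage) {
  const batches = chunk(urls, perBatch);
  const pending = [];
  for (let b = 0; b < batches.length; b++) {
    for (const url of batches[b]) {
      // Catch per-request failures so one bad URL does not abort the run
      pending.push(fetchPage(url).catch((error) => ({ url, error })));
    }
    // No need to wait after the final batch
    if (b < batches.length - 1) await sleep(pauseMs);
  }
  // Results come back in the same order as the input URLs
  return Promise.all(pending);
}
```

With the numbers used in this article, the call would look like runBatches(urls, 10, 1000, (url) => api.get(url)).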

We are now loading the URLs, but we are not doing anything with the result. So it’s now time to start scraping.

Scraping Amazon reviews

We will use the Node Cheerio library that we installed before to parse the resulting HTML and extract only the reviews.

Let’s first include cheerio:

const cheerio = require('cheerio');

And now let’s build a function that should receive the HTML and parse it accordingly.

function parseHtml(html) {
  // Load the html in cheerio
  const $ = cheerio.load(html);
  // Load the reviews
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Find the text children
    const textReview = $(review).find('.review-text').text();
    console.log(textReview);
  });
}
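The text returned by cheerio can carry the indentation and line breaks of the original page markup. A small sketch of a clean-up step that could be applied to textReview before it is logged or stored follows; cleanReviewText is a hypothetical helper, not part of cheerio.

```javascript
// Hypothetical helper: normalize the whitespace of a scraped review.
function cleanReviewText(text) {
  // Collapse runs of spaces, tabs and newlines into single spaces
  return text.replace(/\s+/g, ' ').trim();
}

console.log(cleanReviewText('\n   Great product,\n   would buy again.  \n'));
// prints "Great product, would buy again."
```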

We now have the text content of the reviews, so we are close to finishing. The most crucial part is still missing, though: connecting our function to the earlier piece of code that calls the ProxyCrawl API.

The full code should look like the following:

const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');
const cheerio = require('cheerio');

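// Read the list and keep one trimmed, non-empty entry per line
const file = fs.readFileSync('amazon-products.txt', 'utf8');
const urls = file.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);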
const api = new ProxyCrawlAPI({ token: 'YOUR_TOKEN' });

function parseHtml(html) {
  // Load the html in cheerio
  const $ = cheerio.load(html);
  // Load the reviews
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Find the text children
    const textReview = $(review).find('.review-text').text();
    console.log(textReview);
  });
}

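const requestsPerSecond = 10;
let currentIndex = 0;
const timer = setInterval(() => {
  // Send up to 10 requests, but never read past the end of the list
  for (let i = 0; i < requestsPerSecond && currentIndex < urls.length; i++) {
    api.get(urls[currentIndex]).then((response) => {
      // Make sure both the API call and the original request succeeded
      if (response.statusCode === 200 && response.originalStatus === 200) {
        parseHtml(response.body);
      } else {
        console.log('Failed: ', response.statusCode, response.originalStatus);
      }
    }).catch((error) => console.error('Request error: ', error.message));
    currentIndex++;
  }
  // Stop the timer once every URL has been requested
  if (currentIndex >= urls.length) clearInterval(timer);
}, 1000);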
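The else branch above only logs failures. One way to handle them, sketched under the assumption that a failed page is worth requesting again a couple of times, is a small retry wrapper. The names withRetry, fetchPage and isSuccess are illustrative: in this article fetchPage would wrap api.get and isSuccess would perform the two status checks shown above.

```javascript
// Call `fetchPage(url)` until `isSuccess` accepts the response
// or `maxAttempts` is reached; the last response is returned either way.
async function withRetry(fetchPage, url, isSuccess, maxAttempts = 3) {
  let lastResponse;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    lastResponse = await fetchPage(url);
    if (isSuccess(lastResponse)) return lastResponse;
  }
  // Hand back the last failed response so the caller can log it
  return lastResponse;
}

// Demo with a fake fetcher that fails twice before succeeding
let demoCalls = 0;
const flakyFetch = async () => ({ statusCode: ++demoCalls < 3 ? 500 : 200 });

withRetry(flakyFetch, 'https://example.com', (r) => r.statusCode === 200)
  .then((response) => console.log(response.statusCode, 'after', demoCalls, 'calls'));
```

Keep in mind that every retry counts as another request, so retries eat into the per-second budget.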

The code is ready, and you can now scrape the reviews from 10 Amazon pages each second. For this post we are just logging the results to the console; you should replace that console.log with whatever you would like to do with the reviews: save them in a database, write them to a file, and so on. That is up to you.

We hope you enjoyed this tutorial and we hope to see you soon in ProxyCrawl. Happy crawling!
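As one possible replacement for console.log, the sketch below appends reviews to a file, one JSON object per line. The function name saveReviews, the record shape and the temporary file path are assumptions chosen for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Append each review as one JSON object per line (JSON Lines format).
function saveReviews(filePath, url, reviews) {
  const lines = reviews.map((text) => JSON.stringify({ url, text }));
  if (lines.length > 0) {
    fs.appendFileSync(filePath, lines.join('\n') + '\n');
  }
}

// Example run against a temporary file with placeholder data
const outFile = path.join(os.tmpdir(), `reviews-${process.pid}.jsonl`);
saveReviews(outFile, 'https://www.amazon.com/dp/EXAMPLE', ['Great', 'Too small']);
console.log(fs.readFileSync(outFile, 'utf8'));
```

Appending line by line means a crash halfway through the crawl does not lose the reviews already written.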

Crawlbase