How to scroll a website while crawling

Infinite scrolling works by fetching and rendering new data every time the user scrolls to the bottom of a page. If you are looking for an easy way to crawl a webpage with continuous or lengthy content that requires scrolling, such as Facebook groups, Twitter tweets, or Quora search results, this guide can save you precious time and effort.
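For context, here is a minimal browser-side sketch of the pattern that powers infinite scrolling. It is purely illustrative (the /api/items endpoint and the item shape are made up for this example), but it shows why new content only appears after the page has actually been scrolled:

let nextPage = 2;

window.addEventListener('scroll', async () => {
  // Consider ourselves "near the bottom" within 200px of the page end.
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.offsetHeight - 200;
  if (!nearBottom) return;
  // Hypothetical endpoint: fetch the next batch of items and append them.
  const res = await fetch('/api/items?page=' + nextPage++);
  const items = await res.json();
  for (const item of items) {
    const div = document.createElement('div');
    div.textContent = item.title; // assumes each item has a title field
    document.body.appendChild(div);
  }
});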

In this article, we will show you how to build a simple web crawler that automatically scrolls a webpage using our Crawling API with the scroll parameter. We will write our code in Node.js and keep it as beginner-friendly as possible.

Before we start coding, it is important to know the three key elements needed for this to work:

  • JavaScript token: a token provided to you upon signing up at ProxyCrawl; it is needed to pass the parameters below.
  • &scroll parameter: passing this to the API tells your request to scroll the page, with a default interval of 10 seconds.
  • &scroll_interval: this parameter makes the API scroll for X seconds after loading the page. The maximum scroll interval is 60 seconds; after 60 seconds of scrolling, the API captures the data and brings it back to you. A complete request URL built from these pieces is shown right after this list.
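Put together, the token and parameters form a plain HTTPS GET request. For example, the full request URL for the code we will write below looks like this (with JS_TOKEN standing in for your own token):

https://api.proxycrawl.com/?token=JS_TOKEN&scraper=quora-serp&scroll=true&url=https%3A%2F%2Fwww.quora.com%2Fsearch%3Fq%3Dproxycrawl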

Scrolling a website with Node

To begin, open your command prompt (Windows) or terminal and check whether Node.js is installed on your system by typing node --version. If you do not have Node yet, or if it is outdated, we recommend downloading and installing the latest Node.js version first.

Once you have successfully installed or updated Node, go ahead and create a project folder.


In this example we will use Visual Studio Code, but you may also use your favorite code editor.

Create a new file and name it quoraScraper.js.


Now we can start writing our code. First, we can declare our constant variables so we can properly call the Crawling API with the necessary parameters as shown below:

const https = require('https');

// URL-encode the target page so it can be passed safely as a query parameter.
const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');

// Replace JS_TOKEN with your JavaScript token from ProxyCrawl.
const options = {
  hostname: 'api.proxycrawl.com',
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&url=' + url,
};

Remember that you can swap the URL with any URL you wish to scrape (together with the corresponding &scraper parameter), and replace JS_TOKEN with your actual JavaScript token.
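If you prefer not to hardcode the token, you can read it from an environment variable instead. A minimal sketch (PROXYCRAWL_JS_TOKEN is our own variable name, not something the API requires):

const https = require('https');

// Keep the token out of source control by reading it from the environment.
const token = process.env.PROXYCRAWL_JS_TOKEN;
if (!token) {
  throw new Error('Set the PROXYCRAWL_JS_TOKEN environment variable first.');
}

const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');
const options = {
  hostname: 'api.proxycrawl.com',
  path: `/?token=${token}&scraper=quora-serp&scroll=true&url=${url}`,
};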

The next part of our code sends the request, parses the JSON response, and displays the results in the console:

https
  .request(options, (response) => {
    let body = '';
    response
      // Accumulate the response body as it streams in.
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        // Parse the full response and print the status and scraped content.
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();

Once done, press F5 in Visual Studio Code (Windows) to see the result, or execute the script from the terminal or command prompt:

C:\Nodejs\project> node quoraScraper.js        

Since we have not set the scroll interval yet, the API defaulted to 10 seconds of scrolling, which naturally returns less data.
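In real crawls, the response body may occasionally not be valid JSON (for example, when an error page comes back), so it is worth guarding the parse. Here is the same program with a defensive end handler and a request error handler; this is our own hardening sketch, not something the API requires:

const https = require('https');

const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');
const options = {
  hostname: 'api.proxycrawl.com',
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&url=' + url,
};

https
  .request(options, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        try {
          const json = JSON.parse(body);
          console.log(json.original_status);
          console.log(json.body);
        } catch (err) {
          // Not valid JSON; dump the raw body so we can inspect it.
          console.error('Could not parse response:', err.message);
          console.error(body);
        }
      });
  })
  .on('error', (err) => console.error('Request failed:', err.message))
  .end();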

Fetching more data with Node

Now, if you wish to scroll for longer (e.g. 20 seconds), you have to set a value on the &scroll_interval parameter. The full code is shown below:

const https = require('https');

const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');

const options = {
  hostname: 'api.proxycrawl.com',
  // scroll_interval=20 tells the API to keep scrolling for 20 seconds.
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&scroll_interval=20&url=' + url,
};

https
  .request(options, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();

Please make sure to keep your connection open for up to 90 seconds if you intend to scroll for 60 seconds. You can find more information about the scroll parameter in our documentation.
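With Node's built-in https module, one way to keep the connection alive that long is to raise the request timeout. A sketch, assuming a 60-second scroll plus a safety margin (the 90000 ms value is our own choice based on the guidance above):

const https = require('https');

const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');
const options = {
  hostname: 'api.proxycrawl.com',
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&scroll_interval=60&url=' + url,
};

const req = https.request(options, (response) => {
  let body = '';
  response
    .on('data', (chunk) => (body += chunk))
    .on('end', () => console.log(body));
});

// Give the API room to finish 60 seconds of scrolling before giving up.
req.setTimeout(90000, () => {
  console.error('Request timed out after 90 seconds.');
  req.destroy();
});
req.end();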

If you run the code again, you should get more data back.

At this point, we have successfully built a simple scraper that can scroll through a webpage in less than 20 lines of code. Remember that this can be integrated into an existing web scraper, and you are also free to use our ProxyCrawl Nodejs library as an alternative.
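As a rough sketch of what the library route might look like (assuming the ProxyCrawlAPI class and option names from the library's README; double-check the current documentation, as names may differ):

const { ProxyCrawlAPI } = require('proxycrawl');

const api = new ProxyCrawlAPI({ token: 'JS_TOKEN' });

api
  .get('https://www.quora.com/search?q=proxycrawl', {
    scraper: 'quora-serp',
    scroll: 'true',
    scroll_interval: '20',
  })
  .then((response) => console.log(response.body))
  .catch((err) => console.error(err));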

Of course, this is just the start; there are lots of things you can do with this, and we hope it has added value to your web scraping knowledge.
