How to scroll a website while crawling
Infinite scrolling works by fetching and rendering new data every time the user scrolls to the bottom of a page. If you are looking for an easy way to crawl a webpage with continuous or lengthy content that needs scrolling, such as Facebook groups, Twitter tweets, or even search results on Quora, this guide can help save you precious time and effort.
In this article, we will show you how to create a simple web crawler that automatically scrolls a webpage using our Crawling API with the scroll parameter. We will write our code in Node.js and keep it as beginner-friendly as possible.
Before we start coding, it is important to know the 3 key elements for this to work: your JavaScript token, the &scraper parameter, and the &scroll parameter of the Crawling API.
Scrolling a website with Node
To begin, open up your command prompt (Windows) or terminal and check if you have Node.js installed on your system by typing node --version. If you do not have Node yet, or if it is outdated, we recommend downloading and installing the latest Node.js version first.
Once you have successfully installed or updated Node, go ahead and create a project folder, as shown below:
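For example, from the command prompt (the folder and path names here are just placeholders, so use whatever fits your setup):

C:\> mkdir Nodejs\project
C:\> cd Nodejs\project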
In this instance, we will be using Visual Studio Code as an example, but you may also use your favorite code editor. Create a new file and name it quoraScraper.js.
Now we can start writing our code. First, we can declare our constant variables so we can properly call the Crawling API with the necessary parameters as shown below:
const https = require('https');
// the target URL must be encoded before it is passed as a query parameter
const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');
const options = {
  hostname: 'api.proxycrawl.com',
  // replace JS_TOKEN with your JavaScript token; scroll=true enables scrolling
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&url=' + url,
};
Remember that you can swap the URL for any URL you wish to scrape (with the corresponding &scraper parameter) and replace JS_TOKEN with your actual JavaScript token.
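If you prefer not to hardcode the token, you can read it from an environment variable instead; a quick sketch, where PROXYCRAWL_JS_TOKEN is just a name we picked for this example:

// PROXYCRAWL_JS_TOKEN is a hypothetical environment variable name
const token = process.env.PROXYCRAWL_JS_TOKEN;
const options = {
  hostname: 'api.proxycrawl.com',
  path: '/?token=' + token + '&scraper=quora-serp&scroll=true&url=' + url,
};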
The next part of our code sends the request, parses the JSON response, and displays the results in the console:
https
  .request(options, (response) => {
    let body = '';
    response
      // accumulate the response chunks into a single string
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        // parse the complete response and print the relevant fields
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();
Once done, press F5 (Windows) to see the result, or execute it from the terminal or command prompt:
C:\Nodejs\project> node quoraScraper.js
Since we have not set the scroll interval yet, the API defaults to 10 seconds of scrolling, which naturally returns less data.
Fetching more data with Node
Now, if you wish to scroll longer (e.g. 20 seconds), you have to set a value for the &scroll_interval parameter. The full code is shown below:
const https = require('https');
// the target URL must be encoded before it is passed as a query parameter
const url = encodeURIComponent('https://www.quora.com/search?q=proxycrawl');
const options = {
  hostname: 'api.proxycrawl.com',
  // scroll_interval=20 tells the browser to keep scrolling for 20 seconds
  path: '/?token=JS_TOKEN&scraper=quora-serp&scroll=true&scroll_interval=20&url=' + url,
};
https
  .request(options, (response) => {
    let body = '';
    response
      .on('data', (chunk) => (body += chunk))
      .on('end', () => {
        const json = JSON.parse(body);
        console.log(json.original_status);
        console.log(json.body);
      });
  })
  .end();
Please make sure to keep your connection open for up to 90 seconds if you intend to scroll for 60 seconds. You can find more information about the scroll parameter in our documentation.
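In plain Node, one way to keep the connection from timing out is to set a generous timeout on the request itself; a minimal sketch reusing the options object from above:

const request = https.request(options, (response) => {
  let body = '';
  response.on('data', (chunk) => (body += chunk));
  response.on('end', () => console.log(JSON.parse(body).body));
});
// allow up to 90 seconds before giving up, enough for a 60-second scroll
request.setTimeout(90000, () => request.destroy());
request.end();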
If you run the code again, you should get more data back than before.
At this point, we have successfully completed a simple scraper that can scroll through a webpage in less than 20 lines of code. Remember that this can be integrated into an existing web scraper, and you are also free to use our ProxyCrawl Node.js library as an alternative.
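If you go the library route, the same request might look roughly like this; a minimal sketch, assuming the package exposes a ProxyCrawlAPI class with a promise-based get() that forwards extra options as Crawling API query parameters (check the library's README for the exact API):

const { ProxyCrawlAPI } = require('proxycrawl'); // npm install proxycrawl
const api = new ProxyCrawlAPI({ token: 'JS_TOKEN' }); // your JavaScript token
api
  // extra options are assumed to be forwarded as query parameters
  .get('https://www.quora.com/search?q=proxycrawl', {
    scraper: 'quora-serp',
    scroll: 'true',
    scroll_interval: '20',
  })
  .then((response) => console.log(response.body))
  .catch(console.error);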
Of course, this is just the start; there are lots of things you can do from here, and we hope this guide has added value to your web scraping knowledge.