How to create a scraper with a headless browser using Puppeteer and Node.js

At some point we need to collect data from some websites. The normal process is to talk with the owners of those companies, because you would like to buy a subscription to get that data and probably close an agreement. But what happens if that site does not respond, or even worse, responds with a negative answer saying they will not provide an API for third parties? Believe it or not, it happens a bunch of times, and that is why companies develop web crawlers/scrapers: to automate getting valuable HTML data, cleaning it, and saving it in a secure place to use later. There are common myths about whether scraping is legal or not; there are good articles about that online.

What is web scraping?

It is the act of loading a webpage, downloading the HTML, extracting the target information of interest from that HTML, transforming that data, and storing it in an accessible database to use or process later. There are multiple implementations in different stacks and languages, Python, Node.js, PHP, and Go among them, to mention a few.

There are also different ways to achieve it: small libraries and plain web clients can download HTML too. My focus here is on a headless browser.

What makes a headless browser special?

Basically, it opens a browser window process without a GUI, running behind the scenes. It is a little bit confusing, but that is how it works. A headless browser allows you to launch a script or program that acts like a normal user; you can even automate scrolling down and clicking buttons to retrieve more data. That makes it hard for scraper-detection systems to block the request, because it behaves like a "user". It also allows you to choose which browser you want to use.
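As a minimal sketch of what "acting as a user" can look like (the URL and the commented-out selector here are hypothetical placeholders, not from the project we build below):

import * as Puppeteer from 'puppeteer';

async function browseLikeAUser() {
  const browser = await Puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Scroll down one viewport, like a user would, to trigger lazy-loaded content
  await page.evaluate(() => window.scrollBy(0, window.innerHeight));
  // Click a button by its selector, e.g. a hypothetical "load more" button:
  // await page.click('#load-more');
  await browser.close();
}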

Benefits of web scraping

For the company being scraped

  • Even if they decide not to sell an API, they are still generating traffic from automated web crawler servers
  • After studying that traffic, the company should consider selling an API to get more profit.

For the company doing the scraping

  • Get valuable data from target websites

Let's code!

As I mentioned, you can use the stack of your preference, launch your favorite browser, and so on. In my case, I will use Puppeteer, TypeScript, and the Chrome browser, which is the default; you can choose another browser product such as Firefox.

(Link to the code source is at the end of the post)

Assuming you have npm, Node.js, and TypeScript installed:

  • Install TypeScript: npm install -g typescript
  • tsc --init
  • npm install puppeteer

Create the following folder structure:

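(The original screenshot is unavailable; this is a plausible layout assuming the file names used in the rest of this post, and the repo linked at the end may differ slightly.)

scraper-project/
├── src/
│   ├── headless.ts   (the HeadlessBrowser class)
│   ├── scraper.ts    (the Scraper class)
│   └── main.ts       (the entry point)
├── package.json
└── tsconfig.json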

Inside the scraper folder, let's first create the headless handler:

import { EventEmitter } from 'events';
import * as Puppeteer from 'puppeteer';

export class HeadlessBrowser extends EventEmitter {
  protected browser: Puppeteer.Browser;
}

I am extending the EventEmitter class to make it easy to send events when the headless window opens and closes, and to surface logs from inside our scraping statements, which is useful for debugging our queries.
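For example, a consumer of this class can subscribe to those events; a small sketch (the 'console' event name matches what getPage emits further below):

const headless = new HeadlessBrowser();
headless.on('console', (text: string) => {
  console.log('[page log]', text);
});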

Init function:

private async initBrowser() {
  const browserArgs: Puppeteer.PuppeteerLaunchOptions = {
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--headless',
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-web-security',
      '--disable-infobars',
      '--window-position=0,0',
      '--ignore-certificate-errors',
      '--ignore-certificate-errors-spki-list',
    ],
  };
  this.browser = await Puppeteer.launch(browserArgs);
}

We need to make sure that when the browser window is "opened" (run in the background) it ignores some errors, runs in headless mode (remove the --headless param and you will see a window actually open), and matches our requirements. I am using the basic options; the Puppeteer documentation lists many more.
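For example (a sketch, not from the original post), while developing you can swap the body of initBrowser for a visible, slowed-down launch so you can watch the automation:

const debugArgs: Puppeteer.PuppeteerLaunchOptions = {
  headless: false, // show the actual browser window
  slowMo: 50,      // slow each Puppeteer operation down so you can watch it
};
this.browser = await Puppeteer.launch(debugArgs);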

Get page

public async getPage() {
  if (!this.browser) {
    await this.initBrowser();
  }
  const page = await this.browser.newPage();

  // Avoiding bot detection
  const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
  await page.setUserAgent(userAgent);

  // Stream the page's console messages to our emitter for debugging
  page.on('console', msg => {
    this.emit('console', msg.text());
  });

  return page;
}

We check whether our browser window already exists; if it does not, we create it.

With the Puppeteer API, we get a new page. A small hack to avoid bot detection is setting the user agent; finally, we stream the page logs to debug our queries.
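Beyond the user agent, there are other common tweaks that make the page look more like a real user's browser (these additions are my assumptions, not part of the original code):

// Set a realistic screen size and a language header
await page.setViewport({ width: 1366, height: 768 });
await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });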

Close function

public async closeBrowser() {
  if (this.browser) {
    await this.browser.close();
  }
}

After we finish scraping the target data, we need to close the window instance.

The scraper code

export class Scraper {
  protected name: string;
  protected baseURL: string;
  protected browser = new HeadlessBrowser();

  constructor() {
    this.name = 'Scraper';
    this.baseURL = 'https://github.com/trending';
  }
}

The Scraper class will contain all the operations we need to perform with the HeadlessBrowser instance. We could have multiple instances, but this time, for brevity, we will use one. As a didactic example, let's use 'https://github.com/trending' to retrieve the top repos and their number of stars.

GitHub already has an API to get all trending repos and their star counts; I am using it purely as a didactic example. As I mentioned, when you need data from a website, first make sure an API exists. If it does not exist, scrape it!

const page = await this.browser.getPage();

await page.goto(this.baseURL, {
  timeout: 300000,
  waitUntil: 'networkidle0',
});

First, we open a page using our browser instance; then we navigate to the URL, similar to typing a URL into the address bar. I pass the timeout and waitUntil params so the page is not torn down before the automated interactions finish and we can scrape the data.
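As an alternative waiting strategy (an assumption on my part, not from the original post), you can wait for a specific element to appear instead of waiting for the network to go idle:

await page.waitForSelector('.Box-row', { timeout: 30000 });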

The evaluate function

const data = await page.evaluate(() => {
  // Your scraping code here 
});        

This is the callback where we put all the document DOM code to extract data from the HTML. At this point we go back to basics, recalling a little bit of DOM scripting.

Inside the evaluate callback, for this particular use case, let's use this code:

const allReposArticles = document.querySelectorAll('.Box-row h1.lh-condensed a');
const allReposArray = Array.from(allReposArticles);
const allNamesRepos = allReposArray.map((item: HTMLElement | any) => {
  return { name: item.innerText };
});        

When we run the document.querySelectorAll code in the web console, we can see a NodeList of the anchor elements that match the selector.

querySelectorAll returns a collection of HTML elements that match our selector, which is exactly the collection of data we want to retrieve.

Because we receive a collection of elements (a NodeList), we need to convert it into an array to manipulate the data. Finally, we map that array into a handy JSON structure to be processed later.

The next thing is to get the number of stars

const regexMatchDigits = /\d+/g;
const allStarArticles = document.querySelectorAll(
  '.Box-row .d-inline-block.float-sm-right',
);
const allStarReposArray = Array.from(allStarArticles);
const allStarsRepos = allStarReposArray.map((item: HTMLElement | any) => {
  // Strip thousands separators so '1,234' does not match as two numbers
  const starDigits = item.innerText.replace(/,/g, '').match(regexMatchDigits);
  return { stars: starDigits ? Number(starDigits[0]) : 0 };
});

The code is self-explanatory; the only difference is that here we are matching the digits of the star count.

And finally, we merge those values into one array of objects.


const dataMerged = allNamesRepos.map((repo: any, index: number) => {
  const obj = {
    name: repo.name,
    starsToday: allStarsRepos[index].stars,
  };
  return obj;
});

Inside the page evaluate function you can add console.info() statements to debug your queries, or use the web console as well.
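Putting the fragments together, the scrape() method that main.ts calls below could look like this (a sketch assembled from the snippets above; the actual method in the linked repo may differ):

public async scrape() {
  const page = await this.browser.getPage();
  try {
    await page.goto(this.baseURL, {
      timeout: 300000,
      waitUntil: 'networkidle0',
    });
    // Everything inside evaluate runs in the page context
    return await page.evaluate(() => {
      const regexMatchDigits = /\d+/g;
      const repoLinks = Array.from(
        document.querySelectorAll('.Box-row h1.lh-condensed a'),
      );
      const starSpans = Array.from(
        document.querySelectorAll('.Box-row .d-inline-block.float-sm-right'),
      );
      return repoLinks.map((item: any, index: number) => {
        const starText: string = starSpans[index]
          ? (starSpans[index] as any).innerText
          : '';
        const starDigits = starText.replace(/,/g, '').match(regexMatchDigits);
        return {
          name: item.innerText,
          starsToday: starDigits ? Number(starDigits[0]) : 0,
        };
      });
    });
  } finally {
    // Always release the browser, even if scraping throws
    await this.browser.closeBrowser();
  }
}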

In our design, we created the main.ts file, which is the entry point that calls our scraper code to retrieve the data.


import { Scraper } from './scraper';

async function bootstrap() {
  const scraper = new Scraper();
  try {
    const data = await scraper.scrape();
    console.log(data);
  } catch (e) {
    console.error('error', e);
  }
}
bootstrap();
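To compile and run it (assuming the default tsconfig generated by tsc --init, which emits the JavaScript next to the sources):

tsc
node src/main.js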

Code repo: https://github.com/hendrixroa/skelleton-scraper

Now you are able to scrape your data and store it in the DB of your preference. I have kept the source code as simple as I can so that it is easy to port into your own TypeScript codebase. Check https://github.com/puppeteer/puppeteer/tree/main/examples if you need more inspiration, and see how people use Puppeteer for awesome projects.


Thanks for reading!
