How to create a scraper with a headless browser using Puppeteer and Node.js

At some point we need to collect data from some websites. The normal process is to talk with the owners of those companies, because you would like to buy a subscription to get that data and probably close an agreement. But what happens if that site does not respond, or even worse, responds with a negative answer saying they will not provide an API for third parties? Believe it or not, it happens a bunch of times, and that is why companies develop web crawlers/scrapers: to automate getting valuable HTML data, cleaning it, and saving it in a secure place to use later. There are common myths about whether scraping is legal or not; there are good articles about that online.

What is web scraping?

It is the act of loading a webpage, downloading the HTML, extracting the target information of interest from that HTML, transforming that data, and storing it in an accessible database to use or process later. There are multiple implementations in different stacks and languages, Python, Node.js, PHP, and Go among them, to mention a few.

There are also different ways to achieve it: small libraries and plain web clients can download HTML too. My focus here is on a headless browser.

What makes a headless browser special?

Basically, it opens a browser window process without a GUI, running behind the scenes. It is a little bit confusing, but that is how it works. A headless browser allows you to launch a script or program that acts like a normal user; you can even automate scrolling down and clicking buttons to retrieve more data. That makes it hard for scraper-detection systems to block the request, because it behaves like a "user". It also allows you to choose which browser you want to use.
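As a minimal sketch of what "acting as a user" can look like (the URL and the commented-out selector here are hypothetical placeholders, not from the project we build below):

import * as Puppeteer from 'puppeteer';

async function browseLikeAUser() {
  const browser = await Puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Scroll down one viewport, like a user would, to trigger lazy-loaded content
  await page.evaluate(() => window.scrollBy(0, window.innerHeight));
  // Click a button by its selector, e.g. a hypothetical "load more" button:
  // await page.click('#load-more');
  await browser.close();
}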

Benefits of web scraping

For the company being scraped

  • Even if they decide not to sell an API, they are still generating traffic from automated web crawler servers
  • After studying that traffic, the company should consider selling an API to get more profit.

For the company doing the scraping

  • Get valuable data from target websites

Let's code!

As I mentioned, you can use the stack of your preference, launch your favorite browser, and so on. In my case, I will use Puppeteer, TypeScript, and the Chrome browser, which is the default; you can choose another browser product such as Firefox.

(Link to the code source is at the end of the post)

Assuming you have npm, Node.js, and TypeScript installed:

  • Install TypeScript: npm install -g typescript
  • tsc --init
  • npm install puppeteer

Create the following folder structure:

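(The original screenshot is unavailable; this is a plausible layout assuming the file names used in the rest of this post, and the repo linked at the end may differ slightly.)

scraper-project/
├── src/
│   ├── headless.ts   (the HeadlessBrowser class)
│   ├── scraper.ts    (the Scraper class)
│   └── main.ts       (the entry point)
├── package.json
└── tsconfig.json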

Inside the scraper folder, let's first create the headless handler:

import { EventEmitter } from 'events';
import * as Puppeteer from 'puppeteer';

export class HeadlessBrowser extends EventEmitter {
  protected browser: Puppeteer.Browser;
}

I am extending the EventEmitter class to make it easy to send events when the headless window opens and closes, and to surface logs from inside our scraping statements, which is useful for debugging our queries.
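For example, a consumer of this class can subscribe to those events; a small sketch (the 'console' event name matches what getPage emits further below):

const headless = new HeadlessBrowser();
headless.on('console', (text: string) => {
  console.log('[page log]', text);
});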

Init function:

private async initBrowser() {
  const browserArgs: Puppeteer.PuppeteerLaunchOptions = {
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--headless',
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-web-security',
      '--disable-infobars',
      '--window-position=0,0',
      '--ignore-certificate-errors',
      '--ignore-certificate-errors-spki-list',
    ],
  };
  this.browser = await Puppeteer.launch(browserArgs);
}

We need to make sure that when the browser window is "opened" (run in the background) it ignores some errors, runs in headless mode (remove the --headless param and you will see a window actually open), and matches our requirements. I am using the basic options; the Puppeteer documentation lists many more.
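For example (a sketch, not from the original post), while developing you can swap the body of initBrowser for a visible, slowed-down launch so you can watch the automation:

const debugArgs: Puppeteer.PuppeteerLaunchOptions = {
  headless: false, // show the actual browser window
  slowMo: 50,      // slow each Puppeteer operation down so you can watch it
};
this.browser = await Puppeteer.launch(debugArgs);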

Get page

public async getPage() {
  if (!this.browser) {
    await this.initBrowser();
  }
  const page = await this.browser.newPage();

  // Avoiding bot detection
  const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36';
  await page.setUserAgent(userAgent);

  // Stream the page's console messages to our emitter for debugging
  page.on('console', msg => {
    this.emit('console', msg.text());
  });

  return page;
}

We check whether our browser window already exists; if it does not, we create it.

With the Puppeteer API, we get a new page. A small hack to avoid bot detection is setting the user agent; finally, we stream the page logs to debug our queries.
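Beyond the user agent, there are other common tweaks that make the page look more like a real user's browser (these additions are my assumptions, not part of the original code):

// Set a realistic screen size and a language header
await page.setViewport({ width: 1366, height: 768 });
await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });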

Close function

public async closeBrowser() {
  if (this.browser) {
    await this.browser.close();
  }
}

After we finish scraping the target data, we need to close the window instance.

The scraper code

export class Scraper {
  protected name: string;
  protected baseURL: string;
  protected browser = new HeadlessBrowser();

  constructor() {
    this.name = 'Scraper';
    this.baseURL = 'https://github.com/trending';
  }
}

The Scraper class will contain all the operations we need to perform with the HeadlessBrowser instance. We could have multiple instances, but this time, for brevity, we will use one. As a didactic example, let's use 'https://github.com/trending' to retrieve the top repos and their number of stars.

GitHub already has an API to get all trending repos and their star counts; I am using it purely as a didactic example. As I mentioned, when you need data from a website, first make sure an API exists. If it does not exist, scrape it!

const page = await this.browser.getPage();

await page.goto(this.baseURL, {
  timeout: 300000,
  waitUntil: 'networkidle0',
});

First, we open a page using our browser instance; then we navigate to the URL, similar to typing a URL into the address bar. I pass the timeout and waitUntil params so the page is not torn down before the automated interactions finish and we can scrape the data.
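As an alternative waiting strategy (an assumption on my part, not from the original post), you can wait for a specific element to appear instead of waiting for the network to go idle:

await page.waitForSelector('.Box-row', { timeout: 30000 });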

The evaluate function

const data = await page.evaluate(() => {
  // Your scraping code here 
});        

This is the callback where we put all the document DOM code to extract data from the HTML. At this point we go back to basics, recalling a little bit of DOM scripting.

Inside the evaluate callback, for this particular use case, let's use this code:

const allReposArticles = document.querySelectorAll('.Box-row h1.lh-condensed a');
const allReposArray = Array.from(allReposArticles);
const allNamesRepos = allReposArray.map((item: HTMLElement | any) => {
  return { name: item.innerText };
});        

When we run the document.querySelectorAll code in the web console, we can see a NodeList of the anchor elements that match the selector.

querySelectorAll returns a collection of HTML elements that match our selector, which is exactly the collection of data we want to retrieve.

Because we receive a collection of elements (a NodeList), we need to convert it into an array to manipulate the data. Finally, we map that array into a handy JSON structure to be processed later.

The next thing is to get the number of stars

const regexMatchDigits = /\d+/g;
const allStarArticles = document.querySelectorAll(
  '.Box-row .d-inline-block.float-sm-right',
);
const allStarReposArray = Array.from(allStarArticles);
const allStarsRepos = allStarReposArray.map((item: HTMLElement | any) => {
  // Strip thousands separators so '1,234' does not match as two numbers
  const starDigits = item.innerText.replace(/,/g, '').match(regexMatchDigits);
  return { stars: starDigits ? Number(starDigits[0]) : 0 };
});

The code is self-explanatory; the only difference is that here we are matching the digits of the star count.

And finally, we merge those values into one array of objects.


const dataMerged = allNamesRepos.map((repo: any, index: number) => {
  const obj = {
    name: repo.name,
    starsToday: allStarsRepos[index].stars,
  };
  return obj;
});

Inside the page evaluate function you can add console.info() statements to debug your queries, or use the web console as well.
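Putting the fragments together, the scrape() method that main.ts calls below could look like this (a sketch assembled from the snippets above; the actual method in the linked repo may differ):

public async scrape() {
  const page = await this.browser.getPage();
  try {
    await page.goto(this.baseURL, {
      timeout: 300000,
      waitUntil: 'networkidle0',
    });
    // Everything inside evaluate runs in the page context
    return await page.evaluate(() => {
      const regexMatchDigits = /\d+/g;
      const repoLinks = Array.from(
        document.querySelectorAll('.Box-row h1.lh-condensed a'),
      );
      const starSpans = Array.from(
        document.querySelectorAll('.Box-row .d-inline-block.float-sm-right'),
      );
      return repoLinks.map((item: any, index: number) => {
        const starText: string = starSpans[index]
          ? (starSpans[index] as any).innerText
          : '';
        const starDigits = starText.replace(/,/g, '').match(regexMatchDigits);
        return {
          name: item.innerText,
          starsToday: starDigits ? Number(starDigits[0]) : 0,
        };
      });
    });
  } finally {
    // Always release the browser, even if scraping throws
    await this.browser.closeBrowser();
  }
}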

In our design, we created the main.ts file, which is the entry point that calls our scraper code to retrieve the data.


import { Scraper } from './scraper';

async function bootstrap() {
  const scraper = new Scraper();
  try {
    const data = await scraper.scrape();
    console.log(data);
  } catch (e) {
    console.error('error', e);
  }
}
bootstrap();
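To compile and run it (assuming the default tsconfig generated by tsc --init, which emits the JavaScript next to the sources):

tsc
node src/main.js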

Code repo: https://github.com/hendrixroa/skelleton-scraper

Now you are able to scrape your data and store it in the DB of your preference. I have kept the source code as simple as I can so that it is easy to port into your own TypeScript codebase. Check https://github.com/puppeteer/puppeteer/tree/main/examples if you need more inspiration, and see how people use Puppeteer for awesome projects.


Thanks for reading!
