Puppeteer Cluster: Beginner's Guide

If you are a developer looking to perform mass scraping across various websites and traditional linear scraping methods don't work for you, this article is for you. There is a library called puppeteer-cluster that helps Puppeteer developers perform mass scraping.


In this article, we will discuss how to use Puppeteer Cluster for mass scraping and how to configure it. Here are the contents of this article:

  1. What is Clustering?
  2. Advantages of Clustering
  3. What is Puppeteer cluster?
  4. Installation and Basic setup
  5. Advanced setup of Puppeteer cluster
  6. Preventing common errors with Puppeteer cluster


What is Clustering?


Clustering in Node.js

Clustering in Node.js empowers applications to fully leverage the parallel processing power of modern hardware. Rather than relying on a single Node.js process to manage all the logic, clustering allows you to spawn multiple identical worker processes that share the workload. Each worker operates independently, handling incoming connections and processing requests simultaneously, which significantly enhances the efficiency and scalability of your application.


These worker processes in Node.js are essentially separate instances of the application, yet they share the same port. The master process, which orchestrates the cluster, is responsible for distributing incoming connections across the workers, ensuring a balanced load. If one worker becomes overwhelmed or crashes (in our case, while managing a browser instance), the master can spawn a new one to maintain seamless service. This approach not only improves fault tolerance but also maximizes resource utilization, making your Node.js application more robust and scalable in high-traffic environments.


Advantages of Clustering

There are several benefits to using clustering in Node.js:

  • Improved Performance: By distributing the workload across multiple CPU cores, clustering can significantly improve the overall performance of Node.js applications, leading to faster response times and better throughput.
  • Enhanced Scalability: Clustering makes it easier to scale Node.js applications horizontally by adding more worker processes as needed. As the application load increases, you can dynamically spawn additional workers to handle the increased traffic.
  • Fault Tolerance: Clustering provides built-in fault tolerance by isolating worker processes from each other. If one worker crashes due to an unexpected error, the remaining workers can continue to handle requests without interruption.
  • Resource Utilization: Clustering allows you to make full use of the available system resources, effectively leveraging the multi-core architecture of modern servers. This can lead to better utilization of CPU and memory resources, resulting in higher application efficiency.


What is Puppeteer cluster?

Suppose you are attempting large-scale web scraping and have to manually manage multiple instances of Puppeteer. In a typical scenario, such as scraping data from multiple e-commerce sites for price comparison, you would need to write separate scripts for each Puppeteer instance. You will have to initiate and control them individually. This task can be very challenging, as managing the synchronization of these instances becomes complex, especially when dealing with hundreds of pages. You will also have to handle errors and crashes in individual instances, and resource allocation must be manually optimized to prevent overloading the system.

Puppeteer Cluster offers an elegant solution to these challenges. It allows you to create a pool, or 'cluster,' of browser instances that automates the management of these processes. Instead of handling each browser instance individually, Puppeteer Cluster lets you establish a cluster to which you can assign tasks. The cluster intelligently distributes these tasks across its pool of browser instances, optimizing resource usage and maximizing efficiency. One of the standout features of Puppeteer Cluster is its ability to control concurrency—you can specify how many tasks run in parallel, effectively managing system resources and preventing server overload.

In addition, Puppeteer Cluster is equipped with robust error-handling mechanisms. If a browser instance encounters an error or crashes, the cluster automatically retries the task with another instance, ensuring continuity. The cluster is also highly scalable, accommodating tasks that require processing tens, hundreds, or even more pages simultaneously, making it adaptable to the varying demands of your scraping projects.

With a clear understanding of the benefits Puppeteer Cluster offers, let's explore how to implement it effectively in your workflow.


Installation and Basic setup

puppeteer-cluster is an npm package that can be easily installed via npm or any other Node package manager. It installs Puppeteer as a dependency, so there is no need to install Puppeteer separately. Run the following command in your project's terminal:

$ npm install puppeteer-cluster        

Let’s see how to use puppeteer-cluster for scraping. In this example, we will run two browsers in parallel.

const { Cluster } = require('puppeteer-cluster');

async function main() {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
        puppeteerOptions: {
            headless: false,
            defaultViewport: null
        }
    });

    cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        const title = await page.title();
        console.log(`Title of ${url}: ${title}`);
    });

    cluster.queue('https://www.google.com');
    cluster.queue('https://www.github.com');

    await cluster.idle();
    await cluster.close();
}

main();        

Output:

Title of https://www.google.com: Google
Title of https://www.github.com: GitHub: Let’s build from here · GitHub        

In this script, we start by initializing a new cluster using the `Cluster.launch` method. We set the concurrency mode to `Cluster.CONCURRENCY_CONTEXT`, which ensures that each browser operates in its own context, similar to incognito sessions. The `maxConcurrency` option is set to 2, allowing two browser instances to run concurrently. You can adjust this number based on your specific needs.

The `cluster.task` method outlines what each browser instance should do. In this case, it navigates to a specified URL, retrieves the page title, and logs it to the console. You’ll need to modify this logic depending on the task you're working on.

Next, we queue two URLs using `cluster.queue`. Puppeteer Cluster will automatically distribute these URLs across the available browser instances. You can queue as many URLs as necessary.

Finally, `cluster.idle` waits for all the queued tasks to finish, and `cluster.close` shuts down the cluster, closing all browser instances.


Advanced setup of Puppeteer cluster

The concept of concurrency mode is fundamental when working with Puppeteer-Cluster, as it dictates how tasks like opening pages and executing scripts are managed and kept separate from one another.

Puppeteer-Cluster offers three concurrency models to choose from, depending on your specific use case. Selecting the appropriate concurrency model is crucial for optimizing performance. The three options are:

  • CONCURRENCY_PAGE: In this model, each worker in the cluster manages a single page within a shared browser instance. This is ideal for tasks that can operate within the same browser context but require separate pages.

Use case: Suppose you're developing a script to scrape product prices from various e-commerce websites. The goal is to open multiple product pages and extract pricing information. Each worker can handle a separate product page within the same browser instance. Since all these pages can share the same browser context—such as cookies from the e-commerce site—this model is efficient.

  • CONCURRENCY_CONTEXT: In this setup, each worker operates within a distinct browser context inside the same browser instance. This model is best suited for tasks that require isolation from each other but do not need separate browser instances.

Use case: Consider a scenario where you need to scrape data from multiple social media profiles, requiring different login sessions. This model is ideal for managing multiple login sessions simultaneously. Each context can log in to a different account, isolating session data like cookies and local storage.

  • CONCURRENCY_BROWSER: This model assigns each worker its own browser instance, offering complete isolation at the browser level. While it is the most resource-intensive option, it provides the highest level of isolation and flexibility.

Use case: This model is suitable if you need to scrape a website and ensure that the information remains consistent across different browser types.

To choose a concurrency model in Puppeteer-Cluster, you simply set the `concurrency` property when launching the cluster.
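As an illustration, here is a minimal sketch of launching a cluster with browser-level isolation; the queued URL is a placeholder, and the option names follow the puppeteer-cluster README:

```javascript
const { Cluster } = require('puppeteer-cluster');

async function main() {
    const cluster = await Cluster.launch({
        // Swap this for Cluster.CONCURRENCY_PAGE or Cluster.CONCURRENCY_CONTEXT
        // depending on how much isolation your tasks need.
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 4
    });

    cluster.task(async ({ page, data: url }) => {
        // With CONCURRENCY_BROWSER, each task runs in its own browser instance.
        await page.goto(url);
        console.log(await page.title());
    });

    cluster.queue('https://example.com');

    await cluster.idle();
    await cluster.close();
}

main();
```

Everything else (queueing, `cluster.task`, `idle`, `close`) works the same regardless of which of the three models you pick; only the degree of isolation and the resource cost change.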


Preventing common errors with Puppeteer cluster

When working with Puppeteer-Cluster, encountering errors is common, especially with complex web scraping or automation tasks. Here, we’ll explore three frequent issues and their solutions.


Error: Page Detached from Node

This error can be particularly frustrating, as it typically arises when a page’s task is time-consuming and exceeds 30 seconds. After extensive trial and error, I discovered a less-documented attribute in the Cluster.launch options: the timeout attribute. This attribute controls the termination of a Page instance within a task, with a default value of 30,000 milliseconds (30 seconds). When the page is terminated, any attempt to use querySelector or similar methods on the page's document will result in a "Detached" error from the cluster.

Solutions:

To prevent this issue, as suggested in a GitHub issue for the library, adjust the timeout attribute to a higher value. You can set it in the following manner:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    puppeteerOptions: {
        headless: false,
        defaultViewport: null
    },
    timeout: Number.MAX_SAFE_INTEGER
});        

Setting the timeout to a sufficiently high value or disabling it altogether helps avoid premature termination of the Page instance, ensuring that tasks have adequate time to complete and reducing the likelihood of encountering the Detached error.


TimeoutError: Navigation Timeout Exceeded

This error occurs when a page takes too long to load. Potential causes include slow network conditions, heavy page content, or unresponsive servers.

Solution:

  • Increase Timeout: Adjust the timeout settings by using `await page.goto(url, { timeout: 60000 })` or `page.waitForSelector(selector, { timeout: 60000 })`. This allows more time for the page to load.
  • Investigate Issues: Examine the reasons behind slow network conditions, heavy page content, or unresponsive servers.
  • Implement Retries: Introduce automatic retry or page refresh mechanisms to handle transient issues.

  • Disable JavaScript: If the website heavily relies on JavaScript, try disabling it with `page.setJavaScriptEnabled(false)` to see if it improves load times.
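One way to implement the retry suggestion above is a small generic helper (hypothetical, not part of puppeteer-cluster) that re-runs an async operation a few times before giving up:

```javascript
// Re-runs an async operation up to `attempts` times, waiting `delayMs`
// between attempts; rethrows the last error if every attempt fails.
async function withRetries(fn, { attempts = 3, delayMs = 1000 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
            // Wait before the next attempt, unless this was the last one.
            if (attempt < attempts) {
                await new Promise((resolve) => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastError;
}
```

Inside a cluster task you could then write `await withRetries(() => page.goto(url, { timeout: 60000 }))`, so a transient navigation failure is retried instead of failing the whole task.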

By addressing these common errors effectively, you can improve the reliability and efficiency of your Puppeteer-Cluster tasks.


Error: Page Closed Unexpectedly

This error occurs when there are unhandled exceptions in the task, crashes within the browser instance, or when the browser or Chromium instances are closed manually. In Puppeteer-Cluster, it’s vital to manage crawling errors effectively and ensure that tasks are restarted automatically, particularly for large-scale scraping operations.

Solutions:

  • Implement Error Handling: Ensure comprehensive error handling is in place to catch and manage exceptions.
  • Optimize Resource Management: Properly manage resources to prevent crashes and ensure stable operation.
  • Avoid Manual Browser Closure: Refrain from manually closing the browser instances to prevent unexpected disruptions.
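For the first point, puppeteer-cluster provides built-in hooks: a `taskerror` event for logging failures, and `retryLimit`/`retryDelay` launch options that re-queue failed tasks automatically (names per the library's README). A sketch, with a placeholder URL:

```javascript
const { Cluster } = require('puppeteer-cluster');

async function main() {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
        retryLimit: 2,    // re-queue a failed task up to 2 more times
        retryDelay: 1000  // wait 1s before each retry
    });

    // Fires when a task throws; without this handler the error is
    // only printed, so this is the place to log or re-queue manually.
    cluster.on('taskerror', (err, data) => {
        console.log(`Error crawling ${data}: ${err.message}`);
    });

    cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
    });

    cluster.queue('https://example.com');
    await cluster.idle();
    await cluster.close();
}

main();
```

With this in place, a crashed browser instance costs you one retried task rather than the whole run.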

