7 Tips for Web Automation Optimization Using Puppeteer

Using Puppeteer to automate browser tasks is an excellent way to boost your efficiency as a developer. In this article, I will provide seven tips to help you optimize these tasks.


Contents

  • Introduction
  • Tip 1: Utilize Session Cookies to Bypass the Login Process
  • Tip 2: Leverage 'userDataDir' to Reuse the Same Browser Instance
  • Tip 3: Redirect Browser Console Logs to Node.js for Easier Debugging
  • Tip 4: Optimize Scripts by Minimizing Redundant Navigation Steps
  • Tip 5: Clear the Puppeteer Folder Before Browser Swapping for Cross-Browser Testing
  • Tip 6: Experiment with 'wait-' Options to Ensure Full Page Load
  • Tip 7: Speed Up Puppeteer by Disabling CSS, Images, and Other Unnecessary Resources
  • Conclusion


Introduction

Web automation allows you to navigate the web without manual intervention, enabling you to handle tasks like filling out forms, clicking buttons, navigating pages, scraping website data, and testing web applications. By automating repetitive browser-related tasks, you can focus more time and energy on building essential features.

Puppeteer is one of the most popular libraries in the JavaScript ecosystem for web automation. It offers a high-level API to control Chrome/Chromium via the DevTools Protocol.

While Puppeteer is a powerful tool, it can take some time to master, especially if you're new to it. That’s why in this article, I'll share seven tips to enhance your web automation experience with Puppeteer.


Tip 1: Utilize Session Cookies to Bypass the Login Process

If you need to scrape or crawl data that requires authentication, skipping the login page can save you time.

Instead of logging in with Puppeteer, log in manually in your Chrome browser. Then export the session cookies to a JSON file using a cookie export extension and load that file in your code.

Puppeteer's Page.setCookie() method lets you reuse that logged-in session, so you can navigate directly to the password-protected URL for as long as the session remains valid.

const fs = require('fs');

// Load the cookies exported from the logged-in browser session
const cookiesString = fs.readFileSync('cookies.json', 'utf8');
const parsedCookies = JSON.parse(cookiesString);

// page.setCookie() accepts several cookies at once
if (parsedCookies.length !== 0) {
    await page.setCookie(...parsedCookies);
}

await page.goto("password-protected-url", { waitUntil: 'domcontentloaded' });

You can also do this when you need to run the script multiple times or run different scripts on the website.

If you prefer to use Puppeteer to log in, you can retrieve the browser cookies using Page.cookies(). Save them as a JSON file in your specified directory and use them for subsequent script runs.

const cookies = await page.cookies();

fs.writeFile('cookies.json', JSON.stringify(cookies), function (err) {
  if (err) {
    console.log("Couldn't save session.", err);
    return;
  }
  console.log('The session has been saved.');
});
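
Saved cookies eventually expire, so it can help to filter them before reusing a session. The following is a minimal sketch, assuming the cookies were saved from Page.cookies(), where expires is a Unix timestamp in seconds and session cookies use -1. The helper name dropExpiredCookies is hypothetical, and cookies exported by a browser extension may use a different field name.

```javascript
// Hypothetical helper: keep only cookies that are still usable.
// Assumes Puppeteer's cookie shape, where `expires` is a Unix timestamp
// in seconds and session cookies have `expires` set to -1.
function dropExpiredCookies(cookies, nowMs = Date.now()) {
  const nowSeconds = nowMs / 1000;
  return cookies.filter((cookie) => {
    const isSessionCookie = cookie.expires === undefined || cookie.expires === -1;
    return isSessionCookie || cookie.expires > nowSeconds;
  });
}
```

If the filtered list comes back empty, the saved session is probably no longer valid, and logging in again to save fresh cookies is the likely next step.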


Tip 2: Leverage 'userDataDir' to Reuse the Same Browser Instance

This tip is also useful for the previous scenario where we want to skip the login page for a password-protected URL. Chromium’s user data directory contains profile data such as history, bookmarks, cookies, as well as other per-installation local states.

Launching Puppeteer with the userDataDir property saves this data to the given directory, so every run starts with the same browser profile.

puppeteer.launch({
  userDataDir: "./tmp"
});        

Because the session cookies are saved in the user data directory and Puppeteer loads the same profile on each run, you can use this to skip the login page too. Log in once, and the session will be available for subsequent script runs.

However, the profile data may take up some space.

If other profile data are not utilized, storing only the browser cookies will save more space. That said, it only applies to persisting a login session. Other browser profile data could still be useful for other cases.
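
To judge whether the profile directory is worth keeping, you can measure how much disk space it uses. The following is a rough sketch using only Node.js built-ins; the helper name directorySizeBytes is hypothetical, and "./tmp" in the usage note is simply the directory from the launch example above.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: total size in bytes of all regular files under `dir`.
function directorySizeBytes(dir) {
  let total = 0;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Recurse into subdirectories of the profile.
      total += directorySizeBytes(fullPath);
    } else if (entry.isFile()) {
      total += fs.statSync(fullPath).size;
    }
  }
  return total;
}
```

Calling directorySizeBytes('./tmp') after a run gives a figure you can compare against the size of a cookies-only JSON file.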


Tip 3: Redirect Browser Console Logs to Node.js for Easier Debugging

This tip is specifically for automated testing. Typically, a website's client-side console messages only appear in the browser’s inspector, not directly in Node.js logs.

To automatically open the browser inspector when running Puppeteer, you can set devtools to true:

const browser = await puppeteer.launch({
  devtools: true
});

However, this will default to opening the “Elements” tab, and there's no built-in way to open the “Console” tab directly.

To view console messages in real time while Puppeteer is running, listen to the page's console event. Each event carries a message object with the logged text and its type:

page.on('console', (message) =>
  console.log(`${message.type().slice(0, 3).toUpperCase()} ${message.text()}`));

This approach allows console messages to appear in your Node.js logs immediately, making debugging much easier.
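
The same idea can be wrapped in a small reusable function. This is a sketch rather than a definitive setup: the names formatConsoleMessage and forwardBrowserLogs are hypothetical, and it also listens to Puppeteer's pageerror event on the assumption that uncaught page exceptions are worth logging alongside console output.

```javascript
// Hypothetical helper: turn a console message into one log line.
// `message` is expected to expose type() and text(), as Puppeteer's
// ConsoleMessage does.
function formatConsoleMessage(message) {
  const label = message.type().slice(0, 3).toUpperCase();
  return `${label} ${message.text()}`;
}

// Forward console output and uncaught page errors to the Node.js log.
// `log` defaults to console.log but can be swapped for any logger.
function forwardBrowserLogs(page, log = console.log) {
  page.on('console', (message) => log(formatConsoleMessage(message)));
  page.on('pageerror', (error) => {
    // The payload is normally an Error; fall back to String() otherwise.
    const text = error && error.message ? error.message : String(error);
    log(`ERR ${text}`);
  });
}
```

Calling forwardBrowserLogs(page) right after browser.newPage() attaches the listeners before any navigation, so early messages are not missed.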


Tip 4: Optimize Scripts by Minimizing Redundant Navigation Steps

One effective way to optimize your Puppeteer scripts is by reducing unnecessary navigation steps. Each time you navigate to a new URL, it adds to the overall execution time. If your script repeatedly navigates to the same page or reloads a page unnecessarily, it can significantly slow down the automation process, especially when dealing with a large number of pages.

To minimize redundant navigation, consider the following strategies:

  1. Reuse Pages: Instead of opening new pages or tabs for every action, reuse existing ones whenever possible. This reduces the overhead associated with launching new browser instances or tabs.
  2. Efficient Routing: If your script needs to interact with multiple pages on the same site, plan the navigation path to avoid unnecessary back-and-forth. For example, gather all data or perform all actions on one page before moving on to the next.
  3. Conditional Navigation: Before navigating to a new page, check if the desired content or action can be achieved without reloading or moving to a different page. Often, JavaScript can dynamically update content, making full-page reloads unnecessary.
  4. Cache Responses: For pages that don’t change often, consider caching responses and reusing them instead of navigating to the same page multiple times. This can be particularly useful in scenarios like scraping static data.

By streamlining navigation within your script, you can significantly reduce execution time and make your Puppeteer automation more efficient.
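
The page-reuse and caching strategies above can be sketched as a single helper. This is a minimal illustration, not a complete crawler: the function name scrapeWithCache and the extract callback are hypothetical, and it assumes that results for a given URL stay valid for the lifetime of the cache.

```javascript
// Hypothetical helper: visit each URL once on a single reused page and
// cache the extracted result, so repeated URLs cost no extra navigation.
// `extract` is a caller-supplied async function that reads data from the page.
async function scrapeWithCache(page, urls, extract, cache = new Map()) {
  const results = [];
  for (const url of urls) {
    if (!cache.has(url)) {
      // Skip the navigation when the page is already on this URL.
      if (page.url() !== url) {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
      }
      cache.set(url, await extract(page));
    }
    results.push(cache.get(url));
  }
  return results;
}
```

Passing the same Map to later calls keeps the cache across batches, which suits static pages but not content that changes between runs.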

Tip 5: Clear the Puppeteer Folder Before Browser Swapping for Cross-Browser Testing

Cross-browser testing verifies that your code works across different browsers. You can run Puppeteer with Firefox by specifying it in the product property.

const browser = await puppeteer.launch({
  product: 'firefox'
});

However, a default Puppeteer install downloads only Chromium, so Puppeteer will still launch Chromium even if you set product to firefox.

To switch, delete the /node_modules/puppeteer folder and reinstall Puppeteer with the product set to Firefox. Note that recent Puppeteer releases rename this option to browser, so check the documentation for your installed version.

PUPPETEER_PRODUCT=firefox npm i        

This will install Firefox in the /node_modules/puppeteer folder.

Tip 6: Experiment with 'wait-' Options to Ensure Full Page Load

Puppeteer needs to determine the right moment to perform the next action after navigating to a URL. For instance, Puppeteer should wait until a page is fully loaded before capturing a screenshot.

When navigating to a URL, you can choose which Puppeteer lifecycle event to wait for with the WaitForOptions.waitUntil property. This applies to both the Page.goto() and Page.waitForNavigation() methods.

The script proceeds only once the chosen event has occurred:

  • load (default): Navigation is considered complete when the load event is triggered.
  • networkidle0: Navigation is considered complete when there are no more than 0 network connections for at least 500ms.
  • networkidle2: Navigation is considered complete when there are no more than 2 network connections for at least 500ms.
  • domcontentloaded: Navigation is considered complete when the DOMContentLoaded event is triggered.

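// Wait for the network to go idle when opening a URL
await page.goto('https://www.google.com/', { waitUntil: 'networkidle0' });
// Wait for a navigation started by an earlier action, such as a click
await page.waitForNavigation({ waitUntil: 'networkidle0' });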

However, these events don’t always guarantee that the page has fully loaded. For instance, some JavaScript scripts might still be running in the background after these events are triggered, potentially modifying the page content.

If you know that a specific HTML element will only appear after the scripts have finished running, you can use waitForSelector with CSS selectors to wait for that element.
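
A minimal sketch of that approach follows. The helper name gotoAndWaitFor, the 10-second default timeout, and the '#results' selector in the usage note are assumptions for illustration; pick a selector that only appears once your target page has finished rendering.

```javascript
// Hypothetical helper: navigate, then wait for an element that signals the
// page's scripts have finished rendering. Returns true if the element
// appeared and false if waiting failed (for example, on timeout).
async function gotoAndWaitFor(page, url, selector, timeout = 10000) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  try {
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch (error) {
    return false;
  }
}
```

For example, await gotoAndWaitFor(page, url, '#results') lets the script decide whether to take a screenshot or retry, instead of failing on the first timeout.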

Tip 7: Speed Up Puppeteer by Disabling CSS, Images, and Other Unnecessary Resources

A difference of 1-2 seconds might seem insignificant when scraping just a few pages, but it can have a significant impact on performance when dealing with tens of thousands of pages.

If your project doesn't require CSS or images, disabling them can speed up page load times. You can achieve this by intercepting the browser's HTTP requests and blocking any network requests for unnecessary resources.

const page = await browser.newPage();
await page.setRequestInterception(true);

// Abort requests for resource types the script does not need
page.on('request', (request) => {
    if (['image', 'stylesheet', 'font'].includes(request.resourceType())) {
        request.abort();
    } else {
        request.continue();
    }
});

