Web Scraping using Playwright in Python and Javascript
ScrapeHero
We are a data company providing custom data APIs, custom alternative data, POI location data, and RPA solutions.
Browser-based web scraping provides the quickest and easiest solution for scraping JavaScript-based, client-side rendered web pages. There are multiple frameworks available to build and run browser-based web scrapers. The most common among these are Selenium, Puppeteer, and Playwright. We have already covered Selenium and Puppeteer in our previous articles. Now, let’s take a look at Playwright, the browser automation framework from Microsoft.
What is Playwright?
Playwright is a browser automation framework with APIs available in JavaScript, Python, .NET, and Java. Its simplicity and powerful automation capabilities make it an ideal tool for web scraping. It also comes with headless browser support.
Features of Playwright:
- Cross-browser support: a single API drives Chromium, Firefox, and WebKit.
- Cross-language APIs: JavaScript, Python, .NET, and Java.
- Headless and headed execution modes.
- Built-in support for browser contexts, auto-waiting, and network interception.
You can also read: How to Scrape Google Maps: Code and No-Code Approach
Installation
Python:
Install the Python package:
pip install playwright
Install the required browsers:
playwright install
Javascript:
Install using npm:
npm init -y
npm install playwright@latest
Install the CSV writer package:
npm i objects-to-csv
You can also use playwright codegen to record actions in the browser and turn them into code, as shown in the example below.
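For example, a command like the one below opens a browser, records your interactions with the demo site used in this article, and prints the equivalent Python code (the --target option selects the output language):
playwright codegen --target python https://scrapeme.live/shop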
Building a scraper
Let’s create a scraper using Playwright to scrape data from the first 3 listing pages of https://scrapeme.live/shop. We will collect basic details, such as the product name, from each listing.
Source Code on Github
You can view the complete code here:
Import the required libraries:
In Python, Playwright supports both synchronous and asynchronous operations. Node.js, however, is asynchronous by nature, so Playwright only offers an asynchronous API there.
In this article, we use the asynchronous API in both languages.
# Python
from playwright.async_api import async_playwright
import asyncio
// Javascript
const { chromium } = require('playwright');
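For readers new to the asynchronous API, here is a minimal sketch (in Python) of how the imported pieces fit together; the main coroutine name is our own choice, and each step is explained in the sections that follow:
# Python
# Minimal structure of an async Playwright script
async def main():
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto('https://scrapeme.live/shop')
        # ... scraping steps go here ...
        await context.close()
        await browser.close()

asyncio.run(main())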
Launch the Browser instance:
Here, we can define the browser (Chromium, Firefox, WebKit) and pass the required arguments.
Async/await is a feature that lets you write asynchronous code that waits for results without blocking. Rather than running on multiple threads, the work is interleaved on a single event loop: the await keyword releases control back to the event loop while an operation (such as a network request) is in progress, so other tasks can run in the meantime.
# Python
# Launch the headed browser instance
browser = await playwright.chromium.launch(headless=False)
# Python
# Launch the headless browser instance
browser = await playwright.chromium.launch(headless=True)
// Javascript
// Launch headless browser instance
const browser = await chromium.launch({
headless: true,
});
// Javascript
// Launch headed browser instance
const browser = await chromium.launch({
headless: false,
});
Create a new browser context:
Playwright allows us to create a new context from an existing browser instance; each context has its own cookies and cache that are not shared with other browser contexts.
# Python
# Creates a new browser context
context = await browser.new_context()
// Javascript
// Creates a new browser context
const context = await browser.newContext();
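As a minimal illustration of this isolation (in Python), two contexts created from the same browser keep separate cookies, cache, and storage; the user_agent value below is a hypothetical example showing that each context can also be configured independently:
# Python
# Each context is isolated from the others
context_a = await browser.new_context()
context_b = await browser.new_context(user_agent='my-scraper/1.0')  # hypothetical UA string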
Create a page from the browser context:
# Python
# opens new page
page = await context.new_page()
// Javascript
// Open new page
const page = await context.newPage();
This will open a Chromium browser. Now, let’s navigate to the listing page. We can use the below lines of code to perform the navigation:
# Python
# Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop')
// Javascript
// Go to https://scrapeme.live/shop
await page.goto('https://scrapeme.live/shop');
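If you want to confirm that the navigation succeeded before scraping, page.goto returns a response object whose HTTP status you can inspect; a minimal Python sketch:
# Python
# goto returns the response for the main resource
response = await page.goto('https://scrapeme.live/shop')
print(response.status)  # e.g. 200 on success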
Find and select all product listings:
The products (Pokémon) are listed on this page. To get the data for each product, we first need to find the element that contains it and then extract the data from that element.
If we inspect one of the product listings, we can see that every product is inside a <li> tag, with a common class name “product”.
We can select all such products by looking for all <li> tags with a class name “product”, which can be represented as the CSS selector li.product .
The query_selector_all method lets you get all the elements that match the selector. If no elements match, it returns an empty list ([]).
# Python
all_items = await page.query_selector_all('li.product')
// Javascript
// $$eval runs the callback inside the page context with every matching element;
// the extraction logic goes inside the callback body
const product = await page.$$eval('li.product', all_items => {})
Select data from each listing:
From each product listing, we need to extract the required data points, such as the product name.
To get these details, we need to find the CSS selectors for the data points. You can do that by inspecting the element and noting its tag name and class name.
For example, inspecting a listing shows that the product name sits inside the h2 tag of each item.
We can use the query_selector function for selecting individual elements. query_selector returns the first matching element; if no element matches the selector, the return value resolves to None (null in JavaScript). You can see the implementation below:
# Python
# Loop over each product element selected earlier
for item in all_items:
    name_el = await item.query_selector('h2')
// Javascript
// Loop over each product element inside the $$eval callback,
// where the browser's querySelector is available on each DOM node
all_items.forEach(item => {
    const name_el = item.querySelector('h2');
});
Extracting text from the elements:
Now, we need to extract the text from the elements. We can use inner_text (Python) or the element's innerText property (JavaScript) for this.
# Python
name = await name_el.inner_text()
// Javascript
const name = name_el.innerText;
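Since query_selector returns None when nothing matches, it is worth guarding the extraction; a minimal Python sketch:
# Python
# Guard against listings where the expected element is missing
if name_el:
    name = await name_el.inner_text()
else:
    name = None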
Navigate to the next page:
Now, we need to extract the data from the next page. To perform this action, we need to find the element locator of the next button. For this, we can use the locator method in Playwright.
The locator method returns an element locator that can be used for various operations, such as click, fill, and tap. It supports text matching (including regular expressions), XPath, and CSS selectors.
# Python
next = page.locator("text=→").nth(1)
// Javascript
const next = page.locator("text=→").nth(1);
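The same element could also be targeted through another selector engine, for example with an XPath expression (the expression below is illustrative and not verified against the page):
# Python
# locator() also accepts XPath expressions
next = page.locator("xpath=//a[contains(text(), '→')]").nth(1)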
Now, we need to click the next button using the click function. You may also need to wait for the required elements to load on the new page; for this, we can use wait_for_selector (Python) or waitForSelector (JavaScript).
# Python
await next.click()
# wait for the selector to load
await page.wait_for_selector('li.product')
// Javascript
await next.click();
// wait for selector to load
await page.waitForSelector('li.product');
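Putting these pieces together, a pagination loop over the first 3 listing pages might look like the following Python sketch, which reuses the selectors shown earlier:
# Python
# Scrape the current page, then click "next" twice for pages 2 and 3
for page_number in range(3):
    all_items = await page.query_selector_all('li.product')
    for item in all_items:
        name_el = await item.query_selector('h2')
        if name_el:
            print(await name_el.inner_text())
    if page_number < 2:
        await page.locator("text=→").nth(1).click()
        await page.wait_for_selector('li.product')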
Close the browser and context:
After completing the task, we need to close the context and browser instances.
# Python
await context.close()
await browser.close()
// Javascript
await context.close();
await browser.close();
After closing the context and browser, we need to save the data to a CSV file. To save to CSV in JavaScript, we need an external package; the installation command is given below.
npm i objects-to-csv
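In Python, the standard csv module is enough. A minimal sketch, assuming the scraped records were collected into a list of dictionaries called items, each with a name key:
# Python
import csv

# Write the collected records to a CSV file
with open('products.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name'])
    writer.writeheader()
    writer.writerows(items)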
Setting up headless mode and proxies in the browser:
Why do you need proxies for web scraping?
A proxy acts like an invisible cloak: it hides your IP address and lets you access the data you need without being blocked. With a proxy, the website you request no longer sees your original IP address; instead, it sees the proxy’s IP address, allowing you to browse the website without getting detected.
You can check out this article to learn more: How To Rotate Proxies and change IP Addresses using Python 3
Why do you need a headless browser?
A browser without a user interface (UI) is called a headless browser. It can render a website just like any other standard browser, but because there is no UI to draw, it has minimal overhead, consumes fewer resources, and is typically faster, which makes it well suited to tasks like web scraping and automation.
Both of these can be achieved while defining and launching the browser:
// Javascript
const browser = await chromium.launch({
headless: true,
proxy: {
server: '<proxy>',
username: '<username>',
password: '<password>'
}
});
# Python
browser = await playwright.chromium.launch(headless=True, proxy={
"server": "<proxy>",
"username": "<username>",
"password": "<password>"
})
Source Code on Github
You can view the complete code here:
If you would like to learn how to speed up your browser-based web scrapers, please read the article below.