Writing a web scraper with ChatGPT. Is it a good idea?

In November, after OpenAI released ChatGPT, based on GPT-3, the news was literally everywhere. I also wrote on that occasion about AI and web scraping and, after that, in every interview I’ve done I’ve asked my interlocutors for their point of view on the state of AI in the web scraping industry.

Five months later, we have GPT-4, and tons of applications have been built on top of GPT models, so it’s time to take a closer look at AI for web scraping.

Can AI write scrapers for us?

At the moment, we cannot expect ChatGPT to write a fully working scraper for a chosen website. It will return a syntactically correct scraper, but with generic selectors that are not useful for our case. If we ask it to scrape some well-known website, it might return a correct mapping, but only if the answer was already given somewhere like Stack Overflow in the past.
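For example, when asked for a generic product scraper with no details about the target page, it typically returns a skeleton of this kind, with placeholder selectors you would still need to replace yourself (an illustrative sketch, not ChatGPT’s verbatim output):

import scrapy

class GenericProductSpider(scrapy.Spider):
    name = 'generic_products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Placeholder selectors: syntactically valid, but they won't match
        # anything until replaced with the real ones for the target site
        for product in response.xpath('//div[@class="product"]'):
            yield {
                'name': product.xpath('.//h2/text()').get(),
                'price': product.xpath('.//span[@class="price"]/text()').get(),
            }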

Given that, let’s try to build a scraper from scratch with ChatGPT for a niche website such as https://www.gianvitorossi.com/it_it/

I’ll go straight to the final prompt I wrote to get a correct Scrapy spider, where I asked it to map four fields and described to ChatGPT the crawling plan for the website.

Create a Python web scraper using the Scrapy framework and XPath selectors. The start_urls parameter should contain the URL https://www.gianvitorossi.com/it_it to be scraped.
After scraping the first page, the scraper should enter a function called get_categories.
In this function the scraper extracts all the relative links to different product types, using the following XPath selector: product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
For each product page, it should enter a function called get_product_list.
Extract the list of products on the page using the XPath selector '//a[contains(@class, "b-product_tile-image_link")]/@href'.
If there are more products to be shown, crawl to the following product page using the XPath ('//a[data-event-click.prevent="loadMore"]/@href').extract()[0], which contains the absolute path to the next page, and include this selector in a try/except clause. If there is no match for the selector, a pass instruction should be used in the except clause.
For each product extracted, enter a function called get_product_details.
Write the logic to scrape the following fields:
- product code
- full_price
- price
- product_url
The product detail page contains a JSON containing most of the data we need. It can be found by looking for the first occurrence of the script type="application/ld+json" tag and extracting the text from there. It follows the standard Product type from Schema.org (https://schema.org/), and you can map the field product_code to the JSON's mpn field.
The field price can be mapped by entering the offers dict in the JSON and selecting the price field.
The product_url field can instead be mapped using the response.url value.
Since there are no discounts on this website, the full_price field can be mapped to the price field.
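
To make the mapping concrete, this is roughly the shape of the JSON-LD block the prompt refers to (a trimmed, illustrative example of Schema.org Product markup with made-up values, not the site’s actual data):

{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Example pump",
  "mpn": "G12345",
  "offers": {
    "@type": "Offer",
    "price": "750.00",
    "priceCurrency": "EUR"
  }
}

Here product_code comes from mpn, while price (and, on this site, full_price) comes from offers.price.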

And here’s the code given in the response:

import scrapy
import json

class GianvitoRossiSpider(scrapy.Spider):
    name = 'gianvitorossi'
    start_urls = ['https://www.gianvitorossi.com/it_it']

    def parse(self, response):
        # Extract product categories
        product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
        for product_page in product_pages:
            yield scrapy.Request(response.urljoin(product_page), callback=self.get_product_list)

    def get_product_list(self, response):
        # Extract product links
        product_links = response.xpath('//a[contains(@class, "b-product_tile-image_link")]/@href')
        for link in product_links:
            yield scrapy.Request(response.urljoin(link.extract()), callback=self.get_product_details)

        # Check for more product pages
        try:
            next_page = response.xpath('//a[data-event-click.prevent="loadMore"]/@href').extract()[0]
            yield scrapy.Request(response.urljoin(next_page), callback=self.get_product_list)
        except:
            pass

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }        
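
To reproduce the test (assuming Scrapy is installed), the spider can be saved to a file and run standalone; the filename below is just a placeholder:

scrapy runspider gianvitorossi_spider.py -o products.json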

The scraper runs correctly and retrieves all the items. But did writing the prompt take less time than writing the spider myself? Difficult to say. In both cases, I needed to write down the XPath selectors, so the AI didn’t save me time there. If we suppose we have a properly engineered prompt and a reusable Scrapy spider template, we can assume the time spent modifying the prompt or the scraper is roughly the same. The only improvement in the process is that ChatGPT wrote the scraper without any bugs.

Bugfixing with AI

Let’s test the usefulness of ChatGPT when we’re facing a bug in our scrapers. I’ll modify the previous spider by deleting a quote from an XPath and changing a variable name to one that was never initialized.

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()).extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code3,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }        

Will ChatGPT be able to fix it?

Yes. Not only did it return the corrected code, but it also explained what it fixed.

The error in the original code was a syntax error in the XPath selector used to extract the JSON data. The closing quotation mark was placed incorrectly, resulting in an invalid selector. The corrected code places the closing quotation mark in the correct location, allowing the selector to extract the JSON data correctly. Additionally, there was a typo in the name of the product_code field in the yield statement, which has been fixed in the corrected code.
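
For reference, the two corrections simply restore the lines of the original spider:

# Closing quotation mark restored inside the XPath string
json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()

# Typo fixed in the yield statement: product_code3 -> product_code
'product_code': product_code,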

This is a helpful way to use ChatGPT, since we sometimes spend a lot of time looking for the cause of these silly mistakes.


Porting to another language

Instead of creating a scraper with Python, I used the same prompt to create a Node.js scraper. I don’t know anything about JavaScript, so let’s see if ChatGPT can help me create my first scraper in Node.js.

Here’s the result:

const axios = require('axios');
const cheerio = require('cheerio');

const startUrl = 'https://www.gianvitorossi.com/it_it';

async function scrape(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product categories
    const productPages = $('a[role="menuitem"]').map(function() {
      return $(this).attr('href');
    }).get();

    for (const productPage of productPages) {
      await getProducts(`https://www.gianvitorossi.com${productPage}`);
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProducts(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product links
    const productLinks = $('a.b-product_tile-image_link').map(function() {
      return $(this).attr('href');
    }).get();

    for (const link of productLinks) {
      await getProductDetails(`https://www.gianvitorossi.com${link}`);
    }

    // Check for more product pages
    try {
      const nextPage = $('a[data-event-click.prevent="loadMore"]').attr('href');
      await getProducts(`https://www.gianvitorossi.com${nextPage}`);
    } catch (error) {
      // No more product pages
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProductDetails(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract JSON data
    const jsonText = $('script[type="application/ld+json"]')[0].children[0].data;
    const data = JSON.parse(jsonText);

    // Extract product details
    const productCode = data.mpn;
    const fullPrice = data.offers.price;
    const price = fullPrice;
    const productUrl = response.config.url;

    // Output data
    console.log({
      productCode: productCode,
      fullPrice: fullPrice,
      price: price,
      productUrl: productUrl
    });
  } catch (error) {
    console.error(error);
  }
}

scrape(startUrl);        
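
To run it (assuming Node.js is installed, and with scraper.js as a placeholder filename), the two dependencies have to be installed first:

npm install axios cheerio
node scraper.js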

This is great. The scraper is correct and works perfectly.

Final remarks

I have to admit I’m impressed by the results of these tests. While it’s true it didn’t save me much time when I wrote the Python scraper, it enabled me to write the Node.js one.

AI can lower the barriers to adopting a new language and help you develop scrapers while you’re still studying it, including helping to debug the code you write. It doesn’t substitute the good old hands-on practice, but it can help you learn faster.

In the end, AI at the moment is more an aid than a threat that could replace humans in the near future.



