Writing a web scraper with ChatGPT. Is it a good idea?

In November, after OpenAI released ChatGPT, based on GPT-3, the news was literally everywhere. I also wrote on that occasion about AI and web scraping and, after that, in every interview I’ve done I’ve asked my interlocutors for their point of view on the state of AI in the web scraping industry.

Five months later, we have GPT-4, and tons of applications have been built on top of GPT models, so it’s time to take a closer look at AI for web scraping.

Can AI write scrapers for us?

At the moment, we cannot expect ChatGPT to write a fully working scraper for a chosen website. It will return a syntactically correct scraper, but with generic selectors that are not useful for our case. If we ask it to scrape some well-known website, it might return a correct mapping, but only if the answer was already given somewhere like Stack Overflow in the past.
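For example, when asked for a generic product scraper with no details about the target page, it typically returns a skeleton of this kind, with placeholder selectors you would still need to replace yourself (an illustrative sketch, not ChatGPT’s verbatim output):

import scrapy

class GenericProductSpider(scrapy.Spider):
    name = 'generic_products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Placeholder selectors: syntactically valid, but they won't match
        # anything until replaced with the real ones for the target site
        for product in response.xpath('//div[@class="product"]'):
            yield {
                'name': product.xpath('.//h2/text()').get(),
                'price': product.xpath('.//span[@class="price"]/text()').get(),
            }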

Given that, let’s try to build a scraper from scratch with ChatGPT for a niche website such as https://www.gianvitorossi.com/it_it/

I’ll go straight to the final prompt I wrote to get a correct Scrapy spider, where I asked it to map four fields and described to ChatGPT the crawling plan for the website.

Create a Python web scraper using the Scrapy framework and XPath selectors. The start_urls parameter should contain the URL https://www.gianvitorossi.com/it_it to be scraped.
After scraping the first page, the scraper should enter a function called get_categories.
In this function the scraper extracts all the relative links to different product types, using the following XPath selector: product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
For each product page, it should enter a function called get_product_list.
Extract the list of products on the page using the XPath selector '//a[contains(@class, "b-product_tile-image_link")]/@href'.
If there are more products to be shown, crawl to the following product page using the XPath ('//a[data-event-click.prevent="loadMore"]/@href').extract()[0], which contains the absolute path to the next page, and include this selector in a try/except clause. If there is no match for the selector, a pass instruction should be used in the except clause.
For each product extracted, enter a function called get_product_details.
Write the logic to scrape the following fields:
- product code
- full_price
- price
- product_url
The product detail page contains a JSON containing most of the data we need. It can be found by looking for the first occurrence of the script type="application/ld+json" tag and extracting the text from there. It follows the standard Product type from Schema.org (https://schema.org/), and you can map the field product_code to the JSON's mpn field.
The field price can be mapped by entering the offers dict in the JSON and selecting the price field.
The product_url field can instead be mapped using the response.url value.
Since there are no discounts on this website, the full_price field can be mapped to the price field.
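
To make the mapping concrete, this is roughly the shape of the JSON-LD block the prompt refers to (a trimmed, illustrative example of Schema.org Product markup with made-up values, not the site’s actual data):

{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Example pump",
  "mpn": "G12345",
  "offers": {
    "@type": "Offer",
    "price": "750.00",
    "priceCurrency": "EUR"
  }
}

Here product_code comes from mpn, while price (and, on this site, full_price) comes from offers.price.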

And here’s the code given in the response:

import scrapy
import json

class GianvitoRossiSpider(scrapy.Spider):
    name = 'gianvitorossi'
    start_urls = ['https://www.gianvitorossi.com/it_it']

    def parse(self, response):
        # Extract product categories
        product_pages = response.xpath('//a[@role="menuitem"]/@href').extract()
        for product_page in product_pages:
            yield scrapy.Request(response.urljoin(product_page), callback=self.get_product_list)

    def get_product_list(self, response):
        # Extract product links
        product_links = response.xpath('//a[contains(@class, "b-product_tile-image_link")]/@href')
        for link in product_links:
            yield scrapy.Request(response.urljoin(link.extract()), callback=self.get_product_details)

        # Check for more product pages
        try:
            next_page = response.xpath('//a[data-event-click.prevent="loadMore"]/@href').extract()[0]
            yield scrapy.Request(response.urljoin(next_page), callback=self.get_product_list)
        except:
            pass

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }        
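
To reproduce the test (assuming Scrapy is installed), the spider can be saved to a file and run standalone; the filename below is just a placeholder:

scrapy runspider gianvitorossi_spider.py -o products.json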

The scraper runs correctly and retrieves all the items. But did writing the prompt take less time than writing the spider myself? Difficult to say. In both cases, I needed to write down the XPath selectors, so the AI didn’t save me time there. If we suppose we have a properly engineered prompt and a reusable Scrapy spider template, we can assume the time spent modifying the prompt or the scraper is roughly the same. The only improvement in the process is that ChatGPT wrote the scraper without any bugs.

Bugfixing with AI

Let’s test the usefulness of ChatGPT when we’re facing a bug in our scrapers. I’ll modify the previous spider by deleting a quote from an XPath and changing a variable name to one that was never initialized.

    def get_product_details(self, response):
        # Extract JSON data
        json_data = response.xpath('//script[@type="application/ld+json"]/text()).extract_first()
        data = json.loads(json_data)

        # Extract product details
        product_code = data['mpn']
        full_price = data['offers']['price']
        price = full_price
        product_url = response.url

        # Return item
        yield {
            'product_code': product_code3,
            'full_price': full_price,
            'price': price,
            'product_url': product_url
        }        

Will ChatGPT be able to fix it?

Yes. Not only did it return the corrected code, but it also explained what it fixed.

The error in the original code was a syntax error in the XPath selector used to extract the JSON data. The closing quotation mark was placed incorrectly, resulting in an invalid selector. The corrected code places the closing quotation mark in the correct location, allowing the selector to extract the JSON data correctly. Additionally, there was a typo in the name of the product_code field in the yield statement, which has been fixed in the corrected code.
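
For reference, the two corrections simply restore the lines of the original spider:

# Closing quotation mark restored inside the XPath string
json_data = response.xpath('//script[@type="application/ld+json"]/text()').extract_first()

# Typo fixed in the yield statement: product_code3 -> product_code
'product_code': product_code,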

This is a helpful way to use ChatGPT, since we sometimes spend a lot of time looking for the cause of these silly mistakes.


Porting to another language

Instead of creating a scraper with Python, I used the same prompt to create a Node.js scraper. I don’t know anything about JavaScript, so let’s see if ChatGPT can help me create my first scraper in Node.js.

Here’s the result:

const axios = require('axios');
const cheerio = require('cheerio');

const startUrl = 'https://www.gianvitorossi.com/it_it';

async function scrape(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product categories
    const productPages = $('a[role="menuitem"]').map(function() {
      return $(this).attr('href');
    }).get();

    for (const productPage of productPages) {
      await getProducts(`https://www.gianvitorossi.com${productPage}`);
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProducts(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract product links
    const productLinks = $('a.b-product_tile-image_link').map(function() {
      return $(this).attr('href');
    }).get();

    for (const link of productLinks) {
      await getProductDetails(`https://www.gianvitorossi.com${link}`);
    }

    // Check for more product pages
    try {
      const nextPage = $('a[data-event-click.prevent="loadMore"]').attr('href');
      await getProducts(`https://www.gianvitorossi.com${nextPage}`);
    } catch (error) {
      // No more product pages
    }
  } catch (error) {
    console.error(error);
  }
}

async function getProductDetails(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    // Extract JSON data
    const jsonText = $('script[type="application/ld+json"]')[0].children[0].data;
    const data = JSON.parse(jsonText);

    // Extract product details
    const productCode = data.mpn;
    const fullPrice = data.offers.price;
    const price = fullPrice;
    const productUrl = response.config.url;

    // Output data
    console.log({
      productCode: productCode,
      fullPrice: fullPrice,
      price: price,
      productUrl: productUrl
    });
  } catch (error) {
    console.error(error);
  }
}

scrape(startUrl);        
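
To run it (assuming Node.js is installed, and with scraper.js as a placeholder filename), the two dependencies have to be installed first:

npm install axios cheerio
node scraper.js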

This is great. The scraper is correct and works perfectly.

Final remarks

I have to admit I’m impressed by the results of these tests. While it’s true it didn’t save me much time when I wrote the Python scraper, it enabled me to write the Node.js one.

AI can lower the barriers to adopting a new language and help you develop scrapers while you’re still studying it, including helping to debug the code you write. It doesn’t substitute the good old hands-on practice, but it can help you learn faster.

In the end, AI at the moment is more an aid than a threat that could replace humans in the near future.



