Unlocking the power of web scraping to power your AI & software
Rohan Girdhani (The TechDoc)
I help you build software that your customers can’t ignore and systems that can support your next big milestone | DM if you are struggling with your tech, architecture, or security.
A tutorial that shows you how to crawl, extract, and process web data to feed, fine-tune, or train large language models.
Generative AI solutions begin with web scraping
I've been going on about big brain AI models for a while. They're cool and all, but let's not think they can do everything. Sometimes, people get too excited and think these AIs are magic.
The thing is, these AI models can't always give you new or correct info. When you need the latest facts, web scraping is what you want. It's like this: if AIs are reading from an old book, web scraping gives you today's news.
Note: as of September 27, 2023, GPT-4's knowledge is no longer limited to data before September 2021.
Web scraping isn't just a trick to teach those big AI brains; it's also the secret sauce that devs use to make them even smarter and more personal.
With web scraping tools (like the one in the upcoming guide), you can give your AI models a snack, tweak them to perfection, or even train them from scratch. You can also use them to give ChatGPT and its AI buddies some extra details to chew on. This magic trick is useful for loads of stuff.
Introducing Website Content Crawler for data ingestion
To feed and fine-tune LLMs, it's not enough to just scrape data. You need to process and clean it before you can use it for generative AI and machine learning. So, in this tutorial, I'm going to use Website Content Crawler, which was designed specifically for this purpose. This guide will demonstrate why WCC is useful for collecting data for LLMs.
Website Content Crawler is what Apify calls an Actor (a serverless cloud program). Actors can perform anything from a simple action, such as filling out a web form or sending an email, to complex operations, such as crawling an entire website and removing duplicates from a large dataset.
Like all Apify Actors, you can run WCC via the Apify Console UI, the Apify API, or the Apify client libraries and CLI.
If you're new to Apify, using the UI is the easiest way to test it out, so that's the method I'm going to use in this tutorial.
To use this tool and follow along with me, go to Website Content Crawler in Apify Store and click the Try for free button.
You'll need an Apify account. If you don't have one, you'll be prompted to sign up when you click that button.
Otherwise, you'll be taken straight to Apify Console (which is basically your dashboard), and you'll see the UI that I'm about to walk you through.
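If you'd rather skip the UI later on, here's a minimal sketch of starting the same Actor with the Apify API client for Python. The token placeholder is an assumption for illustration, and the input field names should be double-checked against the Actor's input schema.

from apify_client import ApifyClient

# Authenticate with your personal API token (Apify Console -> Settings -> Integrations).
client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start Website Content Crawler and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}],
    }
)

print("Run finished with status:", run["status"])
print("Results are stored in dataset:", run["defaultDatasetId"])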
How to collect web data for LLMs
1. Start URLs
I'm going to use the default input and scrape the Apify docs using the following start URL: https://docs.apify.com/academy/web-scraping-for-beginners.
In this case, the crawler will only crawl the links beginning with academy/.
You can add other URLs to the list, as well. These will be added to the crawler queue, and the Actor will process them one by one.
You can use the Text file option for batch processing if you have lots of URLs and want to crawl them all. You can either upload a file with one URL per line, or provide a URL pointing to such a file.
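For reference, here's roughly what those two input styles look like when passed through the API client. The requestsFromUrl field follows the format Apify request lists usually accept for a remote URL file, but treat the exact field names as assumptions and verify them against the Actor's input schema; the text-file URL is a hypothetical placeholder.

# Crawl one docs section plus a batch of URLs hosted in a plain-text file (one URL per line).
run_input = {
    "startUrls": [
        {"url": "https://docs.apify.com/academy/web-scraping-for-beginners"},
        {"requestsFromUrl": "https://example.com/my-url-list.txt"},  # hypothetical file URL
    ],
}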
2. Crawler settings
Crawler type
The default crawler type is Firefox. It can load most pages and is usually better at avoiding anti-bot blocking, but it’s the slowest option. Apify has set it as the default because it gets you the most consistent results. However, it requires more compute units and takes longer, and therefore costs more.
If you still need a browser to render client-side JavaScript, you can use headless Chrome instead. It's faster and requires less memory, but keep in mind that it's more easily detected by anti-bot protections.
Use the Raw HTTP client (Cheerio) if you don’t need client-side JavaScript rendering, as it is roughly 20 times faster than a full browser.
If you feel like experimenting, you could try the experimental Adaptive switching option. In this playwright:adaptive mode, Website Content Crawler detects pages that can be processed with plain HTTP crawling. This makes it faster than regular playwright mode while still being able to process dynamic content. In common cases, it should save you the time spent deciding between playwright, puppeteer, and cheerio.
If you're feeling adventurous, you could try the experimental JSDOM option. It's much faster than browsers and provides some JS execution support. However, at this point, JSDOM’s coverage of standard web APIs is still incomplete (see this ancient issue tracking the still missing fetch implementation). So I can't wholeheartedly recommend it.
Exclude URLs (globs)
By default, the crawler will visit all the web pages in the Start URLs field (plus all the linked pages - but only if their path prefixes match). However, there might be some you don’t want to visit. If that's the case, you can use the exclude URLs (globs) option.
You can also check if the glob matches what you want with the Test Glob button.
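As an illustration, a glob like the one below would skip everything under a hypothetical "advanced" subsection while still crawling the rest of the docs. The excludeUrlGlobs field name and the example path are assumptions on my part; check them against the Actor's input schema and your target site.

run_input = {
    "startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}],
    # Skip any page whose URL matches this glob pattern (hypothetical path).
    "excludeUrlGlobs": [{"glob": "https://docs.apify.com/academy/advanced-web-scraping/**"}],
}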
Initial cookies
Cookies are sometimes used to identify the user to the server they’re trying to access. You can use the initial cookies option if you want to access content behind a login or authenticate your crawler with the website you’re scraping. Here's a sketch of what that can look like.
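This sketch assumes you've copied a session cookie from your browser's dev tools after logging in. The cookie name, value, domain, and URL are placeholders; the name/value/domain structure follows the common browser cookie format, but confirm the expected shape in the Actor's documentation.

run_input = {
    "startUrls": [{"url": "https://example.com/members-only"}],  # hypothetical logged-in area
    "initialCookies": [
        {
            "name": "session_id",            # placeholder cookie name
            "value": "abc123-your-session",  # placeholder value copied from dev tools
            "domain": ".example.com",
        }
    ],
}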
3. HTML processing
There are two steps to HTML processing: a) waiting for content to load and b) processing the HTML from the web page (data cleaning). Although the UI doesn't strictly follow this order, I've decided to break it up this way: 3. HTML processing and 4. Data cleaning.
Wait for dynamic content
Some web pages use lazy loading, which means the page loads more content as you scroll down. In such cases, you can tell the crawler to wait for dynamic content to load. The crawler will wait for up to 10 seconds as long as the web page keeps changing.
Maximum scroll height
The maximum scroll height limits how far the crawler scrolls down before it starts processing the page. It's there to prevent infinite scrolling. Imagine an online store loading more and more products as you scroll, for example.
Remove cookie warnings
Once the content is loaded, the crawler can try to dismiss cookie consent modals. With the remove cookie warnings option, it will click them away and hide them. It's enabled by default.
Expand clickable elements
The expand clickable elements option lets you add a selector of elements the crawler should click on. If you don't set this, the Actor won't crawl any links hidden in collapsed content. So, use this option to scrape content from collapsed sections of the web page.
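Putting the HTML-processing options from this step together, a run input might look like the sketch below. The field names mirror the UI labels (wait time, scroll height, cookie warnings, clickable elements) but are assumptions on my part, as is the target URL; check the Actor's input schema for the exact spelling.

run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],  # hypothetical target
    "dynamicContentWaitSecs": 10,             # wait up to 10 s while the page keeps changing
    "maxScrollHeightPixels": 5000,            # stop scrolling after ~5000 px to avoid infinite feeds
    "removeCookieWarnings": True,             # auto-dismiss cookie consent modals
    "clickElementsCssSelector": '[aria-expanded="false"]',  # expand collapsed sections before extraction
}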
4. Data cleaning
Remove HTML elements
You can clean the data by removing HTML elements. These are the selectors of things you don’t want to include in your results (banners, ads, menus, alerts, and so on). The default setting covers most things, but you can add more to the list if you need to. This way, you'll have only the content you need to feed your language model.
HTML transformer
The HTML transformer tries to strip even more boilerplate from the page, but it can also remove useful parts of the content you want to extract. If you find that's happening after running the Actor, you can choose None instead.
Remove duplicate text lines
You can remove duplicate text lines if the crawler keeps extracting the same line again and again. Enable this if parts of footers or menus keep showing up in your output and you don’t want to hunt for the correct CSS selectors. The Actor strips the repeated content after 4 or 5 occurrences, which prevents saving the same information repeatedly and keeps the data clean.
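Here's a sketch of the data-cleaning options from this step, again with field names inferred from the UI labels rather than taken from the official schema, and with placeholder selectors and target URL:

run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],  # hypothetical target
    # Strip navigation, footers, ads, and similar noise before saving the text.
    "removeElementsCssSelector": "nav, footer, header, .ad, .cookie-banner",
    # Switch to "none" if the default transformer strips content you actually want.
    "htmlTransformer": "readableText",
    # Drop lines that repeat across many pages (menus, footers) without hunting for selectors.
    "removeDuplicateTextLines": True,
}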
5. Output settings
You can save the data as HTML or Markdown or save screenshots if you're using a headless browser. The Save files option deserves some special attention, though.
If you choose Save files, the crawler inspects the web page, and whenever it sees a link that goes to, say, a PDF, Word doc, or Excel sheet, it will download it to the Apify key-value store.
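In the run input, the output options from this step might look like this (same caveat: the field names are my assumptions based on the UI labels, and the target URL is a placeholder):

run_input = {
    "startUrls": [{"url": "https://example.com/docs"}],  # hypothetical target
    "saveMarkdown": True,       # store the cleaned page text as Markdown
    "saveHtml": False,          # skip raw HTML to keep the dataset small
    "saveScreenshots": False,   # only meaningful with a browser-based crawler type
    "saveFiles": True,          # download linked PDFs, Word docs, and spreadsheets to the key-value store
}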
6. Running the Actor
With the UI, you can execute code with the click of a button (the Start button at the bottom of the screen).
While running, you'll see what the crawler is up to in the log and can check if it's experiencing any issues. You can abort the run at any point.
When the crawler has completed a successful run, you can retrieve the data from the output tab.
7. Storing the data
The results of the Actor are stored in the default Dataset associated with the Actor run, from where you can access them via the API and export them to formats like JSON, XML, or CSV.
With the UI, you need only click the Export results button to view or download the data in your preferred format.
By way of example, here's the data in JSON from the first of the 26 results I got from this demo run using the UI's default settings.
{
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {
        "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "loadedTime": "2023-08-01T09:48:51.180Z",
        "referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "depth": 0,
        "httpStatusCode": 200
    },
    "metadata": {
        "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "title": "Web scraping for beginners | Academy | Apify Documentation",
        "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
        "author": null,
        "keywords": null,
        "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Web scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, ........"
}
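If you'd rather pull these results programmatically instead of clicking Export results, a minimal sketch with the Apify Python client could look like this. It reuses the run from the earlier example; the token is a placeholder, and the "url" and "text" fields are taken from the output shown above.

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start the crawl and wait for it to finish; each dataset item is one crawled page.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}]}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Keep only what an LLM pipeline typically needs: the URL and the cleaned text.
    print(item["url"], item["text"][:200])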
I would love to know if you are building some interesting software or AI using data scraping. And feel free to reach out if you have any questions.
I am working on a blueprint course that distills my extensive experience in developing and managing software, and that I believe will change how you approach both. Get early insights below.
Subscribe for early access: https://mailchi.mp/f6c30b8dcf37/software-blueprinting