Unlocking the power of web scraping to power your AI & software
Rohan Girdhani (The TechDoc)
I help you build software that your customers can’t ignore and systems that can support your next big milestone | DM if you are struggling with your tech, architecture, or security.
A tutorial that shows you how to crawl, extract, and process web data to feed, fine-tune, or train large language models.
Generative AI solutions begin with web scraping
I've been going on about big brain AI models for a while. They're cool and all, but let's not think they can do everything. Sometimes, people get too excited and think these AIs are magic.
The thing is, these AI models can't always give you new or correct info. When you need the latest facts, web scraping is what you want. It's like this: if AIs are reading from an old book, web scraping gives you today's news.
Note: as of September 27, 2023, GPT-4's knowledge is no longer limited to data before September 2021.
Web scraping isn't just a trick to teach those big AI brains; it's also the secret sauce that devs use to make them even smarter and more personal.
With web scraping tools (like the one in the upcoming guide), you can give your AI models a snack, tweak them to perfection, or even train them from scratch. You can also use them to give ChatGPT and its AI buddies some extra details to chew on. This magic trick is useful for loads of stuff.
Introducing Website Content Crawler for data ingestion
To feed and fine-tune LLMs, it's not enough to just scrape data. You need to process and clean it before you can use it for generative AI and machine learning. So, in this tutorial, I'm going to use Website Content Crawler, which was designed specifically for this purpose. This guide will demonstrate why WCC is useful for collecting data for LLMs.
Website Content Crawler is what Apify calls an Actor (a serverless cloud program). Actors can perform anything from a simple action, such as filling out a web form or sending an email, to complex operations, such as crawling an entire website and removing duplicates from a large dataset.
Like all Apify Actors, you can run WCC via the Apify Console UI, the Apify API, or the Apify client libraries and CLI.
If you're new to Apify, using the UI is the easiest way to test it out, so that's the method I'm going to use in this tutorial.
To use this tool and follow along with me, go to Website Content Crawler in Apify Store and click the Try for free button.
You'll need an Apify account. If you don't have one, you'll be prompted to sign up when you click that button.
Otherwise, you'll be taken straight to Apify Console (which is basically your dashboard), and you'll see the UI that I'm about to walk you through.
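If you'd rather skip the UI later on, here's a minimal sketch of starting the same Actor with the Apify API client for Python. The token placeholder is an assumption for illustration, and the input field names should be double-checked against the Actor's input schema.

from apify_client import ApifyClient

# Authenticate with your personal API token (Apify Console -> Settings -> Integrations).
client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start Website Content Crawler and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}],
    }
)

print("Run finished with status:", run["status"])
print("Results are stored in dataset:", run["defaultDatasetId"])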
How to collect web data for LLMs
1. Start URLs
I'm going to use the default input and scrape the Apify docs using the following start URL: https://docs.apify.com/academy/web-scraping-for-beginners.
In this case, the crawler will only crawl the links beginning with academy/.
You can add other URLs to the list, as well. These will be added to the crawler queue, and the Actor will process them one by one.
You can use the Text file option for batch processing if you have lots of URLs and want to crawl them all. You can either upload a file with one URL per line, or provide a URL pointing to such a file.
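For reference, here's roughly what those two input styles look like when passed through the API client. The requestsFromUrl field follows the format Apify request lists usually accept for a remote URL file, but treat the exact field names as assumptions and verify them against the Actor's input schema; the text-file URL is a hypothetical placeholder.

# Crawl one docs section plus a batch of URLs hosted in a plain-text file (one URL per line).
run_input = {
    "startUrls": [
        {"url": "https://docs.apify.com/academy/web-scraping-for-beginners"},
        {"requestsFromUrl": "https://example.com/my-url-list.txt"},  # hypothetical file URL
    ],
}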
2. Crawler settings
Crawler type
The default crawler type is Firefox. It can load most pages and is usually better at avoiding anti-bot blocking, but it’s the slowest option. Apify has set it as the default because it gets you the most consistent results. However, it requires more compute units and takes longer, and therefore costs more.
If you still need a browser to render client-side JavaScript, you can use headless Chrome instead. It's faster and requires less memory, but keep in mind that it's more easily detected by anti-bot protections.
Use the Raw HTTP client (Cheerio) if you don’t need client-side JavaScript rendering, as it is roughly 20 times faster than a full browser.
If you feel like experimenting, you could try the experimental Adaptive switching option. In this playwright:adaptive mode, Website Content Crawler detects pages that can be processed with plain HTTP crawling. This makes it faster than regular playwright mode while still being able to process dynamic content. In common cases, it should save you the time spent deciding between playwright, puppeteer, and cheerio.
If you're feeling adventurous, you could try the experimental JSDOM option. It's much faster than browsers and provides some JS execution support. However, at this point, JSDOM’s coverage of standard web APIs is still incomplete (see this ancient issue tracking the still missing fetch implementation). So I can't wholeheartedly recommend it.
Exclude URLs (globs)
By default, the crawler will visit all the web pages in the Start URLs field (plus all the linked pages - but only if their path prefixes match). However, there might be some you don’t want to visit. If that's the case, you can use the exclude URLs (globs) option.
You can also check if the glob matches what you want with the Test Glob button.
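As an illustration, a glob like the one below would skip everything under a hypothetical "advanced" subsection while still crawling the rest of the docs. The excludeUrlGlobs field name and the example path are assumptions on my part; check them against the Actor's input schema and your target site.

run_input = {
    "startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}],
    # Skip any page whose URL matches this glob pattern (hypothetical path).
    "excludeUrlGlobs": [{"glob": "https://docs.apify.com/academy/advanced-web-scraping/**"}],
}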
Initial cookies
Cookies are sometimes used to identify the user to the server they’re trying to access. You can use the initial cookies option if you want to access content behind a login or authenticate your crawler with the website you’re scraping. Here's a sketch of what that can look like.
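This sketch assumes you've copied a session cookie from your browser's dev tools after logging in. The cookie name, value, domain, and URL are placeholders; the name/value/domain structure follows the common browser cookie format, but confirm the expected shape in the Actor's documentation.

run_input = {
    "startUrls": [{"url": "https://example.com/members-only"}],  # hypothetical logged-in area
    "initialCookies": [
        {
            "name": "session_id",            # placeholder cookie name
            "value": "abc123-your-session",  # placeholder value copied from dev tools
            "domain": ".example.com",
        }
    ],
}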
3. HTML processing
There are two steps to HTML processing: a) waiting for content to load and b) processing the HTML from the web page (data cleaning). Although the UI doesn't strictly follow this order, I've decided to break it up this way: 3. HTML processing and 4. Data cleaning.
Wait for dynamic content
Some web pages use lazy loading, which means the page loads more content as you scroll down. In such cases, you can tell the crawler to wait for dynamic content to load. The crawler will wait for up to 10 seconds as long as the web page keeps changing.
Maximum scroll height
The maximum scroll height limits how far the crawler scrolls down before it starts processing the page. It's there to prevent infinite scrolling. Imagine an online store loading more and more products as you scroll, for example.
Remove cookie warnings
Once the content is loaded, the crawler can try to dismiss cookie consent modals. With the remove cookie warnings option, it will click them away and hide them. It's enabled by default.
Expand clickable elements
The expand clickable elements option lets you add a selector of elements the crawler should click on. If you don't set this, the Actor won't crawl any links hidden in collapsed content. So, use this option to scrape content from collapsed sections of the web page.
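Putting the HTML-processing options from this step together, a run input might look like the sketch below. The field names mirror the UI labels (wait time, scroll height, cookie warnings, clickable elements) but are assumptions on my part, as is the target URL; check the Actor's input schema for the exact spelling.

run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],  # hypothetical target
    "dynamicContentWaitSecs": 10,             # wait up to 10 s while the page keeps changing
    "maxScrollHeightPixels": 5000,            # stop scrolling after ~5000 px to avoid infinite feeds
    "removeCookieWarnings": True,             # auto-dismiss cookie consent modals
    "clickElementsCssSelector": '[aria-expanded="false"]',  # expand collapsed sections before extraction
}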
4. Data cleaning
Remove HTML elements
You can clean the data by removing HTML elements. These are the selectors of things you don’t want to include in your results (banners, ads, menus, alerts, and so on). The default setting covers most things, but you can add more to the list if you need to. This way, you'll have only the content you need to feed your language model.
HTML transformer
The HTML transformer tries to strip even more boilerplate from the page, but it can also remove useful parts of the content you want to extract. If you find that's happening after running the Actor, you can choose None instead.
Remove duplicate text lines
You can remove duplicate text lines if the crawler keeps extracting the same line again and again. Enable this if parts of footers or menus keep showing up in your output and you don’t want to hunt for the correct CSS selectors. The Actor strips the repeated content after 4 or 5 occurrences, which prevents saving the same information repeatedly and keeps the data clean.
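Here's a sketch of the data-cleaning options from this step, again with field names inferred from the UI labels rather than taken from the official schema, and with placeholder selectors and target URL:

run_input = {
    "startUrls": [{"url": "https://example.com/blog"}],  # hypothetical target
    # Strip navigation, footers, ads, and similar noise before saving the text.
    "removeElementsCssSelector": "nav, footer, header, .ad, .cookie-banner",
    # Switch to "none" if the default transformer strips content you actually want.
    "htmlTransformer": "readableText",
    # Drop lines that repeat across many pages (menus, footers) without hunting for selectors.
    "removeDuplicateTextLines": True,
}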
5. Output settings
You can save the data as HTML or Markdown or save screenshots if you're using a headless browser. The Save files option deserves some special attention, though.
If you choose Save files, the crawler inspects the web page, and whenever it sees a link that goes to, say, a PDF, Word doc, or Excel sheet, it will download it to the Apify key-value store.
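In the run input, the output options from this step might look like this (same caveat: the field names are my assumptions based on the UI labels, and the target URL is a placeholder):

run_input = {
    "startUrls": [{"url": "https://example.com/docs"}],  # hypothetical target
    "saveMarkdown": True,       # store the cleaned page text as Markdown
    "saveHtml": False,          # skip raw HTML to keep the dataset small
    "saveScreenshots": False,   # only meaningful with a browser-based crawler type
    "saveFiles": True,          # download linked PDFs, Word docs, and spreadsheets to the key-value store
}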
6. Running the Actor
With the UI, you can execute code with the click of a button (the Start button at the bottom of the screen).
While running, you'll see what the crawler is up to in the log and can check if it's experiencing any issues. You can abort the run at any point.
When the crawler has completed a successful run, you can retrieve the data from the output tab.
7. Storing the data
The results of the Actor are stored in the default Dataset associated with the Actor run, from where you can access them via the API and export them to formats like JSON, XML, or CSV.
With the UI, you need only click the Export results button to view or download the data in your preferred format.
By way of example, here's the data in JSON from the first of the 26 results I got from this demo run using the UI's default settings.
{
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {
        "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "loadedTime": "2023-08-01T09:48:51.180Z",
        "referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "depth": 0,
        "httpStatusCode": 200
    },
    "metadata": {
        "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "title": "Web scraping for beginners | Academy | Apify Documentation",
        "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
        "author": null,
        "keywords": null,
        "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Web scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners, a comprehensive, ........"
}
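If you'd rather pull these results programmatically instead of clicking Export results, a minimal sketch with the Apify Python client could look like this. It reuses the run from the earlier example; the token is a placeholder, and the "url" and "text" fields are taken from the output shown above.

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start the crawl and wait for it to finish; each dataset item is one crawled page.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com/academy/web-scraping-for-beginners"}]}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Keep only what an LLM pipeline typically needs: the URL and the cleaned text.
    print(item["url"], item["text"][:200])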
I would love to know if you are building some interesting software or AI using data scraping. And feel free to reach out if you have any questions.
I am working on a blueprint course that distills my extensive experience in developing and managing software, and that I believe will change how you approach both. Get early insights below.
Subscribe for early access: https://mailchi.mp/f6c30b8dcf37/software-blueprinting