Structured Data Extraction: Traditional CSS Selectors vs. OpenAI LLMs
Martin Khristi
AI & Machine Learning Advocate | BI & Data Specialist at CA Karrierepartner | Microsoft Fabric Enthusiast | Python for Data, AI & Time Series Forecasting | Supporter and Contributor in PandasAI
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.
1) Quick Start with Crawl4AI
Get started with Crawl4AI in just two steps:
pip install crawl4ai
Write a simple script to extract data:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
This script prints the page content in Markdown format, making it great for a quick summary.
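If you want to keep the Markdown for later review, a minimal standard-library sketch (the `markdown` variable here is a made-up stand-in for `result.markdown`):

```python
from pathlib import Path

# Stand-in for result.markdown from the crawl above
markdown = "# NBC News Business\n\nSample markdown from a crawl..."

# Persist the crawl output to disk and read it back
Path("page.md").write_text(markdown, encoding="utf-8")
print(Path("page.md").read_text(encoding="utf-8"))
```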
2) Extracting Data with CSS Selectors (Traditional Method)
For structured pages, CSS Selectors provide precision and speed. For example, extracting news headlines and links from a business news page:
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news():
    schema = {
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {"name": "headline", "selector": ".wide-tease-item__headline", "type": "text"},
            {"name": "link", "selector": "a[href]", "type": "attribute", "attribute": "href"},
        ],
    }
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
        )
        print(result.extracted_content)

asyncio.run(extract_news())
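The strategy typically returns `result.extracted_content` as a JSON string: a list of objects, one per `baseSelector` match. A minimal post-processing sketch using only the standard library (the sample payload below is made up for illustration):

```python
import json

# Hypothetical sample of what result.extracted_content could contain
sample = (
    '[{"headline": "Markets rally", "link": "/business/markets-rally"},'
    ' {"headline": "Fed holds rates", "link": "/business/fed-holds"}]'
)

items = json.loads(sample)  # parse the JSON string into a list of dicts
for item in items:
    print(f"{item['headline']} -> {item['link']}")
```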
Best For: Well-structured web pages with predictable HTML.
3) Extracting Data with OpenAI LLMs
For semantic or unstructured data, OpenAI LLMs can extract and structure data intelligently. Here's an example to extract pricing data from OpenAI's API pricing page:
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

async def extract_openai_pricing():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="Extract all models and their input/output fees.",
            ),
        )
        print(result.extracted_content)

asyncio.run(extract_openai_pricing())
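Because the extraction is schema-driven, the LLM's output can be re-validated through the same Pydantic model. A sketch using a made-up sample payload in place of a live crawl:

```python
import json
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

# Hypothetical stand-in for result.extracted_content
sample = (
    '[{"model_name": "gpt-4o",'
    ' "input_fee": "$2.50 / 1M tokens",'
    ' "output_fee": "$10.00 / 1M tokens"}]'
)

# Validate each row; raises ValidationError if a field is missing or mistyped
fees = [OpenAIModelFee(**row) for row in json.loads(sample)]
print(fees[0].model_name, fees[0].input_fee)
```

This catches hallucinated or malformed rows before they reach downstream code.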
Best For: Complex or unstructured pages requiring semantic understanding.
When to Use What?
Both methods are powerful, and the right one depends on your use case: CSS selectors are fast, cheap, and deterministic but only work on predictable HTML, while LLM extraction handles messy or semantic content at the cost of latency and token fees. Whether you're dealing with straightforward HTML or complex content, Crawl4AI has you covered. Which method fits your needs? Let's discuss!
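The trade-off can be summed up in a tiny, purely illustrative helper (these names and flags are not part of the Crawl4AI API):

```python
def pick_strategy(page_is_structured: bool, needs_semantics: bool) -> str:
    """Illustrative decision helper: CSS selectors for predictable HTML,
    an LLM strategy when meaning must be inferred from messy content."""
    if page_is_structured and not needs_semantics:
        return "JsonCssExtractionStrategy"  # fast, cheap, deterministic
    return "LLMExtractionStrategy"          # flexible, but slower and costs tokens

print(pick_strategy(True, False))
print(pick_strategy(False, True))
```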
#WebScraping #DataExtraction #OpenAI #Crawl4AI #Python #CSSSelectors #LLMs #TechTools
That's a wrap for today!