Structured Data Extraction: Traditional CSS Selectors vs. OpenAI LLMs


  1. Quick Start with Crawl4AI
  2. Extracting Data with CSS Selectors (Traditional Method)
  3. Extracting Data with OpenAI LLMs




Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.



1) Quick Start with Crawl4AI



Get started with Crawl4AI in just two steps:


pip install crawl4ai


Write a simple script to extract data:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())        


This script prints the page content in Markdown format, making it great for a quick summary.
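Since the output is plain Markdown, ordinary string processing is enough to boil it down. A minimal stdlib sketch that pulls out just the headings (the sample text here is a hypothetical stand-in for `result.markdown`):

```python
# Hypothetical stand-in for the `result.markdown` string produced above.
sample_markdown = """# NBC News Business
Intro paragraph.

## Markets rally on jobs report
Body text.

## Tech earnings preview
More body text.
"""

def markdown_headings(md: str) -> list[str]:
    # Keep lines that start with '#' markers, stripped of the marker itself.
    return [line.lstrip("#").strip() for line in md.splitlines() if line.startswith("#")]

print(markdown_headings(sample_markdown))
```

Feeding `result.markdown` through a helper like this gives you a one-glance outline of the page.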



2) Extracting Data with CSS Selectors (Traditional Method)

For structured pages, CSS Selectors provide precision and speed. For example, extracting news headlines and links from a business news page:




import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news():
    schema = {
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {"name": "headline", "selector": ".wide-tease-item__headline", "type": "text"},
            {"name": "link", "selector": "a[href]", "type": "attribute", "attribute": "href"},
        ],
    }
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
        )
        print(result.extracted_content)

asyncio.run(extract_news())        


Best For: Well-structured web pages with predictable HTML.
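`extracted_content` comes back as a JSON string with one object per element matched by `baseSelector`, so the last step is usually a `json.loads`. A short stdlib sketch (the payload below is a hypothetical stand-in for real crawl output):

```python
import json

# Hypothetical stand-in for the JSON string in result.extracted_content.
extracted_content = """[
    {"headline": "Markets rally on jobs report", "link": "/business/markets-rally"},
    {"headline": "Tech earnings preview", "link": "/business/tech-earnings"}
]"""

articles = json.loads(extracted_content)
for article in articles:
    # Each dict carries the field names declared in the schema above.
    print(f"{article['headline']} -> {article['link']}")
```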


3) Extracting Data with OpenAI LLMs



For semantic or unstructured data, OpenAI LLMs can extract and structure data intelligently. Here's an example to extract pricing data from OpenAI's API pricing page:


import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

async def extract_openai_pricing():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.model_json_schema(),  # Pydantic v2; use .schema() on v1
                extraction_type="schema",
                instruction="""Extract all models and their input/output fees."""
            ),
        )
        print(result.extracted_content)

asyncio.run(extract_openai_pricing())        


Best For: Complex or unstructured pages requiring semantic understanding.
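Because LLM output can drift, it pays to check the parsed JSON against the expected fields before using it downstream. In practice the Pydantic `OpenAIModelFee` model above handles this; here is a stdlib-only sketch of the same idea (the payload is hypothetical):

```python
import json

# Field names from the OpenAIModelFee schema defined above.
REQUIRED_FIELDS = {"model_name", "input_fee", "output_fee"}

def validate_fees(raw: str) -> list[dict]:
    # Reject any record that is missing a required schema field.
    records = json.loads(raw)
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record missing fields: {sorted(missing)}")
    return records

# Hypothetical stand-in for result.extracted_content.
raw = '[{"model_name": "gpt-4o", "input_fee": "$2.50 / 1M tokens", "output_fee": "$10.00 / 1M tokens"}]'
fees = validate_fees(raw)
print(fees[0]["model_name"])
```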


When to Use What?

  • CSS Selectors: Use for structured, predictable pages where HTML elements are clearly defined.
  • OpenAI LLMs: Use for unstructured, dynamic, or semantic data extraction.
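That decision can be captured in a tiny helper that picks a strategy name from a flag (the class names match those used above; the helper itself is illustrative):

```python
def choose_strategy(page_is_structured: bool) -> str:
    # Predictable HTML -> deterministic, cheap CSS extraction;
    # otherwise pay for an LLM pass that understands semantics.
    return "JsonCssExtractionStrategy" if page_is_structured else "LLMExtractionStrategy"

print(choose_strategy(True))
```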


Both methods are powerful, depending on your use case. Whether you're dealing with straightforward HTML or complex content, Crawl4AI has you covered. Which method fits your needs? Let's discuss!

#WebScraping #DataExtraction #OpenAI #Crawl4AI #Python #CSSSelectors #LLMs #TechTools



That's a wrap for today!



