Structured Data Extraction: Traditional CSS Selectors vs. OpenAI LLMs
Martin Khristi
AI & Machine Learning Advocate | BI & Data Specialist at CA Karrierepartner | Microsoft Fabric Enthusiast | Python for Data, AI & Time Series Forecasting | Supporter and Contributor in PandasAI
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.
1) Quick Start with Crawl4AI
Get started with Crawl4AI in just two steps:
pip install crawl4ai
Write a simple script to extract data:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
This script prints the page content in Markdown format, making it great for a quick summary.
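If you want to keep the Markdown for later review, a minimal standard-library sketch (the `markdown` variable here is a made-up stand-in for `result.markdown`):

```python
from pathlib import Path

# Stand-in for result.markdown from the crawl above
markdown = "# NBC News Business\n\nSample markdown from a crawl..."

# Persist the crawl output to disk and read it back
Path("page.md").write_text(markdown, encoding="utf-8")
print(Path("page.md").read_text(encoding="utf-8"))
```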
2) Extracting Data with CSS Selectors (Traditional Method)
For structured pages, CSS Selectors provide precision and speed. For example, extracting news headlines and links from a business news page:
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news():
    schema = {
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {"name": "headline", "selector": ".wide-tease-item__headline", "type": "text"},
            {"name": "link", "selector": "a[href]", "type": "attribute", "attribute": "href"},
        ],
    }
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
        )
        print(result.extracted_content)

asyncio.run(extract_news())
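The strategy typically returns `result.extracted_content` as a JSON string: a list of objects, one per `baseSelector` match. A minimal post-processing sketch using only the standard library (the sample payload below is made up for illustration):

```python
import json

# Hypothetical sample of what result.extracted_content could contain
sample = (
    '[{"headline": "Markets rally", "link": "/business/markets-rally"},'
    ' {"headline": "Fed holds rates", "link": "/business/fed-holds"}]'
)

items = json.loads(sample)  # parse the JSON string into a list of dicts
for item in items:
    print(f"{item['headline']} -> {item['link']}")
```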
Best For: Well-structured web pages with predictable HTML.
3) Extracting Data with OpenAI LLMs
For semantic or unstructured data, OpenAI LLMs can extract and structure data intelligently. Here's an example to extract pricing data from OpenAI's API pricing page:
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

async def extract_openai_pricing():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv("OPENAI_API_KEY"),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="Extract all models and their input/output fees.",
            ),
        )
        print(result.extracted_content)

asyncio.run(extract_openai_pricing())
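Because the extraction is schema-driven, the LLM's output can be re-validated through the same Pydantic model. A sketch using a made-up sample payload in place of a live crawl:

```python
import json
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens.")
    output_fee: str = Field(..., description="Fee for output tokens.")

# Hypothetical stand-in for result.extracted_content
sample = (
    '[{"model_name": "gpt-4o",'
    ' "input_fee": "$2.50 / 1M tokens",'
    ' "output_fee": "$10.00 / 1M tokens"}]'
)

# Validate each row; raises ValidationError if a field is missing or mistyped
fees = [OpenAIModelFee(**row) for row in json.loads(sample)]
print(fees[0].model_name, fees[0].input_fee)
```

This catches hallucinated or malformed rows before they reach downstream code.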
Best For: Complex or unstructured pages requiring semantic understanding.
When to Use What?
Both methods are powerful, and the right one depends on your use case: CSS selectors are fast, cheap, and deterministic but only work on predictable HTML, while LLM extraction handles messy or semantic content at the cost of latency and token fees. Whether you're dealing with straightforward HTML or complex content, Crawl4AI has you covered. Which method fits your needs? Let's discuss!
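The trade-off can be summed up in a tiny, purely illustrative helper (these names and flags are not part of the Crawl4AI API):

```python
def pick_strategy(page_is_structured: bool, needs_semantics: bool) -> str:
    """Illustrative decision helper: CSS selectors for predictable HTML,
    an LLM strategy when meaning must be inferred from messy content."""
    if page_is_structured and not needs_semantics:
        return "JsonCssExtractionStrategy"  # fast, cheap, deterministic
    return "LLMExtractionStrategy"          # flexible, but slower and costs tokens

print(pick_strategy(True, False))
print(pick_strategy(False, True))
```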
#WebScraping #DataExtraction #OpenAI #Crawl4AI #Python #CSSSelectors #LLMs #TechTools
That's a wrap for today!