An Effective Content Extraction Workflow That I Trust
Frank Enendu - MSc
AI Engineer | ML Engineer | Software Engineer | Unicorn Data Scientist
Extracting structured data from documents has always been a challenge for me as an AI developer. I've spent countless hours trying to get accurate information out of PDFs, Word docs, PowerPoint slides, spreadsheets, and even images. Each method I tried had its pitfalls. Some tools gave me jumbled text that lost the original formatting, while others completely missed out on tables or images. At times, I even resorted to manual copy-paste or overly complex scripts, which was frustrating and unsustainable. In short, getting reliable results from diverse documents felt like searching for a needle in a haystack.
I eventually realized that no single traditional library could tick all the boxes. One would extract text but lose the document structure; another might handle PDFs okay but choke on images or scanned PDFs (OCR was a separate headache). I wanted a solution that handled all file types in one workflow and delivered clean, structured output I could trust. After much trial and error, I finally stumbled upon a workflow that changed everything. It combines a powerful open-source tool with the intelligence of language models, and it works like a charm for my use cases. Let me break down this workflow and explain why I believe it’s the best approach for document extraction in AI projects.
In a nutshell, my workflow goes from raw documents to structured data in five clear steps. It involves converting any document (PDF, DOCX, PPTX, CSV, or image) into Markdown text using Docling, then using an LLM (or a RAG pipeline) to extract the needed information, and finally structuring that information as JSON using a Pydantic model. The figure below illustrates this pipeline step by step.
The Workflow (Step by Step)
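from docling.document_converter import DocumentConverter
# Example of processing a PDF file with Docling
converter = DocumentConverter()
result = converter.convert("example.pdf")
# Export the converted document as Markdown text
markdown_text = result.document.export_to_markdown()
print(markdown_text)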
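The article mentions a RAG pipeline as an alternative to prompting with the whole document. One way to prepare the Markdown for that is to split it into sections at each heading, so that every chunk keeps its title as context. This is a minimal sketch under that assumption, not part of the original workflow; `split_markdown_sections` is a hypothetical helper that uses only the standard library, and it does not try to skip `#` lines that appear inside code fences.

```python
import re

def split_markdown_sections(markdown_text: str) -> list[dict]:
    """Split Markdown into sections, one per ATX heading (#, ##, ...)."""
    sections = []
    current = {"heading": "", "body": []}

    def has_content(section: dict) -> bool:
        return bool(section["heading"]) or any(l.strip() for l in section["body"])

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the section collected so far.
            if has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {"heading": s["heading"], "text": "\n".join(s["body"]).strip()}
        for s in sections
    ]

# Made-up sample text, standing in for Docling's Markdown output.
sample = "Intro line\n\n# Summary\nRevenue grew.\n\n## Details\n| a | b |\n"
for section in split_markdown_sections(sample):
    print(repr(section["heading"]), "->", repr(section["text"]))
```

Text that appears before the first heading is kept as a section with an empty heading, so nothing from the document is dropped.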
from openai import OpenAI
# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
# Sample prompt to extract data from Markdown text
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Extract key information from this document:\n{markdown_text}"}],
)
extracted_data = response.choices[0].message.content
print(extracted_data)
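Because the validation step later in the article expects JSON, a generic "extract key information" prompt may return prose that cannot be parsed. One possible refinement, sketched here as an assumption rather than the author's exact prompt, is to name the expected keys in the prompt itself. `build_extraction_prompt` is a hypothetical helper; the field names are taken from the article's `DocumentData` model, and the sample document text is made up.

```python
import json

def build_extraction_prompt(markdown_text: str, fields: dict[str, str]) -> str:
    """Build a prompt that asks for a JSON object with exactly the given keys."""
    schema_hint = json.dumps(fields, indent=2)
    return (
        "Extract the following fields from the document below.\n"
        "Reply with a single JSON object and nothing else, using exactly "
        "these keys (the values show the expected type):\n"
        f"{schema_hint}\n\n"
        "Document (Markdown):\n"
        f"{markdown_text}"
    )

prompt = build_extraction_prompt(
    "# Annual Report\nSample text standing in for the converted document.",
    {"company_name": "string", "report_date": "string", "total_revenue": "number"},
)
print(prompt)
```

The resulting string can be passed as the `content` of the user message in place of the generic prompt.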
from pydantic import BaseModel
# Define the expected JSON structure using Pydantic
class DocumentData(BaseModel):
    company_name: str
    report_date: str
    total_revenue: float
# Example of validating LLM output (Pydantic v2 API; extracted_data must be a JSON string with these fields)
data = DocumentData.model_validate_json(extracted_data)
print(data.model_dump_json())
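The validation step above expects `extracted_data` to be pure JSON, but chat models sometimes wrap the JSON in a Markdown code fence or add a sentence before it. A small pre-processing step can make validation more forgiving. The sketch below is an assumption on my part, not something the original workflow specifies; `extract_json_object` is a hypothetical helper built on the standard library's `json` module, and the sample reply is made up.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker

def extract_json_object(llm_output: str) -> dict:
    """Return the first JSON object found in the model's reply."""
    text = llm_output.strip()
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value beginning at `start` and
    # ignores anything that follows it (closing fence, extra prose).
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj

# A made-up reply in which the JSON is wrapped in a fence and some prose.
reply = (
    "Here is the data:\n"
    + FENCE
    + 'json\n{"company_name": "Acme", "total_revenue": 1250000.0}\n'
    + FENCE
)
data_dict = extract_json_object(reply)
print(data_dict)
```

The returned dictionary can then be handed to the Pydantic model, for example with `DocumentData(**data_dict)`. Note that this sketch looks only at the first `{` in the reply, so it raises an error if the model writes a stray brace in the prose before the JSON.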
Conclusion & Next Steps
Adopting this document extraction workflow has been a game-changer for me. It turned a formerly tedious and error-prone task into a smooth, reliable process. Instead of spending hours cleaning up outputs or wrestling with different libraries, I can now focus on what to do with the extracted data, which is the fun part for an AI developer. If you’ve been frustrated with document data extraction, I highly recommend giving this approach a try – it might save you from pulling out your hair as it did for me!
Moving forward, I plan to create a walkthrough video demonstrating this end to end. In the video, I'll take a sample set of documents and show how each step comes together – from running Docling to prompting the LLM and using Pydantic to get that perfect JSON output. Keep an eye out for that if you're interested in seeing the workflow in action.
Feel free to drop your thoughts or questions in the comments. Have you faced similar struggles with document extraction, and what solutions have you tried? I’m curious to hear others’ experiences. For me, this combination of Docling, Markdown, LLM, and Pydantic finally cracked the code, and I'm excited to see it help others as well. Here’s to making our AI development lives a little easier!