An Effective Content Extraction Workflow That I Trust

Extracting structured data from documents has always been a challenge for me as an AI developer. I've spent countless hours trying to get accurate information out of PDFs, Word docs, PowerPoint slides, spreadsheets, and even images. Each method I tried had its pitfalls. Some tools gave me jumbled text that lost the original formatting, while others completely missed out on tables or images. At times, I even resorted to manual copy-paste or overly complex scripts, which was frustrating and unsustainable. In short, getting reliable results from diverse documents felt like searching for a needle in a haystack.

I eventually realized that no single traditional library could tick all the boxes. One would extract text but lose the document structure; another might handle PDFs okay but choke on images or scanned PDFs (OCR was a separate headache). I wanted a solution that handled all file types in one workflow and delivered clean, structured output I could trust. After much trial and error, I finally stumbled upon a workflow that changed everything. It combines a powerful open-source tool with the intelligence of language models, and it works like a charm for my use cases. Let me break down this workflow and explain why I believe it’s the best approach for document extraction in AI projects.

In a nutshell, my workflow goes from raw documents to structured data in five clear steps. It involves converting any document (PDF, DOCX, PPTX, CSV, or image) into Markdown text using Docling, then using an LLM (or a RAG pipeline) to extract the needed information, and finally structuring that information as JSON using a Pydantic model. The figure below illustrates this pipeline step by step.



The Workflow (Step by Step)

  • Use Docling to process different file types: The first step is feeding the documents into Docling. Docling is an open-source library (from IBM) that can parse a wide range of formats – PDFs, Word documents, Excel sheets, PowerPoint slides, HTML, images, you name it – in a unified way. This is huge, because previously I had to use separate tools for each format. Docling handles them all under the hood, even performing OCR on scanned PDFs or images if needed. In practice, I point Docling to the file (or batch of files), and it extracts all the text content along with structural elements like headings, lists, tables, and even image captions.
  • Convert content to Markdown: A killer feature of Docling is that it exports the extracted content as Markdown. Why Markdown? Because it preserves the structure and formatting (like headings, italic text, bullet points, tables, etc.) in a plain-text format that is easy for Large Language Models (LLMs) to understand. Markdown strikes a great balance: it's human-readable and retains context (for example, showing that something was a table or a heading in the original document). In my experience, LLMs handle well-formatted Markdown much better than a blob of plain text that has no indication of structure. By converting everything to Markdown early on, we set the stage for the AI to grasp the document's layout and meaning more accurately.

from docling.document_converter import DocumentConverter

# Example of processing a PDF file with Docling
converter = DocumentConverter()
result = converter.convert("example.pdf")
markdown_text = result.document.export_to_markdown()

print(markdown_text)

  • Feed the Markdown to an LLM or RAG pipeline for extraction: Once I have the document content in Markdown, the next step is to extract the specific data I need using an LLM. Depending on the scenario, this could be a direct prompt to a Large Language Model (like GPT-4 or an open-source model) or part of a Retrieval-Augmented Generation (RAG) pipeline. In a simple case, I might prompt the LLM with the Markdown text (or chunks of it) and ask it to find the information I care about – for example, key insights in a report, specific fields from a form, or a summary of a presentation. The LLM, having the nicely formatted Markdown, can parse through it with greater comprehension. In cases where I have a lot of documents or very large ones, I might use a RAG approach: indexing the Markdown content (e.g., with a vector database) and then querying it so that the LLM only sees the most relevant chunks. Either way, the AI does the heavy lifting here in understanding the content and pulling out the data we're after. This step is where unstructured text becomes structured information via AI's understanding.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Sample prompt to extract data from the Markdown text
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": f"Extract key information from this document:\n{markdown_text}"},
    ],
)

extracted_data = response.choices[0].message.content
print(extracted_data)
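For the RAG variant mentioned above, the chunk-and-retrieve step can be sketched without a vector database at all. In this minimal illustration, simple keyword overlap stands in for embedding similarity, and the chunk size and sample document are purely illustrative; a real pipeline would use embeddings and a vector store instead.

```python
def chunk_markdown(markdown_text: str, max_chars: int = 500) -> list[str]:
    """Split Markdown into chunks, breaking on blank lines."""
    chunks, current = [], ""
    for block in markdown_text.split("\n\n"):
        if len(current) + len(block) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += block + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


def top_chunks(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query (a stand-in for
    embedding similarity in a real RAG pipeline)."""
    query_words = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]


# Illustrative document; max_chars is tiny so the split is visible
doc = "# Revenue\n\nTotal revenue was $5M.\n\n# Staff\n\nHeadcount grew to 40."
chunks = chunk_markdown(doc, max_chars=40)
relevant = top_chunks(chunks, "what was total revenue?", k=1)
print(relevant[0])
```

Only the retrieved chunk (not the whole document) then goes into the LLM prompt, which keeps token usage down for large document sets.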

  • Structure the output with a Pydantic BaseModel: To ensure the data extracted by the LLM is well-organized and follows a schema, I use a Pydantic BaseModel in Python as a blueprint. Pydantic is a library that makes it easy to define data models (think of them like schemas) and validate data against them. In this workflow, I define a Pydantic model that describes exactly what fields and structure I expect in the output. For example, if I'm extracting financial report data, I might define a model with fields like company_name, report_date, total_revenue, etc., with appropriate data types. After the LLM extracts information (often I instruct the LLM to output in JSON format adhering to this schema), I parse that output using the Pydantic model. This does two things: (a) it validates that the AI's output actually matches the schema (catching any errors or omissions), and (b) it automatically organizes the data into a Python object I can easily work with. Essentially, Pydantic acts as a safety net and organizer, converting the free-form extraction into a well-defined structure.

from pydantic import BaseModel

# Define the expected JSON structure using Pydantic
class DocumentData(BaseModel):
    company_name: str
    report_date: str
    total_revenue: float

# Validate the LLM's JSON output against the schema (Pydantic v2 API)
data = DocumentData.model_validate_json(extracted_data)
print(data.model_dump_json())
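The "instruct the LLM to output JSON adhering to this schema" part can also be sketched directly from the model. This is a minimal illustration, assuming Pydantic v2's model_json_schema(); the field names match the example above and the prompt wording is just one way to phrase it.

```python
import json
from pydantic import BaseModel

class DocumentData(BaseModel):
    company_name: str
    report_date: str
    total_revenue: float

# Embed the machine-readable schema in the extraction prompt so the
# LLM knows exactly which fields and types to produce.
schema = json.dumps(DocumentData.model_json_schema(), indent=2)
prompt_template = (
    "Extract the following fields from the document and reply with "
    f"JSON matching this schema only:\n{schema}\n\n"
    "Document:\n{markdown}"
)
print(prompt_template)
```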

  • Produce the final output in JSON format: The last step is getting a clean JSON output. By the time we've passed the LLM's Markdown-derived extraction through the Pydantic model, we have a structured object that can be serialized to JSON (Pydantic's model_dump_json() or model_dump() methods make this trivial). The end result is that for each document, I get a JSON document containing all the fields or information I wanted, neatly organized. This JSON output is gold for further processing: you can feed it into databases, use it in analytics pipelines, or send it to other services. Because JSON is a universal format, this data is immediately usable for downstream tasks, which is a huge win compared to the messy text I used to end up with in earlier approaches.
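Putting it all together, the five steps can be sketched as one small function. In this minimal sketch the Docling and LLM calls are replaced by stand-in functions (fake_markdown and fake_llm are illustrative assumptions, not real APIs) so the shape of the pipeline is visible on its own.

```python
from typing import Callable
from pydantic import BaseModel

class DocumentData(BaseModel):
    company_name: str
    report_date: str
    total_revenue: float

def run_pipeline(path: str,
                 to_markdown: Callable[[str], str],
                 extract: Callable[[str], str]) -> DocumentData:
    markdown = to_markdown(path)      # steps 1-2: parse file, get Markdown
    raw_json = extract(markdown)      # step 3: LLM extracts the fields
    return DocumentData.model_validate_json(raw_json)  # step 4: validate

# Stand-ins for Docling and the LLM, for demonstration only
fake_markdown = lambda p: "# Acme Corp\n\nReport date: 2024-01-31\nRevenue: 5000000"
fake_llm = lambda md: (
    '{"company_name": "Acme Corp", '
    '"report_date": "2024-01-31", '
    '"total_revenue": 5000000}'
)

result = run_pipeline("report.pdf", fake_markdown, fake_llm)
print(result.model_dump_json())       # step 5: clean JSON output
```

In the real pipeline, to_markdown would wrap Docling's DocumentConverter and extract would wrap the LLM call from earlier; everything else stays the same.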


Conclusion & Next Steps

Adopting this document extraction workflow has been a game-changer for me. It turned a formerly tedious and error-prone task into a smooth, reliable process. Instead of spending hours cleaning up outputs or wrestling with different libraries, I can now focus on what to do with the extracted data, which is the fun part for an AI developer. If you’ve been frustrated with document data extraction, I highly recommend giving this approach a try – it might save you from pulling out your hair as it did for me!

Moving forward, I plan to create a walkthrough video demonstrating this end to end. In the video, I'll take a sample set of documents and show how each step comes together – from running Docling to prompting the LLM and using Pydantic to get that perfect JSON output. Keep an eye out for that if you're interested in seeing the workflow in action.

Feel free to drop your thoughts or questions in the comments. Have you faced similar struggles with document extraction, and what solutions have you tried? I’m curious to hear others’ experiences. For me, this combination of Docling, Markdown, LLM, and Pydantic finally cracked the code, and I'm excited to see it help others as well. Here’s to making our AI development lives a little easier!


Frank Enendu - MSc