Building Document Parsing Pipelines with Python

Building Document Parsing Pipelines with Python

Why Parse Documents?

Recently, I've been working with various document parsing challenges at work. Our systems generate many types of documents daily - text files, Word documents, Excel spreadsheets, and PDFs. These documents often contain valuable information that can improve data-driven decision-making process.

Building automated document parsing pipelines can streamline the extraction of important information from countless files. These pipelines are systems that automatically extract and process document data, transforming it into usable formats.

Imagine having a tool that reads through all your files, identifies critical information, and organizes it systematically. Wouldn't that be valuable?

In this article, I'll introduce you to a package I recently discovered: Docling by IBM. It's a powerful solution for parsing documents and converting them into various formats including JSON, Markdown, tables, and plain text.

What is Docling?

As mentioned earlier, Docling is a Python library that parses various document formats - including PDF, DOCX, HTML, Markdown, and PPTX files - and exports their content into structured formats like JSON.

Key Features:

  • Multi-Format Support: Processes multiple input formats and exports to JSON and Markdown
  • Advanced Document Understanding: Interprets page layouts, reading order, and table structures
  • Metadata Extraction: Retrieves document properties including titles, authors, references, and languages
  • Optional OCR Support: Converts scanned documents to machine-readable text
  • Table Extraction: Identifies and exports table structures to CSV or HTML formats

Learn more about Docling

Practical Examples

Let's explore several examples of extracting information from documents using the Docling library.

Use Case 1 - Converting Documents to JSON

For this example, we'll use a PDF from arXiv.org (2410.23335).

Follow these steps:

Install the Docling library:

!pip install docling        

Import and set up document converter:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()        

Next, specify the file path and convert the document:

source = "https://arxiv.org/pdf/2410.23335"
result = converter.convert(source)        

When you run this code for the first time, Docling will download its model artifacts. The initial setup takes approximately 2.5 minutes, but this is a one-time process. Subsequent runs will use the cached models.


Downloading Model Artifacts

Once the model artifacts are loaded, export the document to a dictionary format and dump it into JSON:

import json

result_dict = result.document.export_to_dict()

print(json.dumps(result_dict, indent=2))        

This generates a comprehensive dictionary containing the document's structured data, which can be easily converted to JSON format. The dictionary includes the document's content, structure, and metadata in a hierarchical format.

Formatting the output JSON reveals the document's structured data:


JSON

The structured JSON output can be integrated into your data engineering or analytics pipelines for further processing and analysis.

Use Case 2: Extracting Tables

Docling can identify and extract tables from documents, allowing you to:

  • Convert them to pandas DataFrames
  • Export them to CSV files
  • Save them as HTML tables

Here's how to extract tables from your document:

# Extract Tables from Documents in CSV and HTML Format.
import pandas as pd
from pathlib import Path

# Define output folder path
output_dir = Path("")

result = converter.convert(source)

# Get the filename
doc_filename = result.input.file.stem
print(f"Document filename: {doc_filename}")

# Iterate over tables in the document and save them as CSV and HTML formats.
for table_idx, table in enumerate(result.document.tables):
  table_df: pd.DataFrame = table.export_to_dataframe()
  print(f"$$ Table {table_idx}")

  # Save as CSV
  table_df.to_csv(f"{doc_filename}-table-{table_idx}.csv")

  # Save as HTML
  html_filename = output_dir / f"{doc_filename}-table-{table_idx+1}.html"
  with html_filename.open("w") as fp:
    fp.write(table.export_to_html())        

That’s all we need to do. Simple, right?


ray.so
You can copy my Google Colab Notebook to try Docling yourself.

?? So, how I would build document parsing pipelines? Here’s a 6-step process:

  1. Document Ingestion: Collect documents from various sources, such as local directories, cloud storage, or even web scraping.
  2. Parse with Docling: Use Docling to parse documents and convert them into JSON.
  3. Export Tables: Along with JSON structure, consider exporting tables from documents. They often contain valuable information.
  4. Post-Processing: Clean the extracted data, extract specific fields, transform, normalize, and categorize texts.
  5. Storage: Store the structured data in databases or data lakes for further analysis.
  6. Analysis and Reporting: Utilize the structured data for analytics, reporting, or even machine learning applications.

By following these steps, you can build data parsing pipelines that efficiently processes and analyzes information from diverse document sources.


?? Enjoyed the article? If you have questions or need further clarification, leave a comment below or reach out directly.


? Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.

Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.

?? You can also follow my work on LinkedIn Newsletter | Substack | My Resume


Ravin Maharaj

Oracle DBA at Outsourcing Business Solutions

4 个月

Thank you for sharing. Can this tool convert excel spreadsheets to a pdf format ? Are there any licensing implications

回复
Akash V.

................

4 个月

Nice tool anyone can have. Thank you so much for sharing this value addition tool. Hats off Lasha Dolenjashvili ??

要查看或添加评论,请登录

Lasha Dolenjashvili的更多文章