Building Document Parsing Pipelines with Python

Lasha Dolenjashvili

Data Solutions Architect @ Bank of Georgia | IIBA® Certified Business Analyst | Open to Freelance, Remote, or Relocation Opportunities

Published: November 3, 2024

Why Parse Documents?

Recently, I've been working on various document parsing challenges at work. Our systems generate many types of documents daily - text files, Word documents, Excel spreadsheets, and PDFs. These documents often contain valuable information that can improve data-driven decision-making.

Building automated document parsing pipelines can streamline the extraction of important information from countless files. These pipelines are systems that automatically extract and process document data, transforming it into usable formats.

Imagine having a tool that reads through all your files, identifies critical information, and organizes it systematically. Wouldn't that be valuable?

In this article, I'll introduce you to a package I recently discovered: Docling by IBM. It's a powerful solution for parsing documents and converting them into various formats including JSON, Markdown, tables, and plain text.

What is Docling?

As mentioned earlier, Docling is a Python library that parses various document formats - including PDF, DOCX, HTML, Markdown, and PPTX files - and exports their content into structured formats like JSON.

Key Features:

Multi-Format Support: Processes multiple input formats and exports to JSON and Markdown
Advanced Document Understanding: Interprets page layouts, reading order, and table structures
Metadata Extraction: Retrieves document properties including titles, authors, references, and languages
Optional OCR Support: Converts scanned documents to machine-readable text
Table Extraction: Identifies and exports table structures to CSV or HTML formats

Learn more about Docling

Practical Examples

Let's explore several examples of extracting information from documents using the Docling library.

Use Case 1: Converting Documents to JSON

For this example, we'll use a PDF from arXiv.org (2410.23335).

Follow these steps:

Install the Docling library:

!pip install docling

Import and set up the document converter:

from docling.document_converter import DocumentConverter
converter = DocumentConverter()

Next, specify the file path and convert the document:

source = "https://arxiv.org/pdf/2410.23335"
result = converter.convert(source)

When you run this code for the first time, Docling will download its model artifacts. The initial setup takes approximately 2.5 minutes, but this is a one-time process. Subsequent runs will use the cached models.

Once the model artifacts are loaded, export the document to a dictionary format and dump it into JSON:

import json

result_dict = result.document.export_to_dict()

print(json.dumps(result_dict, indent=2))

This generates a comprehensive dictionary containing the document's content, structure, and metadata in a hierarchical format. Pretty-printing it with indent=2, as above, makes that hierarchy easy to inspect.

The structured JSON output can be integrated into your data engineering or analytics pipelines for further processing and analysis.
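As a rough sketch of that hand-off, the dictionary can be written to disk and scanned for specific elements. The helper names (save_as_json, texts_with_label) and the sample dictionary are illustrative and not part of Docling. The sketch also assumes the exported dictionary has a top-level "texts" list whose items carry "label" and "text" keys; check this against the output of your Docling version.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(doc_dict: dict, path: Path) -> Path:
    """Write the exported dictionary to a UTF-8 JSON file and return its path."""
    path.write_text(
        json.dumps(doc_dict, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path


def texts_with_label(doc_dict: dict, label: str) -> list[str]:
    """Collect the text of every item in the assumed 'texts' list with this label."""
    return [
        item.get("text", "")
        for item in doc_dict.get("texts", [])
        if item.get("label") == label
    ]


# Hypothetical stand-in for result.document.export_to_dict()
sample_dict = {
    "name": "example",
    "texts": [
        {"label": "section_header", "text": "Introduction"},
        {"label": "text", "text": "Some body text."},
        {"label": "section_header", "text": "Method"},
    ],
}

json_path = save_as_json(sample_dict, Path(tempfile.gettempdir()) / "example.json")
headers = texts_with_label(sample_dict, "section_header")
print(json_path.name, headers)
```

In a real run, you would pass result_dict from the previous step instead of sample_dict.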

Use Case 2: Extracting Tables

Docling can identify and extract tables from documents, allowing you to:

Convert them to pandas DataFrames
Export them to CSV files
Save them as HTML tables

Here's how to extract tables from your document:

# Extract tables from the document and save them in CSV and HTML formats.
import pandas as pd
from pathlib import Path

# Define the output folder (an empty path means the current working directory)
output_dir = Path("")
output_dir.mkdir(parents=True, exist_ok=True)

result = converter.convert(source)

# Get the file name without its extension
doc_filename = result.input.file.stem
print(f"Document filename: {doc_filename}")

# Iterate over the tables in the document and save each one as CSV and HTML.
for table_idx, table in enumerate(result.document.tables):
    table_df: pd.DataFrame = table.export_to_dataframe()
    print(f"## Table {table_idx + 1}")

    # Save as CSV (same folder and numbering as the HTML file)
    csv_filename = output_dir / f"{doc_filename}-table-{table_idx + 1}.csv"
    table_df.to_csv(csv_filename)

    # Save as HTML
    html_filename = output_dir / f"{doc_filename}-table-{table_idx + 1}.html"
    with html_filename.open("w", encoding="utf-8") as fp:
        fp.write(table.export_to_html())

That’s all we need to do. Simple, right?
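To sanity-check what was written, here is a minimal sketch that reads the exported CSV files back using only the standard library. It assumes the files follow the "<document name>-table-<n>.csv" naming used above; the function names and the demo file are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_table_csvs(folder: Path, doc_stem: str) -> dict[str, list[list[str]]]:
    """Read every '<doc_stem>-table-*.csv' file in folder into a list of rows."""
    tables = {}
    for csv_path in sorted(folder.glob(f"{doc_stem}-table-*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as fp:
            tables[csv_path.name] = list(csv.reader(fp))
    return tables


def summarize(tables: dict[str, list[list[str]]]) -> list[tuple[str, int, int]]:
    """Return (file name, row count including header, column count) per table."""
    return [
        (name, len(rows), len(rows[0]) if rows else 0)
        for name, rows in tables.items()
    ]


# Demo with a hypothetical file shaped like a pandas to_csv() export
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "paper-table-1.csv").write_text(
        ",a,b\n0,1,2\n1,3,4\n", encoding="utf-8"
    )
    summary = summarize(load_table_csvs(demo_dir, "paper"))

print(summary)
```

A quick summary like this helps spot empty or malformed tables before they move further down a pipeline.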

You can copy my Google Colab Notebook to try Docling yourself.

So, how would I build document parsing pipelines? Here’s a 6-step process:

Document Ingestion: Collect documents from various sources, such as local directories, cloud storage, or even web scraping.
Parse with Docling: Use Docling to parse documents and convert them into JSON.
Export Tables: Along with JSON structure, consider exporting tables from documents. They often contain valuable information.
Post-Processing: Clean the extracted data, extract specific fields, transform, normalize, and categorize texts.
Storage: Store the structured data in databases or data lakes for further analysis.
Analysis and Reporting: Utilize the structured data for analytics, reporting, or even machine learning applications.

By following these steps, you can build document parsing pipelines that efficiently process and analyze information from diverse document sources.
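The six steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference implementation: every function name is hypothetical, SQLite stands in for whatever database or data lake you use, and the post-processing assumes the exported dictionary exposes top-level "texts" and "tables" lists. The parser is passed in as a function so the skeleton can be exercised with a stand-in before Docling is involved.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Callable

SUPPORTED = {".pdf", ".docx", ".pptx", ".html", ".md"}


def ingest(folder: Path) -> list[Path]:
    """Step 1: collect candidate documents from a local directory."""
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def make_docling_parser() -> Callable[[Path], dict]:
    """Step 2: build a parser backed by Docling (imported lazily, created once)."""
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    return lambda path: converter.convert(path).document.export_to_dict()


def post_process(doc_dict: dict) -> dict:
    """Steps 3-4: count tables and normalize whitespace (assumed dictionary keys)."""
    texts = [" ".join(t.get("text", "").split()) for t in doc_dict.get("texts", [])]
    return {
        "n_tables": len(doc_dict.get("tables", [])),
        "text": "\n".join(t for t in texts if t),
    }


def store(conn: sqlite3.Connection, name: str, record: dict, doc_dict: dict) -> None:
    """Step 5: keep the cleaned fields and the raw JSON side by side."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(name TEXT PRIMARY KEY, n_tables INTEGER, text TEXT, raw_json TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
        (name, record["n_tables"], record["text"], json.dumps(doc_dict)),
    )
    conn.commit()


def run_pipeline(folder: Path, parse: Callable[[Path], dict], conn: sqlite3.Connection):
    """Run steps 1-5 for every document, then return a small report (step 6)."""
    for path in ingest(folder):
        doc_dict = parse(path)
        store(conn, path.name, post_process(doc_dict), doc_dict)
    return conn.execute(
        "SELECT name, n_tables, length(text) FROM documents ORDER BY name"
    ).fetchall()


def fake_parse(path: Path) -> dict:
    """Hypothetical stand-in for Docling so the sketch runs without model downloads."""
    return {"texts": [{"text": f"  Contents of   {path.name} "}], "tables": [{}]}


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pdf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    report = run_pipeline(Path(tmp), fake_parse, sqlite3.connect(":memory:"))

print(report)
```

To run it against real documents, swap fake_parse for make_docling_parser() and point run_pipeline at your own folder and database.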

Enjoyed the article? If you have questions or need further clarification, leave a comment below or reach out directly.

Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.

Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.

You can also follow my work on LinkedIn Newsletter | Substack | My Resume

SA Space

1,046 followers

Ravin Maharaj

Oracle DBA at Outsourcing Business Solutions

4 months ago

Thank you for sharing. Can this tool convert Excel spreadsheets to PDF format? Are there any licensing implications?

Akash V.

................

4 months ago

Nice tool anyone can have. Thank you so much for sharing this value addition tool. Hats off Lasha Dolenjashvili
