Building Document Parsing Pipelines with Python
Lasha Dolenjashvili
Data Solutions Architect @ Bank of Georgia | IIBA® Certified Business Analyst
Why Parse Documents?
Recently, I've been working on various document parsing challenges at work. Our systems generate many types of documents daily - text files, Word documents, Excel spreadsheets, and PDFs. These documents often contain valuable information that can improve data-driven decision-making.
Building automated document parsing pipelines can streamline the extraction of important information from countless files. These pipelines are systems that automatically extract and process document data, transforming it into usable formats.
Imagine having a tool that reads through all your files, identifies critical information, and organizes it systematically. Wouldn't that be valuable?
In this article, I'll introduce you to a package I recently discovered: Docling by IBM. It's a powerful solution for parsing documents and converting them into various formats including JSON, Markdown, tables, and plain text.
What is Docling?
As mentioned earlier, Docling is a Python library that parses various document formats - including PDF, DOCX, HTML, Markdown, and PPTX files - and exports their content into structured formats like JSON.
Key Features:
Parses multiple input formats, including PDF, DOCX, PPTX, HTML, and Markdown
Exports to structured outputs such as JSON, Markdown, tables, and plain text
Detects and extracts tables, which can be converted to pandas DataFrames
Runs locally, downloading its model artifacts once and caching them for subsequent runs
Practical Examples
Let's explore several examples of extracting information from documents using the Docling library.
Use Case 1 - Converting Documents to JSON
For this example, we'll use a PDF from arXiv.org (2410.23335).
Follow these steps:
Install the Docling library:
!pip install docling
Import and set up document converter:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
Next, specify the file path and convert the document:
source = "https://arxiv.org/pdf/2410.23335"
result = converter.convert(source)
When you run this code for the first time, Docling will download its model artifacts. The initial setup takes approximately 2.5 minutes, but this is a one-time process. Subsequent runs will use the cached models.
Once the model artifacts are loaded, export the document to a dictionary format and dump it into JSON:
import json
result_dict = result.document.export_to_dict()
print(json.dumps(result_dict, indent=2))
This generates a comprehensive dictionary containing the document's structured data, which can be easily converted to JSON format. The dictionary includes the document's content, structure, and metadata in a hierarchical format.
Formatting the output JSON reveals the document's structured data.
The structured JSON output can be integrated into your data engineering or analytics pipelines for further processing and analysis.
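As a sketch of what that downstream processing might look like, here is how you could pull the plain text out of the exported dictionary. Note an assumption: the top-level "texts" key and the per-item "text" field reflect Docling's exported JSON shape at the time of writing - verify the key names against your own output before relying on them.

```python
# Collect plain text from a Docling-style exported dictionary.
# Assumes a top-level "texts" list whose items carry a "text" field.

def collect_text(doc_dict: dict) -> str:
    items = doc_dict.get("texts", [])
    return "\n".join(item.get("text", "") for item in items)

# Minimal stand-in for `result_dict` to illustrate the shape:
sample = {
    "texts": [
        {"text": "Abstract"},
        {"text": "We present a new method."},
    ]
}
print(collect_text(sample))
```

Because the output is just a dictionary, helpers like this slot naturally into an ETL job or a notebook without any further Docling dependency.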
Use Case 2: Extracting Tables
Docling can identify and extract tables from documents, allowing you to:
Convert each table into a pandas DataFrame
Save tables as CSV files for further analysis
Export tables as HTML for display or reporting
Here's how to extract tables from your document:
# Extract tables from the document and save them in CSV and HTML formats.
import pandas as pd
from pathlib import Path

# Define the output folder path (current directory here)
output_dir = Path(".")

result = converter.convert(source)

# Get the filename (without extension) of the converted document
doc_filename = result.input.file.stem
print(f"Document filename: {doc_filename}")

# Iterate over the tables in the document and save each one as CSV and HTML.
for table_idx, table in enumerate(result.document.tables):
    table_df: pd.DataFrame = table.export_to_dataframe()
    print(f"## Table {table_idx}")

    # Save as CSV
    csv_filename = output_dir / f"{doc_filename}-table-{table_idx}.csv"
    table_df.to_csv(csv_filename, index=False)

    # Save as HTML
    html_filename = output_dir / f"{doc_filename}-table-{table_idx}.html"
    with html_filename.open("w") as fp:
        fp.write(table.export_to_html())
That’s all we need to do. Simple, right?
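Once the tables are on disk, a typical next step is loading them back for analysis. Here is a small pandas sketch; it assumes the CSV filenames follow the "<stem>-table-<n>.csv" pattern used above, and the helper name and sample data are my own, not part of Docling.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_extracted_tables(folder: str, stem: str) -> list[pd.DataFrame]:
    """Load every CSV saved as '<stem>-table-<n>.csv' from `folder`."""
    paths = sorted(Path(folder).glob(f"{stem}-table-*.csv"))
    return [pd.read_csv(p) for p in paths]

# Example: write one table to a throwaway folder, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    df = pd.DataFrame({"model": ["A", "B"], "score": [0.91, 0.87]})
    df.to_csv(Path(tmp) / "2410.23335-table-0.csv", index=False)
    tables = load_extracted_tables(tmp, "2410.23335")
    print(len(tables), list(tables[0].columns))
```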
You can copy my Google Colab Notebook to try Docling yourself.
So, how would I build document parsing pipelines? Here's a 6-step process:
By following these steps, you can build document parsing pipelines that efficiently process and analyze information from diverse document sources.
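A pipeline like this can be sketched as a small skeleton: discover files, convert each one, and persist the structured output. To keep the example self-contained, the `convert` parameter below is a hypothetical stand-in for whatever does the parsing - in practice you would pass a wrapper around Docling's DocumentConverter - and the function and folder names are my own.

```python
import json
from pathlib import Path
from typing import Callable

def run_pipeline(folder: str, convert: Callable[[str], dict], out_dir: str) -> int:
    """Discover PDFs in `folder`, convert each, and persist one JSON per input.

    `convert` is any callable mapping a file path to a dictionary,
    e.g. `lambda p: DocumentConverter().convert(p).document.export_to_dict()`.
    Returns the number of documents processed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(folder).glob("*.pdf")):
        doc = convert(str(path))                      # parse one document
        target = out / f"{path.stem}.json"            # one JSON file per input
        target.write_text(json.dumps(doc, indent=2))
        count += 1
    return count

# Example with a stub converter and a throwaway folder:
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.pdf").write_bytes(b"%PDF-1.4")  # placeholder file
    n = run_pipeline(tmp, lambda p: {"source": p}, f"{tmp}/out")
    print(n)  # number of documents processed
```

Keeping the converter injectable like this makes the pipeline easy to test without downloading Docling's models, and easy to swap for another parser later.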
Enjoyed the article? If you have questions or need further clarification, leave a comment below or reach out directly.
Thank you for reading my article on SA Space! I welcome any questions, comments, or suggestions you may have.
Keep Knowledge Flowing by following me for more content on Solutions Architecture, System Design, Data Engineering, Business Analysis, and more. Your engagement is appreciated.
You can also follow my work on LinkedIn Newsletter | Substack | My Resume