Decoding Documents: A Deep Dive into Azure AI Document Intelligence

Azure AI Document Intelligence is Microsoft's AI-based OCR and data extraction service. In this two-part series, I will guide you through using it to build and train a custom model that extracts metadata from research papers. For demonstration purposes, we will work on a small project that extracts the following metadata fields: Author, Title, Published Date, and Publisher.

We will use the F0 (free) tier for this demo to minimize costs. Part 1 focuses on understanding the custom model features, while Part 2 explores LabellingUX, including document labeling, securing workflows, and improving the model.

Topics Covered

Part 1: Building and Validating a Custom Model

  1. Introduction to Azure AI Document Intelligence
  2. Project Overview: What We Are Building
  3. Setting Up the Project
  4. Preparing the Dataset: Creating the Training Set and Validation Set
  5. Training the Custom Model: Uploading Data to Azure Blob Storage and Limiting Page Size
  6. Validating the Model: Extracting Metadata from the Validation Set and Evaluating Performance

Part 2: Enhancing the Project with LabellingUX

  1. Introduction to LabellingUX
  2. Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json
  3. Automating Labels with the Document Analysis API
  4. Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model
  5. Validating the Updated Model
  6. Securing LabellingUX: Using MSAL.js for Secure Authentication


Introduction to Azure AI Document Intelligence

Azure AI Document Intelligence is a cloud-based service that leverages machine learning to extract text, tables, and key-value pairs from documents.

Key Features:

  • Automated Data Extraction: Quickly retrieves structured data from various document types, minimizing manual data entry.
  • Custom Model Training: Enables tailored models for structured and unstructured document formats.
  • Workflow Integration: Seamlessly integrates extracted data into business processes like invoice management.


Project Overview: What We Are Building

In this project, we will create a custom model using Azure AI Document Intelligence to automatically extract metadata from research papers available online. The targeted metadata fields are:

  1. Author
  2. Title
  3. Published Date
  4. Publisher


Setting Up the Project


[Figure: Document Intelligence resource]

  • Navigate to the Azure Portal and search for Document Intelligence. Click Create in the top-right corner.
  • Choose the F0 (Free Tier) pricing option to keep costs low.
  • Note down the keys and endpoint, as they will be needed in the next part.
  • Click Go to Document Intelligence Studio, which opens a new portal.
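
One way to keep the key and endpoint out of your scripts is to read them from environment variables. The snippet below is a minimal sketch, not an official pattern: the variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY and the example endpoint are assumptions made for illustration. The Ocp-Apim-Subscription-Key header is the one used for key-based authentication against the REST API.

```python
import os


def load_settings(env=None):
    """Return (endpoint, headers) for key-based calls to the service."""
    env = os.environ if env is None else env
    # DOCINTEL_ENDPOINT and DOCINTEL_KEY are hypothetical names chosen for this demo.
    endpoint = env["DOCINTEL_ENDPOINT"].rstrip("/")
    key = env["DOCINTEL_KEY"]
    # Key-based authentication passes the resource key in this header.
    headers = {"Ocp-Apim-Subscription-Key": key}
    return endpoint, headers


# Example with placeholder values instead of real credentials.
endpoint, headers = load_settings({
    "DOCINTEL_ENDPOINT": "https://example.cognitiveservices.azure.com/",
    "DOCINTEL_KEY": "<your-key>",
})
print(endpoint)
```

Passing a dictionary makes the helper easy to try without touching your real environment. Calling load_settings() with no arguments reads from os.environ instead.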


Select Custom Extraction Model and complete the following steps:

  • Create a new project (e.g., "researchpapers").
  • Select the latest GA version for the API.
  • Link your storage account and blob container to store the project’s files.


[Figure: Initial Document Intelligence project screen after creation]

[Figure: Fields to capture metadata]

Preparing the Dataset

We will source research papers from the following websites:

Training Set:

Select a small subset of documents for training purposes.

Validation Set:

Set aside another subset of documents to test the model.
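
If you have downloaded the papers locally, the split into the two sets can be made reproducible with a few lines of Python. This is only a sketch under assumptions: the file names, the 20% validation share, and the seed are illustrative choices, not values prescribed by the service.

```python
import random


def split_dataset(paths, validation_fraction=0.2, seed=42):
    """Split file paths into (training, validation) lists, reproducibly."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = sorted(paths)               # stable starting order
    random.Random(seed).shuffle(shuffled)  # seeded shuffle gives the same split each run
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names, for illustration only.
papers = [f"paper_{i:02d}.pdf" for i in range(1, 11)]
training, validation = split_dataset(papers)
print(len(training), "training /", len(validation), "validation")
```

Keeping the validation papers out of training matters here, because the confidence scores we review later are only meaningful on documents the model has not seen.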

Training the Custom Model

Run Layout Analysis

  1. Open Document Intelligence Studio and select your unanalyzed documents.
  2. Click Run Layout to perform OCR on the documents.
  3. The layout run extracts text and tables, along with the polygons that bound each recognized element.
  4. A yellow background on the document indicates that OCR is complete.
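
To make the polygons concrete, here is a small sketch that turns one into a simple bounding box. It assumes the flat [x1, y1, x2, y2, ...] coordinate list used by recent API versions. The sample coordinates are made up for illustration.

```python
def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon to (min_x, min_y, max_x, max_y)."""
    if len(polygon) < 2 or len(polygon) % 2:
        raise ValueError("polygon must hold a non-empty, even number of coordinates")
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative polygon for a title line (values are invented).
sample_polygon = [1.0, 1.2, 4.5, 1.2, 4.5, 1.6, 1.0, 1.6]
print(polygon_to_bbox(sample_polygon))
```

An axis-aligned box like this is usually enough for drawing highlights over a page image, although it loses the rotation information a skewed scan would carry.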


Labeling the Documents

Add the four metadata fields as labels. For Published Date, set the sub-type to "date" and choose the desired format (e.g., dd/mm/yyyy).
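
When you later consume the extracted Published Date as text, it can help to check that it really parses in the chosen format. The helper below is a hypothetical sketch that assumes the value arrives as a dd/mm/yyyy string. It is not part of the service itself.

```python
from datetime import datetime


def parse_published_date(text):
    """Return a date for a dd/mm/yyyy string, or None if it does not parse."""
    try:
        return datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        # Wrong format or an impossible calendar date.
        return None


print(parse_published_date("05/03/2021"))
print(parse_published_date("2021-03-05"))
```

Returning None instead of raising lets you collect the problem values and fix their labels in one pass.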


[Figure: Adding a sub-type]


Training the Model

  1. Click Train in the Document Intelligence Studio.
  2. Enter a Model ID and a brief description.
  3. Set the Build Mode to Neural.
  4. Click Train to start the training process.

[Figure: Training a model]

Note: Training takes time and incurs costs. It's advisable to train after labeling all necessary documents. Check the training progress in the Models section.
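
The same training run can also be started through the REST API rather than the Studio. The sketch below only assembles the request and sends nothing. It assumes the 2023-07-31 API version and its formrecognizer route, so check the path and version against the API version you selected for the project. The model ID, description, and container URL are placeholders.

```python
import json

API_VERSION = "2023-07-31"  # assumption: adjust to the API version your project uses


def build_model_request(endpoint, model_id, container_url, description="", prefix=""):
    """Return (url, json_body) for a neural custom-model build request."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/documentModels:build"
           f"?api-version={API_VERSION}")
    body = {
        "modelId": model_id,
        "description": description,
        "buildMode": "neural",  # matches the Build Mode chosen in the Studio
        "azureBlobSource": {"containerUrl": container_url},
    }
    if prefix:
        # Optional folder prefix inside the container that holds the labeled files.
        body["azureBlobSource"]["prefix"] = prefix
    return url, json.dumps(body)


# Placeholder values, for illustration only.
url, body = build_model_request(
    "https://example.cognitiveservices.azure.com",
    "researchpapers-v1",
    "<container-sas-url>",
    description="Research paper metadata",
)
print(url)
```

You would send this body with a POST request together with the key header. Building the payload separately makes it easy to inspect before you spend training time on it.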


[Figure: Training model]

Validating the Model

  1. Navigate to the Test section in Document Intelligence Studio.
  2. Upload the validation set and click Run Analysis.
  3. Review the results, including confidence scores for extracted metadata.
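
Outside the Studio, the equivalent of Run Analysis is an analyze request against the trained model. The following sketch prepares such a request for a local PDF without sending it. The 2023-07-31 API version, the route, and the model ID are assumptions for this demo. The bytes used in the example are a stand-in for a real file.

```python
import base64
import json


def build_analyze_request(endpoint, model_id, pdf_bytes, api_version="2023-07-31"):
    """Return (url, json_body) for analyzing one document with a custom model."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
           f"{model_id}:analyze?api-version={api_version}")
    # The document content is sent inline as base64 text.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return url, json.dumps({"base64Source": encoded})


# Stand-in bytes; in practice read them with open(path, "rb").read().
analyze_url, analyze_body = build_analyze_request(
    "https://example.cognitiveservices.azure.com", "researchpapers-v1", b"%PDF-1.7 demo"
)
print(analyze_url)
```

The service handles analysis asynchronously: the POST returns an Operation-Location header, and you poll that URL until the status is "succeeded" to obtain the result JSON.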


[Figure: Validation result]

Evaluation

The output JSON contains:

  • OCR content of the document.
  • Model version and metadata predictions.
  • Bounding regions for extracted fields.
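
As a rough illustration of how these parts fit together, the sketch below walks a trimmed-down result and flags fields with low confidence for manual review. The sample values are invented for the demo, and the 0.8 threshold is an arbitrary assumption. A real response contains many more properties.

```python
def summarize_fields(result, threshold=0.8):
    """List each extracted field with its confidence and a review flag."""
    rows = []
    for doc in result.get("analyzeResult", {}).get("documents", []):
        for name, field in doc.get("fields", {}).items():
            confidence = field.get("confidence")
            regions = field.get("boundingRegions", [])
            rows.append({
                "field": name,
                "content": field.get("content"),
                "confidence": confidence,
                "page": regions[0]["pageNumber"] if regions else None,
                # Flag anything without a score or below the chosen threshold.
                "needs_review": confidence is None or confidence < threshold,
            })
    return rows


# Invented sample that mimics the shape of an analysis result.
sample_result = {"analyzeResult": {"modelId": "researchpapers-v1", "documents": [{"fields": {
    "Title": {"content": "A Sample Paper", "confidence": 0.95,
              "boundingRegions": [{"pageNumber": 1, "polygon": [1, 1, 5, 1, 5, 2, 1, 2]}]},
    "Publisher": {"content": "Example Press", "confidence": 0.42,
                  "boundingRegions": [{"pageNumber": 1, "polygon": [1, 9, 3, 9, 3, 10, 1, 10]}]},
}}]}}

for row in summarize_fields(sample_result):
    print(row)
```

Fields flagged for review are good candidates for additional labeled examples before the next training run.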

[Figure: Analysis result]

