
Decoding Documents: A Deep Dive into Azure AI Document Intelligence

Shoeb Sayyed

PRINCE2 | Web | Mobile | Cloud | Automation

Published: December 22, 2024

Azure AI Document Intelligence is Microsoft's AI-based OCR and data extraction service. In this two-part series, I will guide you through using it to build and train a custom model that extracts metadata from research papers. For demonstration purposes, we will work on a small project that extracts four metadata fields: Author, Title, Published Date, and Publisher.

We will use the free tier (F0) for this demo to minimize costs. Part 1 focuses on understanding custom model features, while Part 2 explores LabellingUX, including document labeling, securing workflows, and improving the model.

Topics Covered

Part 1: Building and Validating a Custom Model

Introduction to Azure AI Document Intelligence
Project Overview: What We Are Building
Setting Up the Project
Preparing the Dataset: Creating the Training Set and Validation Set
Training the Custom Model: Uploading Data to Azure Blob Storage and Limiting Page Size
Validating the Model: Extracting Metadata from the Validation Set and Evaluating Performance

Part 2: Enhancing the Project with LabellingUX

Introduction to LabellingUX
Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json
Automating Labels with the Document Analysis API
Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model
Validating the Updated Model
Securing LabellingUX: Using MSAL.js for Secure Authentication

Introduction to Azure AI Document Intelligence

Azure AI Document Intelligence is a cloud-based service that leverages machine learning to extract text, tables, and key-value pairs from documents.

Key Features:

Automated Data Extraction: Quickly retrieves structured data from various document types, minimizing manual data entry.
Custom Model Training: Enables tailored models for structured and unstructured document formats.
Workflow Integration: Seamlessly integrates extracted data into business processes like invoice management.

Project Overview: What We Are Building

In this project, we will create a custom model using Azure AI Document Intelligence to automatically extract metadata from research papers available online. The targeted metadata fields are:

Author
Title
Published Date
Publisher

Setting Up the Project

Navigate to the Azure Portal and search for Document Intelligence. Click Create in the top-right corner.
Choose the F0 (Free Tier) pricing option to keep costs low.
Note down the keys and endpoint, as they will be needed in the next part.
Click Go to Document Intelligence Studio, which opens a new portal.
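
The keys and endpoint noted in the steps above are best kept out of source code. Here is a minimal sketch, assuming they are stored in environment variables named DOCINTEL_ENDPOINT and DOCINTEL_KEY. Both names, and the DocIntelConfig and load_config helpers, are hypothetical and chosen only for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DocIntelConfig:
    """Holds the two values copied from the Azure Portal."""
    endpoint: str
    key: str


def load_config(env=None):
    """Read the endpoint and key from an environment-style mapping.

    DOCINTEL_ENDPOINT and DOCINTEL_KEY are assumed names, not names
    defined by Azure.
    """
    env = os.environ if env is None else env
    # Drop a trailing slash so paths can be appended consistently later.
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip().rstrip("/")
    key = env.get("DOCINTEL_KEY", "").strip()
    if not endpoint.startswith("https://"):
        raise ValueError("DOCINTEL_ENDPOINT must be an https:// URL")
    if not key:
        raise ValueError("DOCINTEL_KEY is not set")
    return DocIntelConfig(endpoint=endpoint, key=key)
```

Calling load_config() with no arguments reads from the real environment; passing a dictionary makes the function easy to test.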

Select Custom Extraction Model and complete the following steps:

Create a new project (e.g., "researchpapers").
Select the latest GA version for the API.
Link your storage account and blob container to store the project’s files.

[Figure: Initial Document Intelligence project screen after creation]

Preparing the Dataset

We will source research papers from the following websites:


Training Set:

Select a small subset of documents for training purposes.

Validation Set:

Set aside another subset of documents to test the model.
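
The split into a training set and a validation set can be made reproducible with a few lines of code. This is a minimal sketch, assuming the papers are already downloaded as PDF files. The split_dataset name, the 20% validation share, and the seed are illustrative choices, not values from this project.

```python
import random


def split_dataset(filenames, validation_fraction=0.2, seed=42):
    """Split file names into (training, validation) lists.

    The fixed seed makes the split repeatable, so the validation
    documents never leak into a later training run.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    files = sorted(set(filenames))  # de-duplicate and fix the starting order
    if len(files) < 2:
        raise ValueError("need at least two documents to split")
    rng = random.Random(seed)
    rng.shuffle(files)
    # Keep at least one document for validation.
    n_val = max(1, round(len(files) * validation_fraction))
    return files[n_val:], files[:n_val]
```

With ten papers and the defaults, eight go to training and two are held back for validation.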

Training the Custom Model

Run Layout Analysis

Open Document Intelligence Studio and select your unanalyzed documents.
Click Run Layout to perform OCR on the documents.
This process extracts text and tables, along with the polygons that bound each recognized element.
A yellow background indicates that OCR is complete.
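
The polygons produced by the layout step are often easier to work with as simple rectangles. The sketch below assumes each polygon is a flat list of coordinates in the form [x1, y1, x2, y2, ...], as returned by recent API versions. The polygon_to_bbox helper is a hypothetical name for this example.

```python
def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon to (left, top, right, bottom)."""
    if len(polygon) < 6 or len(polygon) % 2 != 0:
        raise ValueError("polygon must hold at least three (x, y) pairs")
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))
```

The result is the smallest axis-aligned box that contains the polygon, which is useful for drawing overlays or cropping a region.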

Labeling the Documents

Add the metadata fields as labels. For Published Date, set the subtype to "date" with the desired format (e.g., dd/mm/yyyy).
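
Once a date has been extracted in dd/mm/yyyy form, downstream code usually wants it in an unambiguous format. This is a small sketch using only the Python standard library. The normalize_published_date name is hypothetical, and the default format string simply mirrors the dd/mm/yyyy example above.

```python
from datetime import datetime


def normalize_published_date(text, fmt="%d/%m/%Y"):
    """Parse a dd/mm/yyyy string and return it as ISO 8601 (yyyy-mm-dd).

    Raises ValueError if the text does not match the format or is not
    a real calendar date.
    """
    return datetime.strptime(text.strip(), fmt).date().isoformat()
```

For example, normalize_published_date("22/12/2024") returns "2024-12-22", while an impossible date such as "31/02/2024" raises ValueError.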

Training the Model

Click Train in the Document Intelligence Studio.
Enter a Model ID and a brief description.
Set the Build Mode to Neural.
Click Train to start the training process.

Note: Training takes time and incurs costs. It's advisable to train after labeling all necessary documents. Check the training progress in the Models section.
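
The same training step can also be started outside the Studio through the REST API. The sketch below only assembles the request. It assumes the v4.0 route and api-version 2024-11-30, so check both against the reference for the API version selected in your project. The build_model_request helper and the example model ID and container URL are hypothetical.

```python
API_VERSION = "2024-11-30"  # assumption: v4.0 GA REST API version


def build_model_request(endpoint, model_id, container_url,
                        description="", build_mode="neural"):
    """Return (url, body) for a custom model build request.

    container_url is the blob container holding the labeled training
    documents. This function sends nothing; it only prepares the call.
    """
    if build_mode not in ("template", "neural"):
        raise ValueError("build_mode must be 'template' or 'neural'")
    url = (f"{endpoint.rstrip('/')}/documentintelligence/"
           f"documentModels:build?api-version={API_VERSION}")
    body = {
        "modelId": model_id,
        "description": description,
        "buildMode": build_mode,
        "azureBlobSource": {"containerUrl": container_url},
    }
    return url, body
```

The body would be sent as JSON in a POST request, with the resource key supplied in the Ocp-Apim-Subscription-Key header.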

Validating the Model

Navigate to the Test section in Document Intelligence Studio.
Upload the validation set and click Run Analysis.
Review the results, including confidence scores for extracted metadata.
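
Reviewing confidence scores by eye works for a handful of documents, but a simple threshold check scales better. This is a minimal sketch, assuming the scores have been collected into a dictionary of field name to confidence. The flag_low_confidence name and the 0.8 threshold are illustrative choices, not recommendations from the service.

```python
def flag_low_confidence(confidences, threshold=0.8):
    """Return the names of fields that need a manual check.

    A field is flagged when its confidence is missing (None) or falls
    below the threshold.
    """
    flagged = [
        name
        for name, score in confidences.items()
        if score is None or score < threshold
    ]
    return sorted(flagged)
```

For example, flag_low_confidence({"Author": 0.95, "Title": 0.62, "Publisher": None}) returns ["Publisher", "Title"].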

Evaluation

The output JSON contains:

OCR content of the document.
Model version and metadata predictions.
Bounding regions for extracted fields.
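
The items listed above can be pulled out of the output JSON with the standard library. The sketch below assumes the response shape used by recent API versions, where predictions sit under analyzeResult.documents[].fields. SAMPLE_RESULT is made-up data for illustration, not real model output, and summarize_result is a hypothetical helper name.

```python
import json

# Made-up sample that mimics the shape of an analysis result.
SAMPLE_RESULT = json.dumps({
    "status": "succeeded",
    "analyzeResult": {
        "apiVersion": "2024-11-30",
        "modelId": "researchpapers-v1",
        "content": "A Sample Paper Title ...",
        "documents": [{
            "docType": "researchpapers-v1",
            "fields": {
                "Title": {
                    "type": "string",
                    "content": "A Sample Paper Title",
                    "confidence": 0.97,
                    "boundingRegions": [{
                        "pageNumber": 1,
                        "polygon": [1.0, 1.0, 6.5, 1.0, 6.5, 1.4, 1.0, 1.4],
                    }],
                },
            },
        }],
    },
})


def summarize_result(result_json):
    """Extract model info and per-field predictions from the output JSON."""
    result = json.loads(result_json)
    # Accept either the full response or just the analyzeResult object.
    analyze = result.get("analyzeResult", result)
    summary = {
        "modelId": analyze.get("modelId"),
        "apiVersion": analyze.get("apiVersion"),
        "fields": {},
    }
    for document in analyze.get("documents", []):
        for name, field in document.get("fields", {}).items():
            regions = field.get("boundingRegions", [])
            summary["fields"][name] = {
                "content": field.get("content"),
                "confidence": field.get("confidence"),
                "pages": [region.get("pageNumber") for region in regions],
            }
    return summary
```

Running summarize_result(SAMPLE_RESULT) gives a compact dictionary with the model ID, the API version, and one entry per extracted field.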
