Decoding Documents: A Deep Dive into Azure AI Document Intelligence (Part 2)


TL;DR: In Part 2 of our series on Azure AI Document Intelligence, we delve deeper into enhancing your document processing workflows by leveraging LabellingUX and advanced integration techniques. This segment focuses on refining your custom models and ensuring secure, efficient data handling.

Topics covered

  1. Introduction to LabellingUX
  2. Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json
  3. Automating Labels with the Document Analysis API
  4. Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model
  5. Validating the Updated Model
  6. Securing LabellingUX: Using MSAL.js for Secure Authentication
  7. Conclusion: Building Scalable and Secure Workflows


Introduction to LabellingUX

Form-Recognizer-Toolkit (GitHub)

Labelling UX is a powerful, open-source web application designed to facilitate the manual labeling of documents for data extraction purposes. This intuitive tool empowers users to prepare high-quality training datasets for custom models by creating and assigning labels to various elements within documents, such as fields, tables, selection marks, and signatures.

With Labelling UX, users can interactively define key-value pairs, annotate tables with customized column and row names, and specify bounding boxes for text and selection marks. This level of granular control ensures precise labeling, enabling custom models to deliver accurate and reliable results during inference.

Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json

Labelling UX generates and uses specific files to manage labeled data, ensuring seamless training and analysis of custom models. Here's an overview of the purpose and structure of these files:

fields.json

This file defines the schema for the fields you plan to extract during labeling. It acts as a blueprint for your labels, specifying the names and types of data points (or "keys") you want to annotate.

Key Attributes:

  • Field Names: Specifies the labels or keys (e.g., "Invoice Number," "Due Date").
  • Data Types: Defines the data type for each field, such as string (for text fields), number (for numeric fields), or date (for date fields).
  • Field Metadata: Provides additional information, such as constraints or subtypes.

{
  "fields": [
    { "name": "InvoiceNumber", "type": "string" },
    { "name": "DueDate", "type": "date", "subtype": "dmy" },
    { "name": "TotalAmount", "type": "number" }
  ]
}        

Purpose:

  • Used as a reference for labeling documents, ensuring consistency across annotations.
  • Guides the model during training to recognize and extract the defined fields.
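
Before labeling at scale, it can help to sanity-check fields.json programmatically. Below is a minimal C# sketch (using Newtonsoft.Json, like the converter later in this article) that loads the file and prints the declared schema; the file path is a placeholder.

using System;
using System.IO;
using Newtonsoft.Json.Linq;

public static class FieldsSchemaReader
{
    public static void Main()
    {
        // fields.json sits alongside your training documents; the path here is illustrative
        var fields = JObject.Parse(File.ReadAllText("fields.json"))["fields"];

        foreach (var field in fields)
        {
            // Print each declared label, its type, and the optional subtype
            var subtype = field["subtype"] != null ? $" ({field["subtype"]})" : "";
            Console.WriteLine($"{field["name"]} : {field["type"]}{subtype}");
        }
    }
}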


.pdf.ocr.json

This file contains the results of Optical Character Recognition (OCR) performed on the document. It captures the textual content extracted from the document along with its positional and structural information. This output can be obtained from the POST prebuilt-layout:analyze API:

{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview&features=ocrHighResolution        

For more details, you can check out this article from Microsoft.
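
As a rough illustration, this call can be made with plain HttpClient: POST the document reference, then poll the Operation-Location header until analysis completes. This is a minimal sketch; the endpoint, key, and document URL are placeholders you must supply, and the final JSON can be saved as <file>.pdf.ocr.json.

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class LayoutAnalyzer
{
    public static async Task<string> AnalyzeAsync(string endpoint, string apiKey, string documentUrl)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        // Submit the document for layout analysis (the service replies 202 Accepted)
        var url = $"{endpoint}/documentintelligence/documentModels/prebuilt-layout:analyze" +
                  "?api-version=2024-02-29-preview&features=ocrHighResolution";
        var body = new JObject { ["urlSource"] = documentUrl };
        var response = await client.PostAsync(url,
            new StringContent(body.ToString(), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        // Poll the Operation-Location header until the analysis finishes
        var operationLocation = response.Headers.GetValues("Operation-Location").First();
        while (true)
        {
            await Task.Delay(TimeSpan.FromSeconds(2));
            var result = await client.GetStringAsync(operationLocation);
            var status = (string)JObject.Parse(result)["status"];
            if (status == "succeeded") return result; // save this as <file>.pdf.ocr.json
            if (status == "failed") throw new Exception("Layout analysis failed.");
        }
    }
}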

Key Attributes:

  • Pages: An array where each entry corresponds to a page in the PDF, detailing its dimensions and orientation.
  • Paragraphs: Blocks of text identified within the document, each accompanied by its content, bounding regions, and, if applicable, a designated role (e.g., title, section heading).
  • Words: Individual words extracted from the document, each with its content and confidence score.
  • Selection Marks: Detected checkboxes or similar elements, indicating their state (selected or unselected) and positional data.

{
  "pages": [
    {
      "pageNumber": 1,
      "width": 8.5,
      "height": 11,
      "unit": "inch",
      "paragraphs": [
        {
          "content": "Introduction to Azure AI Document Intelligence",
          "boundingRegions": [
            {
              "pageNumber": 1,
              "polygon": [0.1, 0.1, 0.9, 0.1, 0.9, 0.2, 0.1, 0.2]
            }
          ],
          "role": "title"
        }
      ],
      "words": [
        {
          "content": "Introduction",
          "boundingBox": [0.1, 0.1, 0.3, 0.1, 0.3, 0.15, 0.1, 0.15],
          "confidence": 0.99
        }
      ],
      "selectionMarks": [
        {
          "state": "selected",
          "boundingBox": [0.1, 0.3, 0.2, 0.3, 0.2, 0.4, 0.1, 0.4],
          "confidence": 0.95
        }
      ]
    }
  ]
}
        

Purpose:

  • Serves as the foundation for labeling by mapping the extracted text to its physical location in the document.
  • Enables visualization of text elements (e.g., yellow background for text, pink borders for selection marks) in Labelling UX.

.pdf.labels.json

The .pdf.labels.json file is integral to Azure's Document Intelligence, encapsulating labeled data for a specific document. This file contains the bounding boxes of identified fields along with their associated metadata, linking them to specific regions within the document. It is used in conjunction with .pdf.ocr.json to visualize and refine labeled data, forming the foundation for effective model training in LabellingUX.

Steps to Generate .pdf.labels.json

  • Call the Document Intelligence Analysis API: Use the POST Analyze API to extract the content of the document. The API returns a JSON response with detected elements, including bounding polygons for text, tables, and other structures.

{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/{model-name}:analyze?api-version=2024-02-29-preview&stringIndexType=utf16CodeUnit
        

The output will include polygon coordinates for each extracted field.


Automating Labels with the Document Analysis API

  • Convert Polygon Coordinates to Bounding Boxes: Use the dimensions of the document's pages (width and height) to normalize the coordinates. This ensures compatibility with LabellingUX, where the bounding boxes are scaled relative to the document dimensions.

using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class LabelsJsonGenerator
{
    public static List<List<string>> GetBoundingBoxes(string word, int pageNumber, JObject analysisResult)
    {
        var boundingBoxList = new List<List<string>>();

        // Get the page object by pageNumber
        var page = analysisResult["analyzeResult"]["pages"]
            .FirstOrDefault(p => (int)p["pageNumber"] == pageNumber);

        if (page == null)
        {
            return boundingBoxList; // page not found: return the empty list
        }

        // Find the word object on the specified page
        var wordObject = page["words"]
            .FirstOrDefault(w => w["content"]?.ToString() == word);

        if (wordObject == null)
        {
            return boundingBoxList; // word not found on this page
        }

        // Extract polygon and dimensions
        var polygon = wordObject["polygon"]?.ToObject<List<double>>();
        if (polygon == null || !polygon.Any())
        {
            return boundingBoxList; // no polygon data available
        }

        double width = page["width"]?.ToObject<double>() ?? 1;
        double height = page["height"]?.ToObject<double>() ?? 1;

        // Convert polygon coordinates to bounding box
        var normalizedBoundingBox = new List<string>();
        for (int i = 0; i < polygon.Count; i += 2)
        {
            double normalizedX = polygon[i] / width;
            double normalizedY = polygon[i + 1] / height;

            normalizedBoundingBox.Add(normalizedX.ToString("F6", CultureInfo.InvariantCulture)); // six decimal places; invariant culture keeps the "." separator JSON expects
            normalizedBoundingBox.Add(normalizedY.ToString("F6", CultureInfo.InvariantCulture));
        }

        boundingBoxList.Add(normalizedBoundingBox);
        return boundingBoxList;
    }
}
        
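
For illustration, here is how the helper above might be invoked against a saved analysis response; the file name and the word being looked up are hypothetical.

using System;
using System.IO;
using Newtonsoft.Json.Linq;

public static class BoundingBoxDemo
{
    public static void Main()
    {
        // Hypothetical file: the full operation result (including "analyzeResult") saved to disk
        var analysisResult = JObject.Parse(File.ReadAllText("invoice.pdf.analysis.json"));

        // Normalized bounding box(es) for the word "Machine" on page 1
        var boxes = LabelsJsonGenerator.GetBoundingBoxes("Machine", 1, analysisResult);
        foreach (var box in boxes)
        {
            Console.WriteLine(string.Join(", ", box));
        }
    }
}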


labels.json

An example of this conversion follows:

Analysis Result Snippet (Input)

{
  "pages": [
    {
      "pageNumber": 1,
      "width": 8.5,
      "height": 11,
      "words": [
        {
          "content": "Machine",
          "polygon": [100.0, 200.0, 200.0, 200.0, 200.0, 250.0, 100.0, 250.0]
        }
      ]
    }
  ]
}
        

Converted Bounding Boxes (Output in .pdf.labels.json)

{
  "labels": [
    {
      "label": "Title",
      "value": [
        {
          "boundingBoxes": [
            [0.117647, 0.181818, 0.235294, 0.181818, 0.235294, 0.227273, 0.117647, 0.227273]
          ],
          "page": 1,
          "text": "Machine"
        }
      ],
      "labelType": "Words"
    }
  ]
}
        

  • Save the .pdf.labels.json File: Save the file using the naming convention <original file name>.labels.json. It should reside in the same directory as the document, the .ocr.json file, and fields.json (see the sketch below).
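
Tying these steps together, the following sketch assembles the structure shown above and saves it with that naming convention; the label name, text, and file path are illustrative.

using Newtonsoft.Json;
using System.Collections.Generic;
using System.IO;

public static class LabelsFileWriter
{
    public static void Save(string documentPath, string labelName, string text,
                            int page, List<List<string>> boundingBoxes)
    {
        // Mirror the labels.json shape shown in the output example above
        var labelsFile = new
        {
            labels = new[]
            {
                new
                {
                    label = labelName,
                    value = new[] { new { boundingBoxes, page, text } },
                    labelType = "Words"
                }
            }
        };

        // e.g. "invoice.pdf" becomes "invoice.pdf.labels.json", stored next to the document
        File.WriteAllText(documentPath + ".labels.json",
            JsonConvert.SerializeObject(labelsFile, Formatting.Indented));
    }
}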


Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model

Once the required files (fields.json, .pdf.ocr.json, and .pdf.labels.json) are created, they can be uploaded to Azure Blob Storage. A Shared Access Signature (SAS) URL is then generated for the blob container, which is used to train the model.

Training a Custom Model

To train a custom model, use the following POST API call:

Endpoint:

{{documentIntelligenceEndpoint}}documentintelligence/documentModels:build?api-version=2024-02-29-preview

Payload Example:

{
    "modelId": "model_nextVersion",
    "description": "Description of the model",
    "buildMode": "neural",
    "azureBlobSource": {
        "containerUrl": "{{blobSASURL}}",
        "prefix": ""
    }
}        

Steps:

  • Submit the API request with the SAS URL pointing to the Azure Blob container that holds the training files.
  • The response includes an Operation-Location header of the form:

{{documentIntelligenceEndpoint}}documentintelligence/operations/{operationId}?api-version=2024-02-29-preview

  • Use the Operation-Location URL to check the training status by making a GET request, as sketched below.
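
Under the same placeholder assumptions as the earlier analyze sketch (endpoint and key supplied by you), the build-and-poll flow might look like this:

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class ModelTrainer
{
    public static async Task<string> BuildAsync(string endpoint, string apiKey, string blobSasUrl)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        // Same payload as the example above; the SAS URL points at the labeled training files
        var payload = new JObject
        {
            ["modelId"] = "model_nextVersion",
            ["description"] = "Description of the model",
            ["buildMode"] = "neural",
            ["azureBlobSource"] = new JObject { ["containerUrl"] = blobSasUrl, ["prefix"] = "" }
        };

        // Kick off training; the response carries an Operation-Location header
        var response = await client.PostAsync(
            $"{endpoint}/documentintelligence/documentModels:build?api-version=2024-02-29-preview",
            new StringContent(payload.ToString(), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        // Poll until the operation leaves the "running" state
        var operationLocation = response.Headers.GetValues("Operation-Location").First();
        string status;
        do
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            status = (string)JObject.Parse(await client.GetStringAsync(operationLocation))["status"];
        } while (status == "running" || status == "notStarted");

        return status; // "succeeded" means the model is ready for use
    }
}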

Training Status:

  • running: Indicates the model is still being trained.
  • succeeded: Indicates the model has been successfully trained and is ready for use.

Validating the Updated Model

Once the training is complete, the model can be validated against documents to ensure it correctly extracts the intended fields.
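
One lightweight way to validate is to analyze a sample document with the newly trained model and check that every expected field came back with a reasonable confidence. The sketch below assumes you already have the completed analysis JSON, obtained with the same submit-and-poll pattern shown earlier but using your custom model name instead of prebuilt-layout.

using System;
using Newtonsoft.Json.Linq;

public static class ModelValidator
{
    // analysisJson: the completed analyze result returned for the custom model
    public static void CheckFields(string analysisJson, params string[] expectedFields)
    {
        var fields = JObject.Parse(analysisJson)["analyzeResult"]?["documents"]?[0]?["fields"];

        foreach (var name in expectedFields)
        {
            var field = fields?[name];
            if (field == null)
            {
                Console.WriteLine($"{name}: MISSING");
                continue;
            }

            double confidence = field["confidence"]?.ToObject<double>() ?? 0;
            Console.WriteLine($"{name}: '{field["content"]}' (confidence {confidence:F2})");
        }
    }
}

For example, CheckFields(resultJson, "InvoiceNumber", "DueDate", "TotalAmount") would flag any of the three schema fields that the updated model failed to extract.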


Securing LabellingUX: Using MSAL.js for Secure Authentication

Securing your LabellingUX application with MSAL.js involves integrating the Microsoft Authentication Library to manage user authentication seamlessly. Here's how to implement this:

  • Install MSAL Packages: Begin by installing the necessary MSAL packages. Execute the following command in your project directory:

npm install @azure/msal-browser @azure/msal-react        

  • Configure MSAL: Create a configuration file, typically named authConfig.js, and populate it with your Azure Active Directory (AAD) credentials:

export const msalConfig = {
  auth: {
    clientId: 'YOUR_CLIENT_ID',
    authority: 'https://login.microsoftonline.com/YOUR_TENANT_ID',
    redirectUri: 'https://localhost:3000',
  },
};

export const loginRequest = {
  scopes: ['user.read'],
};        

Replace 'YOUR_CLIENT_ID' and 'YOUR_TENANT_ID' with your actual Azure AD credentials.

  • Initialize MSAL in Your Application: In your index.tsx, import the necessary MSAL components, create a PublicClientApplication instance from msalConfig, and wrap your application with the MsalProvider:

import * as React from "react";
import * as ReactDOM from "react-dom";
import { Provider } from "react-redux";
import { ConnectedRouter } from "connected-react-router";
import { history } from "./store/browserHistory";
import App from "./app";
import { store } from "store";
import { initializeIcons } from "@fluentui/react/lib/Icons";
import { PublicClientApplication } from '@azure/msal-browser';
import { MsalProvider } from '@azure/msal-react';
import { msalConfig } from './authConfig';

import "./index.scss";

// Create the MSAL instance once, outside the React tree
const msalInstance = new PublicClientApplication(msalConfig);

initializeIcons();
ReactDOM.render(
    <MsalProvider instance={msalInstance}>
        <Provider store={store}>
            <ConnectedRouter history={history}>
                <App />
            </ConnectedRouter>
        </Provider>
    </MsalProvider>,
    document.getElementById("root")
);

For a comprehensive understanding and additional examples, refer to the official Microsoft documentation.


