Decoding Documents: A Deep Dive into Azure AI Document Intelligence (Part 2)
TL;DR: In Part 2 of our series on Azure AI Document Intelligence, we delve deeper into enhancing your document processing workflows by leveraging Labelling UX and advanced integration techniques. This segment focuses on refining your custom models and ensuring secure, efficient data handling.
Topics covered
- Introduction to Labelling UX
- Using Labelling UX for data labeling: fields.json, .pdf.ocr.json, and .pdf.labels.json
- Automating labels with the Document Analysis API
- Training and validating a custom model
- Securing Labelling UX with MSAL.js
Introduction to Labelling UX
Labelling UX is a powerful, open-source web application designed to facilitate the manual labeling of documents for data extraction purposes. This intuitive tool empowers users to prepare high-quality training datasets for custom models by creating and assigning labels to various elements within documents, such as fields, tables, selection marks, and signatures.
With Labelling UX, users can interactively define key-value pairs, annotate tables with customized column and row names, and specify bounding boxes for text and selection marks. This level of granular control ensures precise labeling, enabling custom models to deliver accurate and reliable results during inference.
Using Labelling UX for Data Labeling
Labelling UX generates and uses specific files to manage labeled data, ensuring seamless training and analysis of custom models. Here's an overview of the purpose and structure of these files:
fields.json
This file defines the schema for the fields you plan to extract during labeling. It acts as a blueprint for your labels, specifying the names and types of data points (or "keys") you want to annotate.
Key Attributes:
{
  "fields": [
    { "name": "InvoiceNumber", "type": "string" },
    { "name": "DueDate", "type": "date", "subtype": "dmy" },
    { "name": "TotalAmount", "type": "number" }
  ]
}
Purpose: It tells both Labelling UX and the training API which fields to expect and how to interpret their values (for example, parsing DueDate as a day-month-year date), so every labeled document is annotated against the same schema.
.pdf.ocr.json
This file contains the results of Optical Character Recognition (OCR) performed on the document. It captures the textual content extracted from the document along with its positional and structural information. It can be generated with the prebuilt-layout analyze API:
POST https://{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview&features=ocrHighResolution
For more details, refer to Microsoft's documentation on the prebuilt layout model.
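If you prefer to script this step rather than use a REST client, the sketch below shows one way to call the endpoint with a plain HttpClient. This is a minimal sketch, not code from this article: the Ocp-Apim-Subscription-Key header, the 202 + Operation-Location response pattern, and the urlSource request body follow the service's standard REST flow, while the two-second polling delay is an arbitrary choice.

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class LayoutOcrClient
{
    public static async Task<JObject> AnalyzeLayoutAsync(HttpClient http, string endpoint, string apiKey, string documentUrl)
    {
        var analyzeUri = $"{endpoint}/documentintelligence/documentModels/prebuilt-layout:analyze" +
                         "?api-version=2024-02-29-preview&features=ocrHighResolution";

        // Submit the document by URL; the service replies 202 Accepted
        using var request = new HttpRequestMessage(HttpMethod.Post, analyzeUri)
        {
            Content = new StringContent(new JObject { ["urlSource"] = documentUrl }.ToString(), Encoding.UTF8, "application/json")
        };
        request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();

        // The Operation-Location header points at the asynchronous result
        var operationLocation = response.Headers.GetValues("Operation-Location").First();
        while (true)
        {
            await Task.Delay(TimeSpan.FromSeconds(2));
            using var poll = new HttpRequestMessage(HttpMethod.Get, operationLocation);
            poll.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
            var result = JObject.Parse(await (await http.SendAsync(poll)).Content.ReadAsStringAsync());
            var status = (string)result["status"];
            if (status == "succeeded") return result; // save result["analyzeResult"] as {file}.pdf.ocr.json
            if (status == "failed") throw new InvalidOperationException(result.ToString());
        }
    }
}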
Key Attributes:
{
  "pages": [
    {
      "pageNumber": 1,
      "width": 8.5,
      "height": 11,
      "unit": "inch",
      "paragraphs": [
        {
          "content": "Introduction to Azure AI Document Intelligence",
          "boundingRegions": [
            {
              "pageNumber": 1,
              "polygon": [0.1, 0.1, 0.9, 0.1, 0.9, 0.2, 0.1, 0.2]
            }
          ],
          "role": "title"
        }
      ],
      "words": [
        {
          "content": "Introduction",
          "polygon": [0.1, 0.1, 0.3, 0.1, 0.3, 0.15, 0.1, 0.15],
          "confidence": 0.99
        }
      ],
      "selectionMarks": [
        {
          "state": "selected",
          "polygon": [0.1, 0.3, 0.2, 0.3, 0.2, 0.4, 0.1, 0.4],
          "confidence": 0.95
        }
      ]
    }
  ]
}
Purpose: It supplies the recognized text and its geometry, which Labelling UX overlays on the rendered document so that the labels you draw can be anchored to exact words and selection marks.
.pdf.labels.json
The .pdf.labels.json file is integral to Azure's Document Intelligence, encapsulating labeled data for a specific document. This file contains the bounding boxes of identified fields along with their associated metadata, linking them to specific regions within the document. It is used in conjunction with .pdf.ocr.json to visualize and refine labeled data, forming the foundation for effective model training in Labelling UX.
Steps to Generate .pdf.labels.json
To bootstrap labels from an existing model, analyze the document with that model:
POST https://{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/{model-name}:analyze?api-version=2024-02-29-preview&stringIndexType=utf16CodeUnit
The output will include polygon coordinates for each extracted field, which can then be converted into label entries as shown in the next section.
Automating Labels with the Document Analysis API
The helper below looks up a word in the analysis result and converts its polygon into the normalized coordinates that .pdf.labels.json expects:

using Newtonsoft.Json.Linq;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class LabelsJsonGenerator
{
    public static List<List<string>> GetBoundingBoxes(string word, int pageNumber, JObject analysisResult)
    {
        var boundingBoxList = new List<List<string>>();

        // Get the page object by pageNumber
        var page = analysisResult["analyzeResult"]?["pages"]?
            .FirstOrDefault(p => (int)p["pageNumber"] == pageNumber);
        if (page == null)
        {
            return boundingBoxList;
        }

        // Find the first word object on the page whose content matches.
        // Note: matching by content alone returns the first occurrence;
        // duplicate words on a page would need positional disambiguation.
        var wordObject = page["words"]?
            .FirstOrDefault(w => w["content"]?.ToString() == word);
        if (wordObject == null)
        {
            return boundingBoxList;
        }

        // Extract the polygon ([x1, y1, x2, y2, ...]) and the page dimensions
        var polygon = wordObject["polygon"]?.ToObject<List<double>>();
        if (polygon == null || !polygon.Any())
        {
            return boundingBoxList;
        }
        double width = page["width"]?.ToObject<double>() ?? 1;
        double height = page["height"]?.ToObject<double>() ?? 1;

        // Normalize each coordinate pair by the page dimensions so the
        // bounding box is expressed as fractions of the page size
        var normalizedBoundingBox = new List<string>();
        for (int i = 0; i < polygon.Count; i += 2)
        {
            double normalizedX = polygon[i] / width;
            double normalizedY = polygon[i + 1] / height;
            normalizedBoundingBox.Add(normalizedX.ToString("F6", CultureInfo.InvariantCulture)); // six decimal places
            normalizedBoundingBox.Add(normalizedY.ToString("F6", CultureInfo.InvariantCulture));
        }
        boundingBoxList.Add(normalizedBoundingBox);
        return boundingBoxList;
    }
}
An example of this conversion is as follows:
Analysis Result Snippet (Input)
{
  "pages": [
    {
      "pageNumber": 1,
      "width": 850,
      "height": 1100,
      "unit": "pixel",
      "words": [
        {
          "content": "Machine",
          "polygon": [100.0, 200.0, 200.0, 200.0, 200.0, 250.0, 100.0, 250.0]
        }
      ]
    }
  ]
}
Converted Bounding Boxes (Output in .pdf.labels.json)
{
  "labels": [
    {
      "label": "Title",
      "value": [
        {
          "boundingBoxes": [
            [0.117647, 0.181818, 0.235294, 0.181818, 0.235294, 0.227273, 0.117647, 0.227273]
          ],
          "page": 1,
          "text": "Machine"
        }
      ],
      "labelType": "Words"
    }
  ]
}
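To tie the pieces together, here is a small, hypothetical driver that feeds the input above through GetBoundingBoxes. Note that the helper expects the pages array to sit under an analyzeResult wrapper, as it does in the raw API response:

using System;
using Newtonsoft.Json.Linq;

public static class Example
{
    public static void Main()
    {
        // The analysis-result snippet shown above, wrapped in "analyzeResult"
        // to match what GetBoundingBoxes expects
        var analysisResult = JObject.Parse(@"{
            ""analyzeResult"": {
                ""pages"": [
                    {
                        ""pageNumber"": 1,
                        ""width"": 850,
                        ""height"": 1100,
                        ""words"": [
                            { ""content"": ""Machine"",
                              ""polygon"": [100.0, 200.0, 200.0, 200.0, 200.0, 250.0, 100.0, 250.0] }
                        ]
                    }
                ]
            }
        }");

        var boxes = LabelsJsonGenerator.GetBoundingBoxes("Machine", 1, analysisResult);

        // Prints: 0.117647, 0.181818, 0.235294, 0.181818, 0.235294, 0.227273, 0.117647, 0.227273
        Console.WriteLine(string.Join(", ", boxes[0]));
    }
}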
Training and Validating the Updated Model
Once the required files (fields.json, .pdf.ocr.json, and .pdf.labels.json) are created, they can be uploaded to Azure Blob Storage. A Shared Access Signature (SAS) URL is then generated for the blob container, which is used to train the model.
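As a sketch of that upload step, using the Azure.Storage.Blobs SDK; the container name, connection string, and file names below are placeholders rather than values from this article, and the SAS URL itself can be generated from the Azure portal or via the SDK:

using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class TrainingDataUploader
{
    public static async Task UploadAsync(string connectionString)
    {
        // Hypothetical container holding the training set
        var container = new BlobContainerClient(connectionString, "training-data");
        await container.CreateIfNotExistsAsync();

        // Each document needs its PDF plus the generated OCR and label files;
        // fields.json is uploaded once per container
        foreach (var file in new[] { "fields.json", "invoice-01.pdf", "invoice-01.pdf.ocr.json", "invoice-01.pdf.labels.json" })
        {
            using var stream = File.OpenRead(file);
            await container.UploadBlobAsync(file, stream);
        }
    }
}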
Training a Custom Model
To train a custom model, use the following POST API call:
Endpoint:
{{documentIntelligenceEndpoint}}documentintelligence/documentModels:build?api-version=2024-02-29-preview
Payload Example:
{
"modelId": "model_nextVersion",
"description": "Description of the model",
"buildMode": "neural",
"azureBlobSource": {
"containerUrl": "{{blobSASURL}}",
"prefix": ""
}
}
Steps:
1. Send the build request with the payload above. The service responds with 202 Accepted and an Operation-Location header containing the operation ID.
2. Poll the operation endpoint until training completes:
GET {{documentIntelligenceEndpoint}}documentintelligence/operations/{operationId}?api-version=2024-02-29-preview
Training Status:
The response's status field moves from notStarted or running to succeeded (or failed).
Once the training is complete, the model can be validated against documents to ensure it correctly extracts the intended fields.
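For completeness, a minimal C# sketch of this train-and-poll flow might look like the following. The HttpClient plumbing mirrors the earlier analyze example; the model ID and five-second delay are illustrative, and reading result.modelId from the final status is an assumption based on the standard REST pattern:

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class ModelTrainer
{
    public static async Task<string> BuildAndWaitAsync(HttpClient http, string endpoint, string apiKey, string blobSasUrl)
    {
        var buildUri = $"{endpoint}/documentintelligence/documentModels:build?api-version=2024-02-29-preview";
        var payload = new JObject
        {
            ["modelId"] = "model_nextVersion",            // illustrative model name
            ["description"] = "Description of the model",
            ["buildMode"] = "neural",
            ["azureBlobSource"] = new JObject { ["containerUrl"] = blobSasUrl, ["prefix"] = "" }
        };

        using var request = new HttpRequestMessage(HttpMethod.Post, buildUri)
        {
            Content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json")
        };
        request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
        var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        var operationLocation = response.Headers.GetValues("Operation-Location").First();

        // Poll the operation until training succeeds or fails
        while (true)
        {
            await Task.Delay(TimeSpan.FromSeconds(5));
            using var poll = new HttpRequestMessage(HttpMethod.Get, operationLocation);
            poll.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
            var status = JObject.Parse(await (await http.SendAsync(poll)).Content.ReadAsStringAsync());
            var state = (string)status["status"];
            if (state == "succeeded") return (string)status["result"]?["modelId"];
            if (state == "failed") throw new InvalidOperationException(status.ToString());
        }
    }
}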
Securing Labelling UX: Using MSAL.js for Secure Authentication
Securing your Labelling UX application with MSAL.js involves integrating the Microsoft Authentication Library to manage user authentication seamlessly. Here's how to implement this. First, install the MSAL packages:
npm install @azure/msal-browser @azure/msal-react
Next, define the MSAL configuration (for example, in authConfig.js):

export const msalConfig = {
  auth: {
    clientId: 'YOUR_CLIENT_ID',
    authority: 'https://login.microsoftonline.com/YOUR_TENANT_ID',
    redirectUri: 'https://localhost:3000',
  },
};

export const loginRequest = {
  scopes: ['user.read'],
};
Replace 'YOUR_CLIENT_ID' and 'YOUR_TENANT_ID' with your actual Azure AD credentials. Then create the MSAL instance and wrap the application with MsalProvider in the entry point (index.tsx):
import * as React from "react";
import * as ReactDOM from "react-dom";
import { Provider } from "react-redux";
import { ConnectedRouter } from "connected-react-router";
import { history } from "./store/browserHistory";
import App from "./app";
import { store } from "store";
import { initializeIcons } from "@fluentui/react/lib/Icons";
import { PublicClientApplication } from '@azure/msal-browser';
import { MsalProvider } from '@azure/msal-react';
import { msalConfig } from './authConfig';
import "./index.scss";

initializeIcons();

// Create a single MSAL instance for the lifetime of the app
const msalInstance = new PublicClientApplication(msalConfig);

ReactDOM.render(
  <Provider store={store}>
    <MsalProvider instance={msalInstance}>
      <ConnectedRouter history={history}>
        <App />
      </ConnectedRouter>
    </MsalProvider>
  </Provider>,
  document.getElementById("root")
);
For a comprehensive understanding and additional examples, refer to the official Microsoft documentation for MSAL.js and the @azure/msal-react library.