Decoding Documents: A Deep Dive into Azure AI Document Intelligence (Part 2)


TL;DR: In Part 2 of our series on Azure AI Document Intelligence, we delve deeper into enhancing your document processing workflows by leveraging LabellingUX and advanced integration techniques. This segment focuses on refining your custom models and ensuring secure, efficient data handling.

Topics covered

  1. Introduction to LabellingUX
  2. Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json
  3. Automating Labels with the Document Analysis API
  4. Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model
  5. Validating the Updated Model
  6. Securing LabellingUX: Using MSAL.js for Secure Authentication
  7. Conclusion: Building Scalable and Secure Workflows


Introduction to LabellingUX

Form-Recognizer-Toolkit (GitHub)

Labelling UX is a powerful, open-source web application designed to facilitate the manual labeling of documents for data extraction purposes. This intuitive tool empowers users to prepare high-quality training datasets for custom models by creating and assigning labels to various elements within documents, such as fields, tables, selection marks, and signatures.

With Labelling UX, users can interactively define key-value pairs, annotate tables with customized column and row names, and specify bounding boxes for text and selection marks. This level of granular control ensures precise labeling, enabling custom models to deliver accurate and reliable results during inference.

Using LabellingUX for Data Labeling: Understanding fields.json, .pdf.ocr.json, and .pdf.labels.json

Labelling UX generates and uses specific files to manage labeled data, ensuring seamless training and analysis of custom models. Here's an overview of the purpose and structure of these files:

fields.json

This file defines the schema for the fields you plan to extract during labeling. It acts as a blueprint for your labels, specifying the names and types of data points (or "keys") you want to annotate.

Key Attributes:

  • Field Names: Specifies the labels or keys (e.g., "Invoice Number," "Due Date").
  • Data Types: Defines the data type for each field, such as string (for text fields), number (for numeric fields), or date (for date fields).
  • Field Metadata: Provides additional information, such as constraints or subtypes.

{
  "fields": [
    { "name": "InvoiceNumber", "type": "string" },
    { "name": "DueDate", "type": "date", "subtype": "dmy" },
    { "name": "TotalAmount", "type": "number" }
  ]
}        

Purpose:

  • Used as a reference for labeling documents, ensuring consistency across annotations.
  • Guides the model during training to recognize and extract the defined fields.
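
Before labeling at scale, it can help to sanity-check fields.json programmatically. Below is a minimal C# sketch (using Newtonsoft.Json, like the converter later in this article) that loads the file and prints the declared schema; the file path is a placeholder.

using System;
using System.IO;
using Newtonsoft.Json.Linq;

public static class FieldsSchemaReader
{
    public static void Main()
    {
        // fields.json sits alongside your training documents; the path here is illustrative
        var fields = JObject.Parse(File.ReadAllText("fields.json"))["fields"];

        foreach (var field in fields)
        {
            // Print each declared label, its type, and the optional subtype
            var subtype = field["subtype"] != null ? $" ({field["subtype"]})" : "";
            Console.WriteLine($"{field["name"]} : {field["type"]}{subtype}");
        }
    }
}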


.pdf.ocr.json

This file contains the results of Optical Character Recognition (OCR) performed on the document. It captures the textual content extracted from the document along with its positional and structural information. This output can be obtained from the POST prebuilt-layout:analyze API:

{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview&features=ocrHighResolution        

For more details, you can check out this article from Microsoft.
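
As a rough illustration, this call can be made with plain HttpClient: POST the document reference, then poll the Operation-Location header until analysis completes. This is a minimal sketch; the endpoint, key, and document URL are placeholders you must supply, and the final JSON can be saved as <file>.pdf.ocr.json.

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class LayoutAnalyzer
{
    public static async Task<string> AnalyzeAsync(string endpoint, string apiKey, string documentUrl)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        // Submit the document for layout analysis (the service replies 202 Accepted)
        var url = $"{endpoint}/documentintelligence/documentModels/prebuilt-layout:analyze" +
                  "?api-version=2024-02-29-preview&features=ocrHighResolution";
        var body = new JObject { ["urlSource"] = documentUrl };
        var response = await client.PostAsync(url,
            new StringContent(body.ToString(), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        // Poll the Operation-Location header until the analysis finishes
        var operationLocation = response.Headers.GetValues("Operation-Location").First();
        while (true)
        {
            await Task.Delay(TimeSpan.FromSeconds(2));
            var result = await client.GetStringAsync(operationLocation);
            var status = (string)JObject.Parse(result)["status"];
            if (status == "succeeded") return result; // save this as <file>.pdf.ocr.json
            if (status == "failed") throw new Exception("Layout analysis failed.");
        }
    }
}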

Key Attributes:

  • Pages: An array where each entry corresponds to a page in the PDF, detailing its dimensions and orientation.
  • Paragraphs: Blocks of text identified within the document, each accompanied by its content, bounding regions, and, if applicable, a designated role (e.g., title, section heading).
  • Words: Individual words extracted from the document, each with its content and confidence score.
  • Selection Marks: Detected checkboxes or similar elements, indicating their state (selected or unselected) and positional data.

{
  "pages": [
    {
      "pageNumber": 1,
      "width": 8.5,
      "height": 11,
      "unit": "inch",
      "paragraphs": [
        {
          "content": "Introduction to Azure AI Document Intelligence",
          "boundingRegions": [
            {
              "pageNumber": 1,
              "polygon": [0.1, 0.1, 0.9, 0.1, 0.9, 0.2, 0.1, 0.2]
            }
          ],
          "role": "title"
        }
      ],
      "words": [
        {
          "content": "Introduction",
          "boundingBox": [0.1, 0.1, 0.3, 0.1, 0.3, 0.15, 0.1, 0.15],
          "confidence": 0.99
        }
      ],
      "selectionMarks": [
        {
          "state": "selected",
          "boundingBox": [0.1, 0.3, 0.2, 0.3, 0.2, 0.4, 0.1, 0.4],
          "confidence": 0.95
        }
      ]
    }
  ]
}
        

Purpose:

  • Serves as the foundation for labeling by mapping the extracted text to its physical location in the document.
  • Enables visualization of text elements (e.g., yellow background for text, pink borders for selection marks) in Labelling UX.

.pdf.labels.json

The .pdf.labels.json file is integral to Azure's Document Intelligence, encapsulating labeled data for a specific document. This file contains the bounding boxes of identified fields along with their associated metadata, linking them to specific regions within the document. It is used in conjunction with .pdf.ocr.json to visualize and refine labeled data, forming the foundation for effective model training in LabellingUX.

Steps to Generate .pdf.labels.json

  • Call the Document Intelligence Analysis API: Use the POST Analyze API to extract the content of the document. The API returns a JSON response with detected elements, including bounding polygons for text, tables, and other structures.

{your-resource-endpoint}.cognitiveservices.azure.com/documentintelligence/documentModels/{model-name}:analyze?api-version=2024-02-29-preview&stringIndexType=utf16CodeUnit
        

The output will include polygon coordinates for each extracted field.


Automating Labels with the Document Analysis API

  • Convert Polygon Coordinates to Bounding Boxes: Use the dimensions of the document's pages (width and height) to normalize the coordinates. This ensures compatibility with LabellingUX, where the bounding boxes are scaled relative to the document dimensions.

using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class LabelsJsonGenerator
{
    public static List<List<string>> GetBoundingBoxes(string word, int pageNumber, JObject analysisResult)
    {
        var boundingBoxList = new List<List<string>>();

        // Get the page object by pageNumber
        var page = analysisResult["analyzeResult"]["pages"]
            .FirstOrDefault(p => (int)p["pageNumber"] == pageNumber);

        if (page == null)
        {
            return boundingBoxList; // page not found: return the empty list
        }

        // Find the word object on the specified page
        var wordObject = page["words"]
            .FirstOrDefault(w => w["content"]?.ToString() == word);

        if (wordObject == null)
        {
            return boundingBoxList; // word not found on this page
        }

        // Extract polygon and dimensions
        var polygon = wordObject["polygon"]?.ToObject<List<double>>();
        if (polygon == null || !polygon.Any())
        {
            return boundingBoxList; // no polygon data available
        }

        double width = page["width"]?.ToObject<double>() ?? 1;
        double height = page["height"]?.ToObject<double>() ?? 1;

        // Convert polygon coordinates to bounding box
        var normalizedBoundingBox = new List<string>();
        for (int i = 0; i < polygon.Count; i += 2)
        {
            double normalizedX = polygon[i] / width;
            double normalizedY = polygon[i + 1] / height;

            normalizedBoundingBox.Add(normalizedX.ToString("F6", CultureInfo.InvariantCulture)); // six decimal places; invariant culture keeps the "." separator JSON expects
            normalizedBoundingBox.Add(normalizedY.ToString("F6", CultureInfo.InvariantCulture));
        }

        boundingBoxList.Add(normalizedBoundingBox);
        return boundingBoxList;
    }
}
        
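
For illustration, here is how the helper above might be invoked against a saved analysis response; the file name and the word being looked up are hypothetical.

using System;
using System.IO;
using Newtonsoft.Json.Linq;

public static class BoundingBoxDemo
{
    public static void Main()
    {
        // Hypothetical file: the full operation result (including "analyzeResult") saved to disk
        var analysisResult = JObject.Parse(File.ReadAllText("invoice.pdf.analysis.json"));

        // Normalized bounding box(es) for the word "Machine" on page 1
        var boxes = LabelsJsonGenerator.GetBoundingBoxes("Machine", 1, analysisResult);
        foreach (var box in boxes)
        {
            Console.WriteLine(string.Join(", ", box));
        }
    }
}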


labels.json

An example of this conversion follows:

Analysis Result Snippet (Input)

{
  "pages": [
    {
      "pageNumber": 1,
      "width": 8.5,
      "height": 11,
      "words": [
        {
          "content": "Machine",
          "polygon": [100.0, 200.0, 200.0, 200.0, 200.0, 250.0, 100.0, 250.0]
        }
      ]
    }
  ]
}
        

Converted Bounding Boxes (Output in .pdf.labels.json)

{
  "labels": [
    {
      "label": "Title",
      "value": [
        {
          "boundingBoxes": [
            [0.117647, 0.181818, 0.235294, 0.181818, 0.235294, 0.227273, 0.117647, 0.227273]
          ],
          "page": 1,
          "text": "Machine"
        }
      ],
      "labelType": "Words"
    }
  ]
}
        

  • Save the .pdf.labels.json File: Save the file using the naming convention <original file name>.labels.json. It should reside in the same directory as the document, the .ocr.json file, and fields.json (see the sketch below).
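
Tying these steps together, the following sketch assembles the structure shown above and saves it with that naming convention; the label name, text, and file path are illustrative.

using Newtonsoft.Json;
using System.Collections.Generic;
using System.IO;

public static class LabelsFileWriter
{
    public static void Save(string documentPath, string labelName, string text,
                            int page, List<List<string>> boundingBoxes)
    {
        // Mirror the labels.json shape shown in the output example above
        var labelsFile = new
        {
            labels = new[]
            {
                new
                {
                    label = labelName,
                    value = new[] { new { boundingBoxes, page, text } },
                    labelType = "Words"
                }
            }
        };

        // e.g. "invoice.pdf" becomes "invoice.pdf.labels.json", stored next to the document
        File.WriteAllText(documentPath + ".labels.json",
            JsonConvert.SerializeObject(labelsFile, Formatting.Indented));
    }
}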


Managing and Uploading Labeled Data: Using the Document Intelligence API to Train the Model

Once the required files (fields.json, .pdf.ocr.json, and .pdf.labels.json) are created, they can be uploaded to Azure Blob Storage. A Shared Access Signature (SAS) URL is then generated for the blob container, which is used to train the model.

Training a Custom Model

To train a custom model, use the following POST API call:

Endpoint:

{{documentIntelligenceEndpoint}}documentintelligence/documentModels:build?api-version=2024-02-29-preview

Payload Example:

{
    "modelId": "model_nextVersion",
    "description": "Description of the model",
    "buildMode": "neural",
    "azureBlobSource": {
        "containerUrl": "{{blobSASURL}}",
        "prefix": ""
    }
}        

Steps:

  • Submit the API request with the SAS URL pointing to the Azure Blob container that holds the training files.
  • The response includes an Operation-Location header of the form:

{{documentIntelligenceEndpoint}}documentintelligence/operations/{operationId}?api-version=2024-02-29-preview

  • Use the Operation-Location URL to check the training status by making a GET request, as sketched below.
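
Under the same placeholder assumptions as the earlier analyze sketch (endpoint and key supplied by you), the build-and-poll flow might look like this:

using System;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class ModelTrainer
{
    public static async Task<string> BuildAsync(string endpoint, string apiKey, string blobSasUrl)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        // Same payload as the example above; the SAS URL points at the labeled training files
        var payload = new JObject
        {
            ["modelId"] = "model_nextVersion",
            ["description"] = "Description of the model",
            ["buildMode"] = "neural",
            ["azureBlobSource"] = new JObject { ["containerUrl"] = blobSasUrl, ["prefix"] = "" }
        };

        // Kick off training; the response carries an Operation-Location header
        var response = await client.PostAsync(
            $"{endpoint}/documentintelligence/documentModels:build?api-version=2024-02-29-preview",
            new StringContent(payload.ToString(), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        // Poll until the operation leaves the "running" state
        var operationLocation = response.Headers.GetValues("Operation-Location").First();
        string status;
        do
        {
            await Task.Delay(TimeSpan.FromSeconds(10));
            status = (string)JObject.Parse(await client.GetStringAsync(operationLocation))["status"];
        } while (status == "running" || status == "notStarted");

        return status; // "succeeded" means the model is ready for use
    }
}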

Training Status:

  • running: Indicates the model is still being trained.
  • succeeded: Indicates the model has been successfully trained and is ready for use.

Validating the Updated Model

Once the training is complete, the model can be validated against documents to ensure it correctly extracts the intended fields.
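
One lightweight way to validate is to analyze a sample document with the newly trained model and check that every expected field came back with a reasonable confidence. The sketch below assumes you already have the completed analysis JSON, obtained with the same submit-and-poll pattern shown earlier but using your custom model name instead of prebuilt-layout.

using System;
using Newtonsoft.Json.Linq;

public static class ModelValidator
{
    // analysisJson: the completed analyze result returned for the custom model
    public static void CheckFields(string analysisJson, params string[] expectedFields)
    {
        var fields = JObject.Parse(analysisJson)["analyzeResult"]?["documents"]?[0]?["fields"];

        foreach (var name in expectedFields)
        {
            var field = fields?[name];
            if (field == null)
            {
                Console.WriteLine($"{name}: MISSING");
                continue;
            }

            double confidence = field["confidence"]?.ToObject<double>() ?? 0;
            Console.WriteLine($"{name}: '{field["content"]}' (confidence {confidence:F2})");
        }
    }
}

For example, CheckFields(resultJson, "InvoiceNumber", "DueDate", "TotalAmount") would flag any of the three schema fields that the updated model failed to extract.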


Securing LabellingUX: Using MSAL.js for Secure Authentication

Securing your LabellingUX application with MSAL.js involves integrating the Microsoft Authentication Library to manage user authentication seamlessly. Here's how to implement this:

  • Install MSAL Packages: Begin by installing the necessary MSAL packages. Execute the following command in your project directory:

npm install @azure/msal-browser @azure/msal-react        

  • Configure MSAL: Create a configuration file, typically named authConfig.js, and populate it with your Azure Active Directory (AAD) credentials:

export const msalConfig = {
  auth: {
    clientId: 'YOUR_CLIENT_ID',
    authority: 'https://login.microsoftonline.com/YOUR_TENANT_ID',
    redirectUri: 'https://localhost:3000',
  },
};

export const loginRequest = {
  scopes: ['user.read'],
};        

Replace 'YOUR_CLIENT_ID' and 'YOUR_TENANT_ID' with your actual Azure AD credentials.

  • Initialize MSAL in Your Application: In your index.tsx, import the necessary MSAL components, create a PublicClientApplication instance from msalConfig, and wrap your application with the MsalProvider:

import * as React from "react";
import * as ReactDOM from "react-dom";
import { Provider } from "react-redux";
import { ConnectedRouter } from "connected-react-router";
import { history } from "./store/browserHistory";
import App from "./app";
import { store } from "store";
import { initializeIcons } from "@fluentui/react/lib/Icons";
import { PublicClientApplication } from '@azure/msal-browser';
import { MsalProvider } from '@azure/msal-react';
import { msalConfig } from './authConfig';

import "./index.scss";

// Create the MSAL instance once, outside the React tree
const msalInstance = new PublicClientApplication(msalConfig);

initializeIcons();
ReactDOM.render(
    <MsalProvider instance={msalInstance}>
        <Provider store={store}>
            <ConnectedRouter history={history}>
                <App />
            </ConnectedRouter>
        </Provider>
    </MsalProvider>,
    document.getElementById("root")
);

For a comprehensive understanding and additional examples, refer to the official Microsoft documentation.


