Supplementing Invoice Extraction with Generative AI Technology
Most template-based data extraction technologies use a pattern-matching or rule-based approach that simply matches a predefined pattern against the information in the text. For example, to extract numerical data from a text document, you would write a regular expression that matches all numbers in the text and then build a template around it for extracting numerical data.
These templates have a one-to-one relation with the file layout, allowing you to extract output with 100% accuracy for documents sharing the same layout. This means that a template designed to extract numerical data strictly with two decimal places can extract 2.01 from any document, but it will not extract 2.001 or “two point zero one”.
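For illustration, here is a minimal sketch of this idea in Python; the regular expression and sample text are our own assumptions rather than the pattern any particular tool ships with:

```python
import re

# Illustrative pattern: match amounts with exactly two decimal places.
# Word boundaries keep it from matching inside longer numbers like 2.001.
TWO_DECIMAL_AMOUNT = re.compile(r"\b\d+\.\d{2}\b")

text = "Subtotal: 2.01  Tax: 0.16  Note: 2.001 and 'two point zero one'"
print(TWO_DECIMAL_AMOUNT.findall(text))  # ['2.01', '0.16']
```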
Applications that use this method, with advanced capabilities to define multiple flexible patterns, make for an excellent data extraction solution, but only when the incoming unstructured reports have a consistent layout. In such cases, you can extract information from thousands of form-like documents, such as invoices, using a single template supplemented with automation logic.
However, pattern-matching template creation is a tedious and time-consuming process that involves repetitive steps. Even for invoices, which typically have a simple layout, creating a template takes a minimum of around 10 minutes, since we need to define a pattern and identify the relevant data points. This effort seems trivial when dealing with a few templates, but it is neither efficient nor practical for large-scale data extraction automation.
Imagine having to handle 100 different layouts. You would have to sit through at least 1,000 minutes to create 100 templates. A template would have to be created to counter every single variation and added to the automation cluster, increasing manual effort as well as system load. This is where generative AI comes into play: its ability to learn and adapt to diverse document layouts reduces the effort from 1,000 minutes per 100 layouts to mere seconds.
Role of Generative AI in Data Extraction
Generative AI draws inspiration from evolutionary computation models, and its impact has been driven by the convergence of two key factors: the availability of big data and the exponential growth in computing power, notably GPUs. Together, these factors have accelerated the progress of deep neural networks (DNNs) in the complex domains of natural language processing and unstructured data management.
One of the most significant use cases for generative AI is entity extraction from a corpus of unstructured text. By first transforming the data into a structured format, generative AI models can be used to extract data from unstructured file types. Natural language processing (NLP) methods, including tokenization, speech-to-text conversion, and part-of-speech tagging, can be used for this. Once the data is formatted, a generative AI model can process it to extract the needed information. Generative models such as transformers learn to recognize patterns and correlations in the data by being trained on massive text and code datasets. This lets them extract data from unstructured sources that would be challenging or impossible to retrieve using more conventional techniques.
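As a toy illustration of this preprocessing step, the snippet below runs tokenization and part-of-speech tagging with spaCy; the sample sentence is invented, and we assume the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm):

```python
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Invoice INV-1042 dated 12/05/2020 totals 2,350.00 USD.")
for token in doc:
    # Each token carries its part-of-speech tag and a numeric-likeness flag.
    print(token.text, token.pos_, token.like_num)
```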
Many large language models (LLMs), such as GPT, BERT, and OPT, are built on the transformer architecture. The transformer is a type of deep learning model that uses an “attention” mechanism to improve the performance of sequence-to-sequence data processing. This attention mechanism works in two ways: self-attention and multi-head attention. Self-attention captures the importance of each word relative to every other word in the input sequence, while multi-head attention helps the model learn various types of attention patterns by allocating each “head” to a specific relationship.
Consider extracting key-value pairs from invoices: self-attention helps the model understand that a date mentioned earlier in the document is relevant when extracting information about a corresponding payment. With multi-head attention, one head of the model might pay more attention to numerical values (for extracting amounts), while another might focus on textual patterns (for extracting vendor names). The outputs of the attention heads are concatenated to provide a richer representation of the context, which is ideal in the case of table detection.
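The sketch below shows the core computation behind self-attention (scaled dot-product attention) using NumPy and random toy data. In a real transformer, the query, key, and value matrices come from learned projections, and multi-head attention runs several of these in parallel before concatenating the results:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each token's query scores every token's key; the resulting
    weights mix the value vectors into a context-aware output."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy example: 4 "tokens" with 8-dimensional embeddings (random data
# purely for illustration; real Q, K, V come from learned projections).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))  # 4x4 matrix: how strongly each token attends to the others
```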
Challenges with Data Extraction Templates
Eliminating the manual aspect has been a persistent challenge for our data extraction solution, Astera ReportMiner. Over the years, Astera has added semi-automatic features to support template generation, including Line Indicators and Auto Create Fields.
To eliminate as much of the manual effort as possible, we also experimented with combining computer vision techniques with natural language processing, atop a heuristics layer, to automate layout generation for simpler invoices and purchase orders. This feature, called Auto Generate Layout (AGL), used multiple open-source Python libraries, such as tabula-py for table detection and spaCy for key-value pair extraction, to automatically construct the layout. However, it came with its own set of limitations that made it difficult to scale the algorithm to more complex use cases while maintaining speed and accuracy.
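To give a feel for what such a pipeline looks like, here is a rough sketch combining tabula-py and spaCy; the file names, regex heuristic, and overall structure are illustrative assumptions, not AGL's actual implementation:

```python
import re
import tabula  # pip install tabula-py (requires a Java runtime)
import spacy

# Table detection: tabula-py returns one pandas DataFrame per table found.
tables = tabula.read_pdf("invoice.pdf", pages="all")

# Key-value pairs: a simple heuristic over a plain-text rendering of the
# same invoice, assuming "Key: Value" pairs each sit on their own line.
nlp = spacy.load("en_core_web_sm")
text = open("invoice.txt", encoding="utf-8").read()
pairs = dict(re.findall(r"^\s*([A-Za-z #]+):\s*(.+)$", text, flags=re.M))

# Named entities (dates, organizations, money) can refine the raw pairs.
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]
print(len(tables), pairs, entities)
```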
Astera’s Attempt at GenIE – Generative Information Extraction
The venture with generative AI began early this year and resulted in the development of AI-recommended templates. This breakthrough solved our long-standing limitation of extracting specific, relevant data points from documents with continuously changing layouts. The feature allows us to convert unstructured information into a structured table with a single trigger, eliminating 99% of the manual effort in the post-processing stage. Moreover, the GPT-integrated technology significantly outperforms all traditional and handcrafted features in terms of accuracy, capability, and processing speed.
Example Use Case
To start off, let’s focus on the most common unstructured form-like document: the invoice. The data regions in this document can be categorized into two components: key-value pairs and line items.
These are two diverse sets of textual artifacts, each with its own properties, design, and purpose. Key-value pairs usually come as an exhaustive set of triplets (key, relation, value) that have consistent positional coordinates relative to each other. For example, a value can be found either on the same line as the key or on the next line after a separator, which is usually a colon in the case of invoices.
On the other hand, line items are unstructured, and sometimes unformatted, tabular data that may or may not have graphical borders. Unlike database tables, line items lack consistency in the number of rows and columns and the ordering of elements, and they may contain hierarchical relationships. Traditionally, detecting tables in a text file has been a complex problem due to variations in layout, together with the inability of models to “classify” a structure as tabular based on its content alone. Computer vision algorithms such as the Hough transform have been relatively successful in detecting graphical borders, but challenges persist for tables without borders.
Therefore, extracting these varying artifacts of information calls for a deep learning method specific to each type, along with logic that isolates their regions for a sophisticated extraction process. This analysis forms the basis of the data extraction strategy behind AI-recommended templates.
AI-Recommended Templates – Architecture Explained
The process starts by making consecutive real-time API calls to GPT-4, similar to a sequence-to-sequence question-answering task. A single API call includes the tokenized invoice document and the system-defined instructions. The system prompt contains instructions to establish context-aware extraction, along with an output parser. The user prompt receives the text chunks from the source file. The extraction essentially works on the principles of semantic matching and the attention mechanism of generative pre-trained models.
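A minimal sketch of this call pattern, using the OpenAI Python client, might look as follows; the prompt text and parameters shown here are illustrative assumptions, not the actual prompts behind AI-recommended templates:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System prompt: establishes the extraction context and output format.
system_prompt = (
    "You are an invoice-extraction engine. Return the line items and "
    "key-value pairs you find as a single JSON object."
)

def extract_chunk(chunk: str) -> str:
    """One real-time call per text chunk from the source file."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": chunk},
        ],
        temperature=0,  # deterministic output suits extraction tasks
    )
    return response.choices[0].message.content

# Example: print(extract_chunk("Invoice #1042  Date: 12/05/2020 ..."))
```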
Line-Item Detection
Line items are the first textual artifact to be extracted from the invoice. The AI algorithm follows a step-by-step approach to table detection, mapping line items to a tabular format.
Tokenization and positional encodings:
The invoice document is tokenized into words and subword units, and positional encodings are added to capture the position of each token within the document. This step is crucial for preserving the table structure and word order.
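As a reference point, the snippet below computes the classic sinusoidal positional encodings from the original transformer paper; production models may use learned or rotary variants instead, but the idea of giving each token a position-dependent signature is the same:

```python
import numpy as np

def sinusoidal_positional_encoding(num_tokens: int, d_model: int) -> np.ndarray:
    """Sine/cosine encodings: even dimensions get sin, odd get cos,
    at wavelengths that grow geometrically across the embedding."""
    positions = np.arange(num_tokens)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    enc = np.zeros((num_tokens, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

# These vectors are simply added to the token embeddings.
print(sinusoidal_positional_encoding(num_tokens=5, d_model=8).round(2))
```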
Self-attention for contextual understanding:
Self-attention helps the model understand relationships between tokens within the table, which is essential for identifying rows, columns, and cell values. To recognize the numerical values in a single column, such as Price, it allows those tokens to attend to each other more strongly than to tokens in other fields. Similarly, tokens in the same row can attend to each other, helping the model identify records.
Multi-head attention for adaptability:
Multi-head attention plays a vital role in adapting to different table layouts and content. Each head focuses on a different aspect of the table, such as recognizing headers, identifying numerical values, or handling string labels. The outputs are combined to provide a comprehensive view of the table structure and values for the extractor model.
Post-processing and table recognition:
After applying self-attention and multi-head attention, the model can recognize the table structure, including headers, rows, and columns. Post-processing steps can involve identifying specific headers (e.g., “Item,” “Material,” “Price”) or multi-level headers and extracting cell values. An output layer predicts the tabular data, including cell values.
The information extracted and parsed from the text - the values and associated metadata of the tables - is then exported in JSON format.
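For illustration, the exported JSON might look like the following; the exact schema is an assumption on our part:

```json
{
  "table": {
    "headers": ["Item", "Material", "Price"],
    "rows": [
      {"Item": "Bracket", "Material": "Steel", "Price": "2.01"},
      {"Item": "Washer", "Material": "Brass", "Price": "0.16"}
    ]
  }
}
```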
Identifying Key-Value Pairs
The second iteration of this algorithm is designed to identify and extract key-value pairs.
Preprocessing:
The recognized tables are whitewashed (blanked out) so that only the remaining text, including the key-value pairs, is left in the file. This text is tokenized into words and subword units, and positional encodings are added to capture the position of each key and value.
Self-attention and multi-head attention:
The model processes the tokenized document through multiple layers of self-attention and multi-head attention. During this process, it calculates the relevance of each word to every other word through the similarity of the query with each key. With self-attention, the model can understand that a “Date” mentioned earlier in the invoice is relevant when extracting the value of “Sales Order#”. With multi-head attention, one head might pay more attention to numerical values (Phone, Invoice Number), while another might focus on language patterns such as separators and hyphens.
Output layer:
After the attention layers, an output layer uses everything it has learned about document patterns to extract keys (e.g., “Date”) and their values (e.g., 12/05/2020). Semantic matching is then performed to verify whether the extracted text is a “reasonable” and “meaningful” value for the associated key.
If key-value pairs exist in the form of sentences, for example, “Contact Person is Ayesha,” the engine will break the sentence down, tokenize it, compute weights for query and key, and identify that “Contact Person” is the key and that “Ayesha” should therefore be its value. This is possible because of the part-of-speech tagging capability of natural language models.
In most cases, however, key-value pairs do not follow a sentence structure.
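As a concrete illustration of the verification step described above, the sketch below checks whether an extracted value is plausible for its key; the rules are illustrative assumptions, not the engine's actual logic:

```python
from datetime import datetime

def is_reasonable(key: str, value: str) -> bool:
    """Toy semantic-matching check: is this value plausible for this key?"""
    key = key.lower()
    if "date" in key:
        # A date key should hold a parseable date in a common format.
        for fmt in ("%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"):
            try:
                datetime.strptime(value, fmt)
                return True
            except ValueError:
                pass
        return False
    if "number" in key or key.endswith("#"):
        return any(ch.isdigit() for ch in value)
    return bool(value.strip())

print(is_reasonable("Date", "12/05/2020"))          # True
print(is_reasonable("Invoice Number", "INV-1042"))  # True
print(is_reasonable("Date", "Ayesha"))              # False
```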
Reverse Engineering
The final layout is displayed in the report model designer, with an option to edit the layout, though you may never need to open it. For this, we take the JSON containing the line-item information and the JSON containing the key-value-pair information, reverse engineer them to create their respective regions, and display both in a hierarchical schema inside the ReportMiner designer.
Field Verification Check
A field verification check is conducted by default at the end of the template-generation process to ensure the quality of the extracted output. If any field in an AI-recommended template fails this verification check, the template is dropped into the “Erroneous Report Model” folder with a warning flag to inform you that manual intervention is needed.
Future Prospects
Using APIs to extract data from documents, while effective and cost-efficient, has its own set of drawbacks, such as latency, limited control, network dependency, and risks to data privacy and security, among others. In contrast, an in-house fine-tuned large language model would allow for more control and customization while mitigating the potential risk of a data breach or security crisis. For extracting data from form-like documents, a better approach is to fine-tune an LLM so that it understands the layout and extracts relevant key-value pairs and line items with context awareness and semantic parsing.
Conclusion
Traditional methods of unstructured data extraction, such as pattern matching, have gone through decades of refinement to achieve credible results, but they lack efficiency, speed, scalability, and flexibility. Technologies have been upgraded to cater to speed and scalability to some extent through process automation and advanced pattern logic. However, the real revolution came with advances in generative artificial intelligence, whose methods are known for their strength in dynamic natural language processing.
The significance of generative AI, and large language models (LLMs) in particular, lies in its ability to efficiently process high volumes of form-like documents irrespective of format, layout, and patterns. It can decipher and extract complex textual artifacts such as key-value pairs and tabular data with ease and efficiency. With the aid of transformers and their “attention” mechanism, generative AI goes beyond traditional approaches, eliminating the need for handcrafted templates and robotic process automation.
Area of investigation: generative AI, transformers, attention mechanism, large language models, intelligent document processing, form-like documents, key-value pairs, table detection, table extraction