From pixels to information with Document AI
We’re seeing increasingly difficult problems being solved thanks to machine learning (ML) models. For example, Natural Language AI and Vision AI extract insights from text and images, with human-like results. They solve problems central to the way we communicate:
What's next? Well, it's already here, with Document AI, and keeps growing:
Document AI builds on these foundations to let you extract information from documents, in many forms:
In this article, you’ll see the following:
Processing documents
Processor types
There are as many document types as you can imagine, so, to meet all these needs, document processors are at the heart of Document AI. They can be of the following types:
Processor locations
When you create a processor, you specify its location. This helps control where the documents will be processed. Here are the current multi-region locations:
In addition, some processors are available in single-region locations, which lets you address local regulatory requirements. If you regularly process documents in real time or in large batches, this can also help you get responses with even lower latency. As an example, here are the current locations for the "Document OCR" general processor:
Note: API endpoints are named according to the convention {location}-documentai.googleapis.com.
For more information, check out Regional and multi-regional support.
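As an illustration, here is a minimal sketch showing how you might target a specific location with the Python client library (the project and processor IDs are placeholders, not real values):

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholder values for illustration
project_id = "PROJECT_ID"
location = "eu"  # multi-region ("us", "eu") or a single region (e.g. "europe-west1") where available
processor_id = "PROCESSOR_ID"

# The client endpoint must match the processor location
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)
processor_name = client.processor_path(project_id, location, processor_id)
```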
Processor versions
Processors can evolve over time, to offer more precise results, new features, or fixes. For example, here are the current versions available for the "Expense Parser" specialized processor:
You may typically do the following:
To stay updated, follow the Release notes.
Interfaces
Document AI is available to developers and practitioners through the usual interfaces:
Requests
There are two ways you can process documents:
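As a sketch of the first option, here is what a minimal online (synchronous) request can look like with the Python client library; the identifiers and file name are placeholders, and batch processing would instead use batch_process_documents with documents stored in Cloud Storage:

```python
from google.cloud import documentai

# Placeholder identifiers
PROJECT_ID, LOCATION, PROCESSOR_ID = "PROJECT_ID", "us", "PROCESSOR_ID"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Send the document bytes directly in the request (online processing)
with open("document.png", "rb") as image_file:
    raw_document = documentai.RawDocument(content=image_file.read(), mime_type="image/png")

request = documentai.ProcessRequest(name=name, raw_document=raw_document)
document = client.process_document(request=request).document
print(document.text)
```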
Extracting information from pixels
Let's start by analyzing this simple screenshot with the "Document OCR" general processor, and check how pixels become structured data.
Input image
Document AI response
The examples in this article show snake-case field names (like mime_type), following the convention used by gRPC and programming languages like Python. For camel-case environments (like REST/JSON), there is a direct mapping between the two conventions:
Text levels
For each page, four levels of text detection are returned:
In these lists, each item exposes a layout including the item's position relative to the page in bounding_poly.normalized_vertices.
This lets us, for example, highlight the 22 detected tokens:
Here is the last token:
Note: Float values are presented truncated for the sake of readability.
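Note that text is not duplicated at each level: every layout holds a text_anchor with segments pointing into the full document.text. Here is a minimal sketch (assuming the document object returned as shown earlier) to list the tokens of the first page with their normalized bounding boxes:

```python
# "document" is the Document returned by process_document (see the earlier sketch)

def layout_text(layout, text: str) -> str:
    """Rebuilds the text of a layout element from its text_anchor segments."""
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )

for token in document.pages[0].tokens:
    vertices = token.layout.bounding_poly.normalized_vertices
    print(repr(layout_text(token.layout, document.text)), [(v.x, v.y) for v in vertices])
```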
Language support
Documents are often written in a single language, but some use multiple languages. You can retrieve the detected languages at different text levels.
In our previous example, two blocks are detected (pages[0].blocks[]). Let's highlight them:
The left block is a mix of German, French, and English, while the right block is English only. Here is how the three languages are reported at the page level:
Note: At this level, the language confidence ratios roughly correspond to the proportion of text detected in each language.
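For instance, here is a minimal way to list the page-level languages (a sketch reusing the document object from earlier):

```python
for language in document.pages[0].detected_languages:
    print(f"{language.language_code}: {language.confidence:.0%}")
```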
Now, let's highlight the five detected lines (pages[0].lines[]):
Each language is also reported at the line level:
If needed, you can get language info at the token level too. "Question" is spelled the same in French and in English, and is appropriately returned as an English token in this context:
In the screenshot, did you notice something peculiar in the left block?
Well, punctuation rules can differ between languages. French uses a typographical space before double punctuation marks ("double" as in "written in two parts", such as "!", "?", …). Punctuation is an important part of a language that can get "lost in translation". Here, the space is preserved in the transcription of "Bienvenue !". Nice touch, Document AI, or should I say, touché!
For more information, see Language support.
Handwriting detection
Now — a much harder problem — let's check how handwriting is handled.
This example is a mix of printed and handwritten text, where I wrote both a question and an answer. Here are the detected tokens:
I am pleasantly surprised to see my own handwriting transcribed:
I asked my family (who are used to writing in French) to do the same. Each unique handwriting sample also gets correctly detected:
… and transcribed
This can look magical, but that's one of the goals of ML models: to return results as close as possible to human responses.
Confidence scores
We make mistakes, and so can ML models. To better appreciate the structured data you get, results include confidence scores:
Let's overlay confidence scores on top of the previous example:
After grouping them into buckets, confidence scores appear to be generally high, with a few outliers:
The lowest confidence score here is 57%. It corresponds to a handwritten word (token) that is both short (giving less context for confidence) and, indeed, not particularly legible:
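As an illustration, you could flag low-confidence tokens for human review. Here is a sketch reusing the layout_text helper from earlier, with an arbitrary threshold:

```python
CONFIDENCE_THRESHOLD = 0.7  # arbitrary value for illustration

for page in document.pages:
    for token in page.tokens:
        if token.layout.confidence < CONFIDENCE_THRESHOLD:
            # Candidate for human review
            print(f"{token.layout.confidence:.0%} {layout_text(token.layout, document.text)!r}")
```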
For best results, keep in mind these general rules of thumb:
Although all text is correctly transcribed in the presented examples, this won't always be the case depending on the input document. To build safer solutions — especially with critical business applications — you may consider the following:
To learn more about AI principles and best practices, check out?Responsible AI practices.
Rotation, skew, distortion
How many times have you scanned a document upside down by mistake? Well, this shouldn't be a concern anymore. Text detection is very robust to rotation, skew, and other distortions.
In this example, the webcam input is not only upside down but also skewed, blurry, and with text in unusual orientations:
Before further analysis, Document AI considers the best reading orientation, at the page level, and preprocesses (deskews) each page if needed. This gives you results that can be used and visualized in a more natural way. Once processed by Document AI, the preceding example gets easier to read, without straining your neck:
In the results, each page has an image field by default. This represents the image — deskewed if needed — used by Document AI to extract information. All the page results and coordinates are relative to this image. When a page has been deskewed, a transforms element is present and contains the list of transformation matrices applied to the image:
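Here is a minimal sketch to check which pages were transformed (each matrix is stored in OpenCV convention, with its dimensions and raw data):

```python
for page in document.pages:
    for matrix in page.transforms:
        # rows/cols give the matrix shape; data holds the raw matrix bytes (OpenCV convention)
        print(f"page {page.page_number}: {matrix.rows}x{matrix.cols} matrix, {len(matrix.data)} bytes")
```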
Notes:
Orientation
Documents don't always have all of their text in a single orientation, in which case deskewing alone is not enough. In this example, the sentence is broken into four different orientations. Each part gets properly recognized and processed in its natural orientation:
Orientations are reported in the layout field:
Note: Orientations are returned at each OCR level (blocks, paragraphs, lines, and tokens).
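For example, here is a sketch listing the orientation of each block (reusing the layout_text helper and document object from earlier):

```python
for block in document.pages[0].blocks:
    # Orientation is an enum: PAGE_UP, PAGE_RIGHT, PAGE_DOWN, or PAGE_LEFT
    print(block.layout.orientation.name, repr(layout_text(block.layout, document.text)))
```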
Noise
When documents come from our analog world, you can expect … the unexpected. As ML models are trained from real-world samples — containing real-life noise — a very interesting outcome is that Document AI is also remarkably robust to noise.
In this example with crumpled paper, the text starts to be difficult to read but still gets correctly transcribed by the OCR model:
Documents can also be dirty or stained. Using the same sample, detection keeps working after adding a few layers of noise:
In both cases, the exact same text is correctly detected:
You've seen most core features. They are supported by the "Document OCR" general processor as well as the other processors, which leverage these features to focus on more specific document types and provide additional information. Let's check the next-level processor: the "Form Parser" processor.
Form fields
The "Form Parser" processor lets you detect form fields. A form field is the combination of a field name and a field value, also called a key-value pair.
In this example, printed and handwritten text is detected as seen before:
In addition, the form parser returns a list of form_fields:
Here is how the detected key-value pairs are returned:
And here are their detected bounding boxes:
Note: Form fields can follow flexible layouts. In this example, keys and values are in a left-to-right order. You'll see a right-to-left example next. These are just simple, arbitrary examples. It also works with vertical or free layouts where keys and values are logically (visually) related.
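Here is a sketch extracting the detected key-value pairs (reusing the layout_text helper from earlier; field_name and field_value are both layouts anchored in the document text):

```python
for page in document.pages:
    for field in page.form_fields:
        key = layout_text(field.field_name, document.text).strip()
        value = layout_text(field.field_value, document.text).strip()
        print(f"{key} -> {value}")
```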
Checkboxes
The form parser also detects checkboxes. A checkbox is actually a particular form field value.
This example is a French exam with statements that should be checked when true. To test this, I used checkboxes of different kinds, printed or hand-drawn. All form fields are detected, with the statements as field names and the corresponding checkboxes as field values:
When a checkbox is detected, the form field contains an additional value_type field, whose value is either unfilled_checkbox or filled_checkbox:
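A minimal sketch to list the checkbox states (again reusing the layout_text helper):

```python
for field in document.pages[0].form_fields:
    if field.value_type in ("filled_checkbox", "unfilled_checkbox"):
        statement = layout_text(field.field_name, document.text).strip()
        checked = field.value_type == "filled_checkbox"
        print("[x]" if checked else "[ ]", statement)
```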
Being able to analyze forms can lead to huge time savings, by consolidating — or even autoprocessing — content for you. The preceding checkbox detection example was actually an evolution of a prior experiment to autocorrect my wife's pile of exam copies. The proof of concept got better using checkboxes, but was already conclusive enough with True/False handwritten answers. Here is how it can autocorrect and autograde:
Tables
The form parser can also detect another important structural element: tables.
In this example, words are presented in a tabular layout without any borders. The form parser finds a table very close to the (hidden) layout. Here are the detected cells:
In this other example, some cells are filled with text while others are blank. There are enough signals for the form parser to detect a tabular structure:
When tables are detected, the form parser returns a list of?tables?with their rows and cells. Here is how the table is returned:
And here is the first cell:
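Here is a sketch rebuilding the detected tables row by row (reusing the layout_text helper from earlier):

```python
for page in document.pages:
    for table in page.tables:
        for row in list(table.header_rows) + list(table.body_rows):
            cells = [layout_text(cell.layout, document.text).strip() for cell in row.cells]
            print(cells)
```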
Specialized processors
Specialized processors focus on domain-specific documents and extract entities. They cover many different document types that can currently be classified in the following families:
For example, procurement processors typically detect the total_amount and currency entities:
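Entities are returned at the document level. Here is a minimal sketch listing them (assuming a document returned by a specialized processor):

```python
for entity in document.entities:
    print(f"{entity.type_:<20} {entity.confidence:.0%} {entity.mention_text!r}")
```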
For more information, check out Fields detected.
Expenses
The "Expense Parser" lets you process receipts of various types. Let's analyze this actual (French) receipt:
A few remarks:
Procurement documents often list parts of the data in tabular layouts. Here, they're returned as multiple line_item/* entities. When entities are detected as part of a hierarchical structure, the results are nested in the properties field of a parent entity, providing an additional level of information. Here's an excerpt:
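A sketch to walk through these nested entities (line_item is the parent entity type used by the expense parser):

```python
for entity in document.entities:
    if entity.type_ == "line_item":
        for prop in entity.properties:  # e.g. line_item/description, line_item/amount
            print(f"  {prop.type_}: {prop.mention_text!r}")
```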
For more information, see the Expense Parser details.
Entity normalization
Getting results is generally not enough. Results often need to be handled in a post-processing stage, which can be both time-consuming and a source of errors. To address this, specialized processors also return normalized values when possible. This lets you directly use standard values consolidated from the context of the whole document.
Let's check it with this other receipt:
First, the receipt currency is returned with its standard code under normalized_value:
Then, the receipt is dated 11/12/2022. But is it Nov. 12 or Dec. 11? Document AI uses the context of the document (a French receipt) and provides a normalized value that removes all ambiguity:
Likewise, the receipt contains a purchase time, written in a non-standard way. The result also includes a canonical value that avoids any interpretation:
Normalized values simplify the post-processing stage:
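Here is a sketch reading a few normalized values (the entity type names used here are those of the expense parser):

```python
for entity in document.entities:
    nv = entity.normalized_value
    if entity.type_ == "currency":
        print("currency:", nv.text)  # e.g. "EUR"
    elif entity.type_ == "receipt_date":
        d = nv.date_value
        print(f"date: {d.year:04}-{d.month:02}-{d.day:02}")  # unambiguous
    elif entity.type_ == "purchase_time":
        t = nv.datetime_value
        print(f"time: {t.hours:02}:{t.minutes:02}")
```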
For more information, check out the NormalizedValue structure.
Entity enrichment
Did you notice there was more information in the receipt?
Extracting the information behind the data requires extra knowledge or human investigation. An automated solution would generally ignore this partial data, but Document AI handles this in a unique way. To understand the world's information, Google has been consistently analyzing the web for over 20 years. The result is a gigantic up-to-date knowledge base called the Knowledge Graph. Document AI leverages this knowledge graph to normalize and enrich entities.
First, the supplier is correctly detected and normalized with its usual name:
Then, the supplier city is also returned:
Note: Joan of Arc spent some time in Orléans in 1429, as she led the liberation of the besieged city at the age of 17 (but that's another story).
And finally, the complete and canonical supplier address is also part of the results, closing our case here:
Enriched entities bring significant value:
Here is a recap of the expected — as well as the non-obvious — entities detected in the receipt:
Note: The slightest characteristic element can be sufficient to detect an entity. I've been regularly surprised to get entities I wasn't expecting, to eventually realize I had missed clues in the document. For example, I recently wondered how the expense processor was able to identify a specific store. The receipt only specified a zip code and the retail chain has several stores in my neighborhood. Well, a public phone number (hidden in the footer) was enough to uniquely identify the store in question and provide its full address.
To learn more about the Knowledge Graph and possible enrichments, check out Enrichment and normalization.
Invoices
Invoices are the most elaborate type of procurement document, often spanning multiple pages.
Here's an example showing entities extracted by the "Invoice Parser":
A few remarks:
For more information, see the Invoice Parser details.
Barcodes
Barcode detection is enabled for some processors. In this example, the invoice parser detects the barcode (a manifest number):
Note: I didn't have an invoice with barcodes at hand, so I used a (slightly anonymized) packing list.
Barcodes are returned, at the page level, like this:
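A minimal sketch to read them from the response:

```python
for page in document.pages:
    for detected_barcode in page.detected_barcodes:
        barcode = detected_barcode.barcode
        print(barcode.format_, barcode.value_format, repr(barcode.raw_value))
```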
For more information, check out the DetectedBarcode structure.
Identity documents
Identity processors let you extract identity entities. In this US passport specimen (credit: Bureau of Consular Affairs), the expected fields can be automatically extracted:
For more details about identity processors, you can read my previous article Automate identity document processing.
Document signals
Some processors return document signals — information relative to the document itself.
For example, the "Document OCR" processor returns quality scores for the document, estimating the defects that might impact the accuracy of the results.
The preceding crumpled paper example gets a high quality score of 95%, with glare as a potential defect:
The same example, at a 4x lower resolution, gets a lower quality score of 53%, with blurriness detected as the main potential issue:
Some additional remarks:
Here's how the image quality scores are returned at the page level:
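Here is a sketch reading these scores (depending on the processor version, image quality scores may need to be explicitly enabled in the process options):

```python
for page in document.pages:
    scores = page.image_quality_scores
    print(f"page {page.page_number} quality: {scores.quality_score:.0%}")
    for defect in scores.detected_defects:
        print(f"  {defect.type_}: {defect.confidence:.0%}")
```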
Likewise, the "Identity Proofing" processor gives you signals about the validity of ID documents.
First, this can help detect whether documents look like IDs. The preceding example is a random document analyzed as NOT_AN_ID:
Here are the corresponding entities:
The preceding passport example does get detected as an ID but also triggers useful fraud signals:
The results include evidence that 1) it's a specimen and 2) it can be found online:
As new use cases appear, it's likely that some processors will extract new document signals. Follow the Release notes to stay updated.
Pre-trained processors
Document AI is already huge and keeps evolving. Here is a screencast showing the current processor gallery and how to create a processor from the Cloud Console:
Custom processors with Workbench
If you have your own business-specific documents, you may wish to extract custom entities not covered by the existing processors. Document AI Workbench can help you solve this by creating your own custom processor, trained with your own documents, in two ways:
For more information, watch this great introduction video made by my teammate Holt:
Performance
To capture your documents, you'll probably use either:
With scanners, you'll need to choose a resolution in dots per inch (dpi). To optimize the performance (especially the accuracy and consistency of the results), you can keep in mind the following:
Here are the resolutions needed to capture a sheet of paper:
Note: Dimensions are presented in the horizontal orientation (image sensors generally have a landscape native resolution).
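As an illustration, here is a small calculation of the pixel dimensions needed for common paper sizes at a few typical scanner resolutions (the dpi values are arbitrary examples):

```python
PAPER_SIZES_IN = {"A4": (11.69, 8.27), "US Letter": (11.0, 8.5)}  # landscape (width, height) in inches

for name, (width_in, height_in) in PAPER_SIZES_IN.items():
    for dpi in (150, 200, 300):
        width_px, height_px = round(width_in * dpi), round(height_in * dpi)
        megapixels = width_px * height_px / 1e6
        print(f"{name} @ {dpi} dpi: {width_px} x {height_px} px ({megapixels:.1f} MP)")
```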
With cameras, the captured dots per inch depend on the camera resolution but also on the way you zoom on the document. Here's an example with an A4 sheet of paper:
This translates into different dpi ranges. Here are indicative values:
To give you an idea, here is how a PNG image's size evolves when capturing the same document at different resolutions. The total number of pixels grows with the area so, even though PNG compression slightly limits the growth, the file size increases almost quadratically:
A few general observations:
To fine-tune your solution and get more accurate, faster, or more consistent results, you may consider the following:
For example:
Sample demo
I put myself in your shoes to see what it takes to build a document processing prototype and made the following choices:
Here is the chosen software stack based on open-source Python projects:
… and one possible architecture to deploy it to production using Cloud Run:
Here is the core function I use to process a document live in Python:
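A minimal sketch of such a function, with illustrative names and defaults (not necessarily the demo's exact code):

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    content: bytes,
    mime_type: str,
) -> documentai.Document:
    """Processes a document synchronously and returns the structured Document."""
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    )
    name = client.processor_path(project_id, location, processor_id)
    raw_document = documentai.RawDocument(content=content, mime_type=mime_type)
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request).document
```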
To make sure that "what you see is what you get", the sample images and animations in this article have been generated by the demo:
Note: To avoid making repeated and unnecessary calls to Document AI while developing the app, sample document analyses are cached (JSON serialization). This lets you check the returned structured documents and also explains why responses are immediate in this (real-time, not sped up) screencast.
It takes seconds to analyze a document. Here is an example with the user uploading a PDF scan:
You can also take camera captures from the web app:
Check out the source code and feel free to reuse it. You'll also find instructions to deploy it as a serverless app.
More?
Originally published on GitHub