Custom Document Extractors with Google Document AI
GCP Document AI broadly has three categories of document extraction models: general document processors (Layout, Form and Doc OCR), specialized processors (invoices, tax forms, lending forms, contracts, etc.) and custom processors (custom classifiers, splitters and extractors). In this article, we will focus on the Custom Extractor feature, sometimes also called CDE (Custom Document Extractor). CDEs are suited to extracting data from images of specific business documents (forms unique to your organization). The processor identifies and extracts entities from documents; once trained, it can be used on new documents for data extraction.
The high-level steps for CDE creation and testing are covered one by one in this article.
Let's walk through these steps in the GCP console for better understanding.
Step [1] Enable the Document AI API in the console. Once it is enabled, go to the Custom Processors section and create a custom extractor, as highlighted below.
Step [2] Enter the name and other details. On the next screen, click Get Started under the Customize tile.
Step [3] Create the fields and declare their data types.
Step [4] Add documents from your local device or from Google Cloud Storage for training and testing. These are the documents used in the labelling (annotation) exercise.
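Before creating fields in the console, it can help to write the planned schema down and sanity-check it. The snippet below is only a local planning sketch in Python, not the Document AI schema format. The field names, the `Field` class, the `validate_schema` helper and the lists of allowed data types and occurrence values are all assumptions made for illustration, so check the console for the options actually available to you.

```python
from dataclasses import dataclass

# Assumption: these loosely mirror the choices shown in the console's field
# editor. They are illustrative labels, not official API enum values.
ALLOWED_TYPES = {"plain_text", "number", "money", "datetime", "address", "checkbox"}
ALLOWED_OCCURRENCES = {
    "required_once",
    "required_multiple",
    "optional_once",
    "optional_multiple",
}


@dataclass(frozen=True)
class Field:
    """One planned schema field (hypothetical local representation)."""
    name: str
    data_type: str
    occurrence: str = "optional_once"


def validate_schema(fields):
    """Return a list of problems found in a planned schema (empty if none)."""
    problems = []
    seen = set()
    for field in fields:
        if field.name in seen:
            problems.append(f"duplicate field name: {field.name}")
        seen.add(field.name)
        if field.data_type not in ALLOWED_TYPES:
            problems.append(f"{field.name}: unknown data type {field.data_type!r}")
        if field.occurrence not in ALLOWED_OCCURRENCES:
            problems.append(f"{field.name}: unknown occurrence {field.occurrence!r}")
    return problems


# Hypothetical fields for a form that carries a name, an SSN and a date.
schema = [
    Field("employee_name", "plain_text", "required_once"),
    Field("ssn", "plain_text", "required_once"),
    Field("signature_date", "datetime"),
]
print(validate_schema(schema))
```

Agreeing on names and types up front keeps the labelling exercise consistent across everyone who annotates documents.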
Step [5] Start labelling. Documents are auto-labelled by Generative AI. Confirm a document if you think the suggestions are correct. See an example below. Repeat these steps for all training and testing documents.
Review each field carefully, because the suggestions can be wrong. In the example, the SSN value is not read properly.
You can run into issues if the minimum criteria for creating a custom model are not met. A minimum of ten documents is needed for each of the training and test datasets. Once you have met all the prerequisites, you should be able to create the model.
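A quick pre-flight check of the labelled document counts can save a failed attempt at creating the model. This is a minimal Python sketch that uses the ten-document minimum stated above. The `check_dataset` helper and its messages are hypothetical and not part of any Google library.

```python
MIN_DOCS_PER_SPLIT = 10  # minimum per dataset, as stated in this article


def check_dataset(train_count, test_count, minimum=MIN_DOCS_PER_SPLIT):
    """Return (ready, messages) for the labelled training and test counts."""
    messages = []
    if train_count < minimum:
        messages.append(
            f"training set needs {minimum - train_count} more labelled document(s)"
        )
    if test_count < minimum:
        messages.append(
            f"test set needs {minimum - test_count} more labelled document(s)"
        )
    return (not messages, messages)


ready, messages = check_dataset(train_count=12, test_count=7)
print(ready)
for message in messages:
    print(message)
```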
Step [6] Once the annotation exercise is complete, go to the Train a custom model tile and select Create New Version.
Step [7] To check whether the model is available, go to the Deploy & Use tab and check its status.
Step [8] Once the model is available, run the evaluation to check its accuracy against the test document set. The metrics below were obtained from the out-of-the-box Generative AI based model (you can select your model from the top left corner).
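Evaluation results of this kind are usually read as precision, recall and F1 score per field. As a rough illustration of how those three numbers relate, here is a small Python sketch using the standard definitions. The `precision_recall_f1` helper and the example counts are made up for illustration and are not taken from the evaluation in this article.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from entity-level counts.

    true_pos:  entities extracted correctly
    false_pos: entities extracted but wrong or not present in the labels
    false_neg: labelled entities the model failed to extract
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts for a single field across a test set.
precision, recall, f1 = precision_recall_f1(true_pos=18, false_pos=2, false_neg=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A field with high precision but low recall is extracted correctly when found but is often missed, which usually points to needing more labelled examples of that field.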
Step [9] Once model training is complete, you can upload a document and see whether the model extracts the required fields. Once you are happy with the evaluation metrics on the test set and a few random unit tests, deploy the model and start using it for production images.
Summary
We saw the creation and testing of Custom Document Extractors (CDE) in Google Cloud Platform's Document AI. CDEs are ideal for extracting data from business documents, such as unique forms specific to an organization. The process involves initializing a custom processor, defining the processor schema with fields and data types, uploading documents for training, and annotating the data. Generative AI can be leveraged to auto-label documents, reducing manual effort. Once sufficient labelled data is available, a training job is initiated to fine-tune the processor. After training, the processor is tested for accuracy on a separate dataset. Once validated, the model can be deployed for production use to process similar documents. By following these steps, users can build tailored extraction solutions for their enterprise.
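Outside the console, a deployed processor version can also be called from code. The following is a hedged Python sketch that assumes the `google-cloud-documentai` client library is installed and credentials are configured. The project, location, processor and version IDs are placeholders you would supply, and `entities_to_dict` and `process_file` are hypothetical helper names. The demo at the bottom uses a stand-in object with made-up values instead of a real API response, so it runs without network access.

```python
from types import SimpleNamespace


def entities_to_dict(document):
    """Collect extracted entities as {field type: [(text, confidence), ...]}."""
    fields = {}
    for entity in document.entities:
        fields.setdefault(entity.type_, []).append(
            (entity.mention_text, entity.confidence)
        )
    return fields


def process_file(project_id, location, processor_id, version_id, path,
                 mime_type="application/pdf"):
    """Send one local file to a processor version and return its entities."""
    # Imported lazily so entities_to_dict works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    name = client.processor_version_path(
        project_id, location, processor_id, version_id
    )
    with open(path, "rb") as handle:
        raw_document = documentai.RawDocument(
            content=handle.read(), mime_type=mime_type
        )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    return entities_to_dict(result.document)


# Offline demo: a stand-in for a response document, with made-up values.
fake_document = SimpleNamespace(entities=[
    SimpleNamespace(type_="employee_name", mention_text="Jane Doe", confidence=0.98),
    SimpleNamespace(type_="ssn", mention_text="000-00-0000", confidence=0.61),
])
demo_fields = entities_to_dict(fake_document)
print(demo_fields)
```

Keeping the confidence next to each value makes it easy to route low-confidence fields, like the misread SSN mentioned earlier, to a manual review step.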