Understanding Custom Classifiers in Google Document AI
Google Cloud Document AI offers three categories of models or services: general document processors (Layout, Form and Document OCR), specialized processors (invoices, tax forms, lending forms, contracts, etc.) and custom processors (custom classifiers, splitters and extractors). This article focuses on the custom classifier feature, CDC (Custom Document Classifier). CDC is used for document classification, mainly to identify types of business documents that do not fall into the general document categories (such as invoice or passport). Classifying a document is an important prerequisite for extraction: the end goal of document capture is to lift field data automatically from an image and export it, and to know which fields to lift you first need to know the document type (for example, you cannot lift an invoice number or invoice date from a gas bill). The custom model is trained on the business's own documents, and custom classes are created to match the requirements of the implementation.
High-level steps to create and use a GCP custom classifier
Let's walk through these steps in the GCP console for a better understanding.
Step [1] Enable the Document AI API in the console. Once it is enabled, go to the Custom Processors section and create a custom classifier as highlighted below.
Step [2] Enter the name and other details. On the next screen, click the Configure Your Dataset button and either opt for Google-managed storage or provide the details of an existing storage bucket for the dataset configuration.
Step [3] Create a storage bucket for the documents that will be used for classification training. Upload the sample training and test images to this bucket.
Step [4] Click the Import Documents button and provide the storage bucket path. Keep the auto-split configuration if all your documents are in one folder. If you have separate folders for the training and test sets, you can import each folder separately.
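To make the auto-split idea concrete, here is a minimal sketch of a random train/test split in Python. It does not call Document AI; the function name `auto_split`, the 80/20 ratio, and the `gs://my-bucket/...` paths are assumptions chosen for illustration, and the console's own split logic may differ.

```python
import random


def auto_split(uris, train_fraction=0.8, seed=0):
    """Shuffle document URIs and split them into (training, test) lists.

    The 80/20 default is an assumption for this sketch, not a value
    taken from the article.
    """
    shuffled = list(uris)                      # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)      # seeded for a reproducible split
    cut = int(len(shuffled) * train_fraction)  # number of training documents
    return shuffled[:cut], shuffled[cut:]


# Hypothetical bucket paths, for illustration only.
docs = [f"gs://my-bucket/samples/doc_{i:02d}.pdf" for i in range(15)]
train, test = auto_split(docs)
print(len(train), "training documents,", len(test), "test documents")
```

If the documents are already organized into separate training and test folders, no such split is needed and each folder can be imported into its own set.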
It will take a few minutes for all documents to be imported.
Step [5] Click Edit Schema and add the document labels (the document type names).
Step [6] Double-click any document tile and start the labelling exercise: manually assign each image to its class, since this information will be used to train the model. Select the class -> click Mark as Labelled -> repeat for the other documents.
Review each image and select the document type.
Step [7] Once the minimum number of documents is labeled, train the model by clicking Train New Version.
You can run into issues if the minimum criteria for creating a custom model are not met. A minimum of ten documents is needed for the training set and two for the test set. Once all the prerequisites are met, you should be able to create the model.
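The minimums above can be checked before clicking Train New Version. The following is a minimal local sketch, not a Document AI API call; the helper `training_blockers` and its input format (one "train" or "test" entry per labeled document) are hypothetical.

```python
from collections import Counter

MIN_TRAIN_DOCS = 10  # minimum training documents, as stated above
MIN_TEST_DOCS = 2    # minimum test documents, as stated above


def training_blockers(splits):
    """Return the reasons why training cannot start yet (empty list if none).

    `splits` is a hypothetical list with one entry per labeled document,
    each entry being either "train" or "test".
    """
    counts = Counter(splits)
    blockers = []
    if counts["train"] < MIN_TRAIN_DOCS:
        missing = MIN_TRAIN_DOCS - counts["train"]
        blockers.append(f"need {missing} more training document(s)")
    if counts["test"] < MIN_TEST_DOCS:
        missing = MIN_TEST_DOCS - counts["test"]
        blockers.append(f"need {missing} more test document(s)")
    return blockers


labeled = ["train"] * 8 + ["test"] * 2
print(training_blockers(labeled))
```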
Note: to resolve these issues, import and label more documents.
Step [8] Enter the version and click on the Start Training button.
Step [9] To check the status of model training, go to Manage Versions.
Step [10] Once training is completed, check the F1, precision and recall scores on the test set. These give you an indication of how well the model is performing.
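For readers less familiar with these metrics, the sketch below shows how precision, recall and F1 are conventionally computed for a single class from true and predicted labels. It uses the standard textbook definitions; the label names and sample predictions are made up, and the console may aggregate its scores across classes differently.

```python
def class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class using the standard definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical test-set labels and predictions, for illustration only.
y_true = ["gas_bill", "gas_bill", "lease", "lease", "lease"]
y_pred = ["gas_bill", "lease", "lease", "lease", "gas_bill"]

precision, recall, f1 = class_metrics(y_true, y_pred, "lease")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low precision for a class means other document types are being mistaken for it; low recall means documents of that class are being assigned elsewhere. Either is a hint about which classes need more labeled samples.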
Next, you can deploy the model if you are happy with the metrics; otherwise, import more samples and train the model again.
Step [11] Once deployment is completed, you can test the model by uploading an image. Go to the Evaluate & Test tab and upload a document.
In a few seconds you can see the document classification results and the percentage confidence score.
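When the same results are consumed programmatically, a common pattern is to pick the class with the highest confidence and send low-confidence documents for manual review. The sketch below assumes each result is a dictionary with `type` and `confidence` keys; the sample values, the `pick_class` helper, and the 0.7 threshold are hypothetical and should be tuned to your own data.

```python
def pick_class(results, min_confidence=0.7):
    """Return (label, confidence) for the best class, or (None, confidence)
    when the best score is below the review threshold.

    `min_confidence` is an illustrative threshold, not a Document AI default.
    """
    best = max(results, key=lambda r: r["confidence"])
    if best["confidence"] < min_confidence:
        return None, best["confidence"]  # route the document to manual review
    return best["type"], best["confidence"]


# Hypothetical classification results for one uploaded document.
results = [
    {"type": "gas_bill", "confidence": 0.93},
    {"type": "lease", "confidence": 0.05},
    {"type": "other", "confidence": 0.02},
]

label, confidence = pick_class(results)
print(label, f"{confidence:.0%}")
```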
Step [12] Once the model is available you can make API requests and get classification results.
Sample endpoint - https://us-documentai.googleapis.com/v1/projects/75XXXXXXX803/locations/us/processors/7dd4dazzzzzzzz429/processorVersions/21YYYYYY191af7cc:process
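The sample endpoint follows a regular pattern, so it can be assembled from its parts. This is a small sketch based only on the URL shown above; the function name `process_endpoint` is an assumption, and the masked IDs are reused as-is from the sample.

```python
def process_endpoint(project, location, processor, version):
    """Build the :process URL for a specific processor version,
    following the pattern of the sample endpoint above."""
    return (
        f"https://{location}-documentai.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"processors/{processor}/processorVersions/{version}:process"
    )


# Masked identifiers copied from the sample endpoint.
url = process_endpoint(
    project="75XXXXXXX803",
    location="us",
    processor="7dd4dazzzzzzzz429",
    version="21YYYYYY191af7cc",
)
print(url)
```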
Request body format
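A minimal sketch of building such a request body in Python, assuming the `rawDocument` form of the Document AI `process` request (base64-encoded file content plus its MIME type). The helper name `build_process_request` and the placeholder bytes are illustrative; sending the request also requires an OAuth access token in the `Authorization` header, which is not shown here.

```python
import base64
import json


def build_process_request(file_bytes, mime_type):
    """Return a JSON-serializable request body with the document inlined.

    Assumes the `rawDocument` request shape; check the API reference for
    other options before relying on this in production.
    """
    return {
        "rawDocument": {
            "content": base64.b64encode(file_bytes).decode("ascii"),
            "mimeType": mime_type,
        }
    }


# Placeholder bytes standing in for a real PDF file read from disk.
sample_bytes = b"%PDF-1.4 placeholder"
body = build_process_request(sample_bytes, "application/pdf")
print(json.dumps(body, indent=2))
```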
Summary
In this article, we explored the creation and use of Custom Document Classifiers (CDC) within Google Cloud's Document AI platform. CDCs are well suited to categorizing business-specific documents, which enables the downstream data extraction process. The workflow involves creating a custom classifier, preparing a dataset, and importing documents. Users annotate data manually or use auto-labeling (available once the first model version is deployed) to reduce the annotation effort. We also covered splitting the dataset into training and test sets: the classification model is built from the training set, evaluation metrics are computed on the test set, and if the scores are poor, the model can be improved by adding more samples and retraining.
Once the model is deployed, it can be tested with production-like documents. This process lets businesses classify document types ahead of the next step, automated data capture. Use Document AI custom classifiers to meet specific document classification needs, improving classification accuracy and reducing manual effort.