Effortlessly Organize Mixed Documents with GCP's Custom Splitter Feature

Vijay Chaudhary

Lead Software Engineer

Published: January 19, 2025

In real-world scenarios, it's common to encounter multiple documents combined into a single multi-page image or PDF file. In financial institutions, incoming correspondence often arrives as a package: loose pages in a folder, or a PDF in which the customer has scanned pages from several different documents before sending it through electronic channels. To make complete sense of incoming correspondence, you have to solve three main problems:

Separation: Dividing pages into distinct logical documents.

Classification: Identifying the type of each document.

Data Extraction: Extracting key fields from the documents.

Today we will focus on separation. Classification and separation generally go hand in hand, as the two activities are closely related. In one use case, classification comes first: you identify the document type by looking at each page. In another, separation comes first: you look for elements such as barcodes or patch codes and then attempt to classify the resulting documents. A third case is a pile of pages holding the same set of documents from different customers; when these are sent in through high-speed scanners, the documents still need to be separated (in earlier days you would stick a barcode sticker on a page to mark the boundary). There are many use cases like these where documents need to be separated.
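The two orderings described above can be sketched in a few lines of Python. This is a minimal illustration of the classic rule-based approaches, not of how the custom splitter works internally. The function names are hypothetical, and the per-page inputs (a separator flag or a page label) are assumed to come from an upstream barcode reader or page classifier.

```python
from itertools import groupby
from typing import List, Tuple


def split_on_separators(is_separator: List[bool]) -> List[List[int]]:
    """Separation first: close the current document at each separator page.

    is_separator[i] is True when page i is a barcode or patch-code sheet.
    Separator pages themselves are dropped from the output.
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for page, flag in enumerate(is_separator):
        if flag:
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


def group_by_page_label(labels: List[str]) -> List[Tuple[str, List[int]]]:
    """Classification first: merge consecutive pages that share a label."""
    groups = []
    for label, run in groupby(enumerate(labels), key=lambda item: item[1]):
        groups.append((label, [page for page, _ in run]))
    return groups


print(split_on_separators([True, False, False, True, False]))
# [[1, 2], [4]]
print(group_by_page_label(["application", "application", "photo_id"]))
# [('application', [0, 1]), ('photo_id', [2])]
```

Note that the classification-first sketch merges two back-to-back documents of the same type into one group, which is one reason a trained splitter is attractive.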

Example Use Cases

Mortgage Processing - Splitting large packages into individual documents like applications, financial statements, and ID proofs.

Insurance Claims - Separating bundles of claim forms, photos, and supporting evidence for independent processing.

Customer Onboarding - Handling bulk submissions where multiple applicants' documents are scanned and sent together.

With advancements in machine learning, good progress has been made in this field. In this article we will look at the Document AI custom splitter, which uses machine learning internally to separate documents. A custom splitter is designed to split composite documents (documents made up of multiple classes) into a number of single-class documents by identifying each logical document. For example, a mortgage package contains multiple classes, such as application, income verification, and photo ID. Custom splitter processors are trained from the ground up using your own documents and custom classes (labels). Supported input formats are PDF, TIFF, TIF, and GIF (15 pages, 20 MB max).
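Because of these input limits, it can help to check a file before sending it to the processor. The following is a minimal sketch of such a pre-check. The function name `check_input` is hypothetical, the limits are copied from the figures quoted above (with 20 MB read as 20 × 1024 × 1024 bytes, which is an assumption), and the page count is assumed to be known from elsewhere.

```python
from pathlib import Path
from typing import List

# Limits as quoted in this article; confirm against the current Document AI limits.
ALLOWED_EXTENSIONS = {".pdf", ".tiff", ".tif", ".gif"}
MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB
MAX_PAGES = 15


def check_input(filename: str, size_bytes: int, page_count: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"file has {page_count} pages, limit is {MAX_PAGES}")
    return problems


print(check_input("mortgage_package.pdf", 3_500_000, 12))
# []
print(check_input("mortgage_package.docx", 3_500_000, 12))
# ['unsupported file type: .docx']
```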

Here are the high-level steps to create a custom splitter.

Let’s go through the detailed steps and understand how to create and use these custom splitters.

[1] Create Processor - Choose custom splitter.

[2] Once the processor is created, click Configure Your Dataset.

Pass your bucket details.

[3] Add a label for each document type you want to include in your solution.

[4] Upload all your documents. It is a good idea to create a folder for each document type; documents are auto-labelled with this approach.

[5] Once the schema is created and the documents are uploaded to the bucket, click the Import Documents option and pass the bucket address for each document type.

[6] View LABEL STATS and check that you have added enough documents.
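Before importing, the folder-per-type layout from step [4] and the document counts from step [6] can be sanity-checked locally. The sketch below assumes object paths of the form `<prefix>/<label>/<file>`. The helper names and the `MIN_DOCS_PER_LABEL` threshold are hypothetical and are not Document AI rules.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

MIN_DOCS_PER_LABEL = 10  # hypothetical threshold, pick one that suits your data


def label_from_path(object_path: str) -> str:
    """Use the immediate parent folder of an object as its label."""
    return PurePosixPath(object_path).parent.name


def label_counts(object_paths: Iterable[str]) -> Dict[str, int]:
    """Count how many documents sit under each label folder."""
    return dict(Counter(label_from_path(path) for path in object_paths))


def underfilled_labels(counts: Dict[str, int],
                       minimum: int = MIN_DOCS_PER_LABEL) -> List[str]:
    """List the labels that have fewer than `minimum` documents."""
    return sorted(label for label, n in counts.items() if n < minimum)


paths = [
    "training/application/app_001.pdf",
    "training/application/app_002.pdf",
    "training/photo_id/id_001.pdf",
]
counts = label_counts(paths)
print(counts)
# {'application': 2, 'photo_id': 1}
print(underfilled_labels(counts, minimum=2))
# ['photo_id']
```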

[7] Once all of this is set, start the model training and check its progress from the Manage Versions tab.

[8] When training is complete, you can see various accuracy scores for the trained model. If the metrics are not good, you can add more documents and train again.

[9] Once the model is deployed, test the newly created separation model by uploading a merged document that has pages from all the document types.
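The same test can also be run from code instead of the console. The following is a rough sketch using the `google-cloud-documentai` Python client, assuming the package is installed and application default credentials are configured. The function names are hypothetical, and the project, location, and processor ID are placeholders you would supply yourself.

```python
from typing import Any


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Build the full resource name of a processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def api_endpoint(location: str) -> str:
    """Document AI serves each location from a regional endpoint."""
    return f"{location}-documentai.googleapis.com"


def split_merged_pdf(pdf_path: str, project_id: str, location: str,
                     processor_id: str) -> Any:
    """Send a merged PDF to the deployed splitter and return the Document."""
    # Imported lazily so the helpers above work without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    with open(pdf_path, "rb") as handle:
        raw_document = documentai.RawDocument(
            content=handle.read(), mime_type="application/pdf"
        )
    request = documentai.ProcessRequest(
        name=processor_name(project_id, location, processor_id),
        raw_document=raw_document,
    )
    return client.process_document(request=request).document
```

Addressing the processor by its plain resource name uses whichever version is set as the default, so make sure the newly trained version is deployed and set as default first.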

[10] Notice the page ranges in the JSON response. Note that this service only gives you the page ranges; it doesn't separate the pages. For that you must use a separate utility.
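A minimal sketch of such a utility follows. It assumes the response follows the Document AI `Document` JSON layout, where each entity carries a `type`, a `confidence`, and a `pageAnchor.pageRefs` list of zero-based page indices. The sample response is hand-written for illustration rather than real output, and the function names are hypothetical.

```python
from typing import Any, Dict, List


def page_ranges(document: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one {type, confidence, pages} entry per entity in the response."""
    ranges = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: "page" is zero-based, may be serialized as a string,
        # and may be omitted entirely when its value is 0.
        pages = sorted(int(ref.get("page", 0)) for ref in refs)
        ranges.append({
            "type": entity.get("type", ""),
            "confidence": entity.get("confidence", 0.0),
            "pages": pages,
        })
    return ranges


def split_pages(pages: List[Any], ranges: List[Dict[str, Any]]) -> List[List[Any]]:
    """Group a flat list of page objects into one list per logical document."""
    return [[pages[index] for index in item["pages"]] for item in ranges]


sample_response = {
    "entities": [
        {"type": "application", "confidence": 0.98,
         "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
        {"type": "photo_id", "confidence": 0.91,
         "pageAnchor": {"pageRefs": [{"page": "2"}]}},
    ]
}

ranges = page_ranges(sample_response)
print([(item["type"], item["pages"]) for item in ranges])
# [('application', [0, 1]), ('photo_id', [2])]
print(split_pages(["page-0", "page-1", "page-2"], ranges))
# [['page-0', 'page-1'], ['page-2']]
```

To write actual files, the same page indices could be passed to a PDF library of your choice; the list of strings above only stands in for real page objects.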

Summary

Managing large, mixed documents is a common challenge in industries like finance and insurance. Businesses often receive multi-page PDFs or images containing different document types bundled together. To process these effectively, three steps are key: separating the pages into distinct documents, identifying each document type, and extracting relevant data. Custom Splitter simplifies this separation process by using machine learning to split composite documents into single, logical units. For example, a mortgage file containing an application, income verification, and photo ID can be divided into separate documents. By automating document separation, GCP’s Custom Splitter saves time, reduces manual work and ensures accurate processing.
