Automating Invoice Classification with Chainlit, LangGraph, Gemini Flash, Tesseract, and EasyOCR

In today’s data-driven business environment, processing a large volume of invoices efficiently is a crucial task. Businesses struggle with the time-consuming nature of manual invoice processing, as well as with inconsistent accuracy across OCR tools. This article presents the first iteration of a solution for automating invoice classification by combining Optical Character Recognition (OCR) with Large Language Models (LLMs), implemented with LangGraph, deep learning-based OCR tools, and the Gemini Flash LLM.

Purpose of the Application

This Proof of Concept aims to simplify the management of PDF invoices by automating data extraction and classification. By utilizing multiple OCR engines and comparing the results through an LLM-based similarity assessment, this system can accurately classify invoices and generate detailed comparative reports. The end result improves workflow efficiency and ensures data accuracy, essential for businesses managing large numbers of invoices.

Problem Statement: Challenges in Invoice Processing

  1. Manual Classification is Time-Consuming: Manually sorting and extracting data from invoices leads to inefficiency, especially when businesses are handling thousands of documents daily.
  2. OCR Accuracy Varies Across Tools: Many OCR engines produce varying results based on the document’s format or quality, requiring manual validation.
  3. Need for Reliable Similarity Assessment: Inconsistent data extraction makes it difficult to reliably classify documents based on textual similarities, which is important for detecting duplicates or errors.

Additionally, third-party software often faces limitations such as:

  • Struggles with handling diverse invoice formats, leading to errors.
  • Bugs that are poorly handled by service providers.
  • High costs related to licensing and infrastructure.

Solution Overview: Application Workflow

The system involves the following steps:

  1. Document Ingestion: The system retrieves PDF invoices from an input folder, to which users have uploaded them beforehand.
  2. Parallel OCR Processing and Text Extraction: The system runs EasyOCR and PyTesseract in parallel on each PDF. Each run covers image preprocessing, text extraction, and post-processing.
  3. Comparison and Similarity Calculation: Gemini Flash compares the two OCR outputs textually, and a simple cosine similarity computation measures how closely they match.
  4. Data Classification with Gemini Flash Function Calling: The system generates a detailed markdown report for each invoice and displays classification summaries in the UI. Based on the similarity score, the LLM agent (using LangChain function calling) copies the processed files to categorized folders (low, medium, or high similarity).

OCR Nodes: Detailed Overview

The system uses two OCR tools:

  • PyTesseract: A Python wrapper for Tesseract, the open-source OCR engine whose development was sponsored by Google, with support for multiple languages and customizable settings.
  • EasyOCR: A deep learning-based OCR tool that supports over 80 languages and is suitable for complex backgrounds.

Both tools are used in parallel to extract text, which is then compared using a language model for classification.
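
As a rough illustration of the parallel step, here is a minimal sketch that runs two engines on the same page image concurrently. It uses a plain thread pool rather than the LangGraph nodes of the actual project, and run_tesseract, run_easyocr, and run_engines_in_parallel are hypothetical stand-ins that return placeholder text instead of calling the real OCR libraries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical stand-ins for the real engines. In the actual project these
# would wrap PyTesseract and EasyOCR calls on a preprocessed page image.
def run_tesseract(image_path: str) -> str:
    return f"tesseract text for {image_path}"


def run_easyocr(image_path: str) -> str:
    return f"easyocr text for {image_path}"


def run_engines_in_parallel(
    image_path: str, engines: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run every OCR engine on the same image concurrently.

    Returns a mapping from engine name to the text that engine extracted.
    """
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Submit one task per engine, then wait for each result.
        futures = {name: pool.submit(fn, image_path) for name, fn in engines.items()}
        return {name: future.result() for name, future in futures.items()}


results = run_engines_in_parallel(
    "page_1.png",
    {"tesseract": run_tesseract, "easyocr": run_easyocr},
)
```

Keeping the two extracts in a dictionary keyed by engine name makes it straightforward to hand both texts to the comparison step afterwards.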


Before going through either OCR engine, each PDF is converted page by page into PNG images, which are then preprocessed in several steps:

  • grayscale conversion
  • denoising
  • thresholding
  • contrast enhancement
  • resizing

This preprocessing step leaves considerable room for improvement.
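
To make the idea concrete, here is a small, dependency-free sketch of three of these steps (grayscale conversion, denoising, thresholding) on an image stored as nested lists of pixels. It is only an illustration: the function names, the 3x3 median filter, and the fixed cutoff of 128 are assumptions, not the project's actual choices, and a real pipeline would normally rely on an imaging library. Contrast enhancement and resizing are left out for brevity.

```python
from statistics import median
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]


def to_gray(image: RGBImage) -> GrayImage:
    """Convert RGB pixels to 0-255 luminance using the BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def median_denoise(image: GrayImage) -> GrayImage:
    """Replace each pixel by the median of its neighbourhood (up to 3x3)."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixel and its neighbours, clipped at the borders.
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


def threshold(image: GrayImage, cutoff: int = 128) -> GrayImage:
    """Binarize: pixels at or above the cutoff become white, others black."""
    return [[255 if pixel >= cutoff else 0 for pixel in row] for row in image]


def preprocess(image: RGBImage) -> GrayImage:
    """Chain the three steps in the order listed in the article."""
    return threshold(median_denoise(to_gray(image)))


# A white 3x3 page with one black speck of noise in the middle.
white, black = (255, 255, 255), (0, 0, 0)
noisy_page = [[white, white, white], [white, black, white], [white, white, white]]
clean_page = preprocess(noisy_page)
```

In this toy input the median filter removes the isolated black pixel before thresholding, which is the kind of speckle that tends to confuse OCR engines on scanned invoices.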

Comparison Node: Report and Classification

The comparison node evaluates the OCR extracts based on key information like invoice number, emitter, and receiver. Using embeddings from the sentence-transformers/all-MiniLM-L6-v2 model, it computes cosine similarity to assess how closely the extracted texts match. The output includes a markdown report and similarity score for each invoice.
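
Cosine similarity itself is a small formula: cos(a, b) = (a · b) / (|a| |b|), where a and b are the embedding vectors of the two OCR extracts. The sketch below implements it in plain Python on toy vectors; in the real pipeline the vectors would come from the sentence-transformers model, and the function name and the zero-vector convention are assumptions made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embeddings of the two OCR extracts.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0, which is why the score works as a rough measure of agreement between the two OCR outputs.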

File Classification and Similarity Assessment

Based on the similarity assessment, the LLM agent sorts each invoice and its markdown report into the “low,” “medium,” or “high” similarity folder, using function-calling capabilities via the LangChain framework.

This is done with the following code, which configures the LLM agent with the file-system toolkit:

import os

from langchain.agents import initialize_agent, AgentType
from langchain_community.agent_toolkits import FileManagementToolkit
from langchain_google_genai import ChatGoogleGenerativeAI

from prompts.classification_prompt import classification_prompt


class ClassificationAgent:
    def __init__(self):
        # Initialize LLM
        self.llm = ChatGoogleGenerativeAI(
            model="gemini-1.5-flash-latest",
            temperature=0,
            max_tokens=None,
            timeout=None,
            max_retries=2,
            # other params...
        )
        # Define the File System tools
        self.working_directory = os.path.abspath('./')
        self.toolkit = FileManagementToolkit(root_dir=self.working_directory)
        self.tools = self.toolkit.get_tools()
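        # Build a structured-chat ReAct agent. Extra keyword arguments are
        # forwarded to the AgentExecutor, so parsing errors are fed back to
        # the model instead of raising.
        self.agent = initialize_agent(
            self.tools,
            self.llm,
            agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
            verbose=True,
            handle_parsing_errors=True,
        )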

This prompt is passed to the agent:

def classification_prompt(working_directory, file_path, report_path, similarity):
    prompt = (
        f"Here are the instructions for organizing the files based on their similarity scores:\n\n"
        f"1. There is an original PDF file located at: '{file_path}'.\n"
        f"2. A report file has been generated and is available at: '{report_path}'.\n\n"
        f"Now, follow these rules based on the average similarity score ({similarity}):\n"
        f"- If the average similarity is less than 0.5, move the files to the folder: "
        f"'{working_directory}/data/output/low_similarity'.\n"
        f"- If the average similarity is between 0.5 and 0.8, move the files to the folder: "
        f"'{working_directory}/data/output/medium_similarity'.\n"
        f"- If the average similarity is greater than 0.8, move the files to the folder: "
        f"'{working_directory}/data/output/high_similarity'.\n\n"
        f"Make sure to copy the original PDF and to move the report file to the correct folder based on this logic."
    )

    return prompt        

This could have been done with a few lines of ordinary code, but it serves as a simple demonstration of tool calling with an LLM.
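
For comparison, a deterministic version of the same rules might look like the sketch below. The thresholds and folder names mirror the prompt above; the function names are hypothetical, and treating a score of exactly 0.8 as "medium" is an assumption, since the prompt leaves that boundary open.

```python
import shutil
from pathlib import Path


def similarity_bucket(similarity: float) -> str:
    """Map an average similarity score to the output folder name."""
    if similarity < 0.5:
        return "low_similarity"
    if similarity <= 0.8:  # assumption: exactly 0.8 counts as medium
        return "medium_similarity"
    return "high_similarity"


def classify_files(
    file_path: str, report_path: str, similarity: float, output_root: str
) -> Path:
    """Copy the PDF and move its report into the matching similarity folder."""
    target = Path(output_root) / similarity_bucket(similarity)
    target.mkdir(parents=True, exist_ok=True)
    # The original PDF is copied, so the input folder stays untouched.
    shutil.copy2(file_path, target / Path(file_path).name)
    # The report is moved, matching the "cut and paste" rule in the prompt.
    shutil.move(str(report_path), str(target / Path(report_path).name))
    return target
```

A call such as classify_files("invoice.pdf", "invoice_report.md", 0.92, "data/output") would place both files under data/output/high_similarity, with no LLM involved; the file names here are purely illustrative.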

Constraints and Outputs:

In this example, we process two small PDF files: one is a clean invoice and the other is noisy. Here is the final result in the UI:

The tool's other output is located in the /data/output folder (the processed folder persists the original images and the preprocessed versions used for OCR):

As you can see, Gemini Flash, combined with tool calling on the file management toolkit, has correctly sorted the files into the dedicated output folders according to the similarity computed between the two OCR extracts.

Future improvements include:

  • Implementing parallel processing to enhance performance.
  • Conducting finer-grained similarity checks based on invoice headers.
  • Improving image preprocessing techniques to boost OCR accuracy.
  • Introducing a LayoutLM model for better invoice layout understanding.

In summary, while the tool already offers useful capabilities, these improvements should noticeably enhance its performance and accuracy. By refining the similarity checks, optimizing image preprocessing, and leveraging advanced models like LayoutLM, the system will be better equipped to handle a variety of invoice formats with increased efficiency and precision.

This tool is only a prototype: a heuristic way of exploring how AI can enhance some enterprise processes.

It is available on this GitHub repo.