Generative AI to improve OCR
OCR enhanced with Generative AI

Introduction

Optical Character Recognition (OCR) technology plays a crucial role in various industry sectors by enabling the extraction and interpretation of text from images, scanned documents, or other visual inputs.

Below are some examples of how OCR is used in different industries:

  • Banking: OCR is employed to extract information from scanned checks, invoices, and various financial documents. This improves accuracy in data entry and speeds up transaction processing.
  • Retail: OCR is used to digitize and manage inventory data by extracting information from product labels and barcodes. This helps in maintaining accurate stock levels and reducing manual errors. Another retail use case is related to the extraction of data from receipts, facilitating expense tracking, and improving overall financial management.
  • Automotive: OCR is employed in supply chain processes to extract information from shipping documents, invoices, and packaging labels, enhancing transparency and efficiency.
  • Pharmaceuticals: OCR assists in the extraction of information from regulatory documents, ensuring compliance with industry standards and regulations.

Many solutions address OCR use cases, both open source and proprietary, with different levels of accuracy. Below are some examples:

  • IBM Datacap: IBM Datacap acquires documents, extracts useful information from them, and feeds them into other business processes downstream. Its strength is its ability to complete these tasks with a high degree of automation, flexibility, and accuracy.
  • Amazon Textract: Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, layout elements, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract specific data from documents.
  • Google Cloud Vision API: Integrates Google Vision features, including image labeling, face, logo, and landmark detection, optical character recognition (OCR), and detection of explicit content, into applications.
  • Azure AI Vision: The cloud-based Azure AI Vision API provides developers with access to advanced algorithms for processing images and returning information. Microsoft's Read OCR engine is composed of multiple advanced machine-learning based models supporting global languages. It can extract printed and handwritten text including mixed languages and writing styles.
  • Tesseract: Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and documents without a text layer and outputs the document into a new searchable text file, PDF, or most other popular formats. Tesseract is highly customizable and can operate using most languages, including multilingual documents and vertical text.

In some cases, OCR solutions fail to extract the expected result. This can happen for several reasons, including:

  1. the image quality is poor
  2. the text is handwritten rather than printed

Addressing these cases is not simple, and they are the main source of errors in OCR use cases.

Empowering OCR solutions with Generative AI

If you have read my previous blogs (Innovative approach to AI project delivery with Generative AI, Unlocking the power of generative AI to visualize functional requirements, Generative AI for tabular data explanation: prompt limit is not a limit, AI pipeline to "play a picture of a musical score", and its implication in generative AI, Talking with a GraphDB leveraging generative AI, Generative AI impact on data platform solutions), you have seen that generative AI is not only about text generation: it is a technology that opens the door to a large set of use cases.

Now, if we consider the OCR use case, generative AI can be used to improve the output of an OCR service by reconstructing the extracted text, detecting and correcting OCR errors where they occur.

Consider the following handwritten example, captured with a smartphone camera:

Handwritten text.

Below is an example of the OCR result, extracted directly with the native OCR feature of the smartphone that took the picture:

Native smartphone OCR tool.

Below is the extracted text:

- Draumerely muprove propuctivity
- "SUPER INTELLIGENT AN EVERYWHERE" F revolutia
- USE CASE
1) Search engive
TODAY
→ TOMORROW
1 RSSULT!
the corvet de
auswers
2) ADVERTISMENT /e - COMMEReL
Eig. AMAZON → every user will see a ol
antomaticelly jenerale        

As you can see, there are many errors, due both to the quality of the picture and to the fact that the text was handwritten and therefore harder to analyze.

As I mention in every post of my GenAI blog series: GenAI happened!

Now we can use generative AI, in particular a large language model (LLM), to identify and correct the misrecognized words and sentences. In this example, I'll use IBM watsonx as the generative AI tool to complete the task.

First of all, I formulated a prompt (request) for the LLM, shown below:

You have following sentences extract from an OCR process. 
You need to recognize the words/sentences that have not been correctly matched from original picture and correct the overall text.
Generate an output including a list of unmanaged words or sentences with the correct reconstruction.

Below the sentences to analyze:
- Draumerely muprove propuctivity
- "SUPER INTELLIGENT AN EVERYWHERE" F revolutia
- USE CASE
1) Search engive
TODAY
→ TOMORROW
1 RSSULT!
the corvet de
auswers
2) ADVERTISMENT /e - COMMEReL
Eig. AMAZON → every user will see a ol
antomaticelly jenerale         
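
For readers who want to automate this step, the prompt above can be assembled from the raw OCR lines in a few lines of Python. This is only a minimal sketch: the names `PROMPT_TEMPLATE`, `build_correction_prompt`, and `correct_ocr` are hypothetical, and `generate` is an assumed placeholder for whatever function sends a prompt to your LLM (for example a call into the watsonx SDK) and returns its text reply. It is not an actual watsonx API.

```python
from typing import Callable, Iterable

# The prompt wording is copied verbatim from the article; only the
# {ocr_text} placeholder is added for this sketch.
PROMPT_TEMPLATE = (
    "You have following sentences extract from an OCR process.\n"
    "You need to recognize the words/sentences that have not been correctly "
    "matched from original picture and correct the overall text.\n"
    "Generate an output including a list of unmanaged words or sentences "
    "with the correct reconstruction.\n"
    "\n"
    "Below the sentences to analyze:\n"
    "{ocr_text}"
)


def build_correction_prompt(ocr_lines: Iterable[str]) -> str:
    """Insert the raw OCR lines into the prompt, dropping empty lines
    and trailing whitespace left over from the extraction."""
    cleaned = [line.rstrip() for line in ocr_lines if line.strip()]
    return PROMPT_TEMPLATE.format(ocr_text="\n".join(cleaned))


def correct_ocr(ocr_lines: Iterable[str], generate: Callable[[str], str]) -> str:
    """Send the filled-in prompt to an LLM.

    `generate` is an assumption: any callable that takes a prompt string
    and returns the model's answer as a string.
    """
    return generate(build_correction_prompt(ocr_lines))
```

Keeping the LLM call behind a plain callable means the same prompt can be tried against different models without changing the rest of the code.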

Below is the output generated by the LLM:

OCR reconstruction via Large Language Models.

As you can see, the large language model did a great job, resolving almost all of the errors from the smartphone OCR system. To be honest, one error is still present; it is your mission to detect it and reformulate the prompt to avoid it!
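
When hunting for a leftover error, it can help to list exactly which words the model changed, so that anything it left untouched stands out for manual review. The sketch below uses Python's standard `difflib` module for a word-level comparison. The function name `word_changes` is hypothetical, and the corrected string in the usage example is hand-written for illustration, not the model's actual output.

```python
import difflib
from typing import List, Tuple


def word_changes(raw: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every run of words that differs
    between the raw OCR text and the corrected text."""
    raw_words = raw.split()
    new_words = corrected.split()
    matcher = difflib.SequenceMatcher(a=raw_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (" ".join(raw_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


# Illustrative input only: the second string is typed by hand.
print(word_changes("1) Search engive", "1) Search engine"))
# → [('engive', 'engine')]
```

An empty result for a line that still looks wrong is a hint that the prompt needs to be more explicit about that kind of error.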

Conclusion

In conclusion, Optical Character Recognition (OCR) is an important technology in many industry sectors, improving the speed and efficiency of processes by interpreting text from visual inputs.

While there are numerous OCR solutions available, both open source and proprietary, addressing the challenges tied to image quality and handwritten text remains crucial to improving OCR performance.

By integrating Generative AI into OCR systems, we can further enhance their capabilities and minimize errors, thereby driving more effective and reliable results. As industries continue to embrace and adopt OCR technology, the combination of OCR and Generative AI promises to unlock new levels of productivity, accuracy, and automation across a wide range of use cases and applications.

#OCR #OpticalCharacterRecognition #GenerativeAI #IndustryInnovation #Banking #Retail #Automotive #Pharmaceuticals #AIinBusiness #TextExtraction #MachineLearning #DocumentProcessing #DataAccuracy #Automation #watsonx

Jeroen Hellingman

Software Engineer at Triodos Bank

9 months ago

At Project Gutenberg Distributed Proofreaders, we use a page-by-page interface where volunteers correct the output of OCR manually. I would very much like to try this out, using this approach, for handling the first round of corrections automatically. I will look into the degree this will be possible, especially with older texts, which are often hard to read, and use non-standard spellings (which we want to retain).

More articles by Simone Romano
