Shakti 4B's OCR Capabilities: A Comprehensive Evaluation

In the dynamic field of Optical Character Recognition (OCR), Mistral AI's recent introduction of Mistral OCR (Mistral OCR 2503) has attracted significant attention. The model boasts a reported overall accuracy of 94.89%, surpassing established solutions like Google Document AI (83.42%) and Azure OCR (89.52%). Its ability to process up to 2,000 pages per minute and its support for multiple languages position it as a formidable player in the OCR domain.

In light of these advancements, we conducted a comprehensive benchmarking study to evaluate the performance of Shakti 4B, developed by SandLogic Technologies, in comparison to Mistral OCR (Mistral OCR 2503).

Evaluation Methodology: Utilizing ChatGPT as an LLM Judge

To ensure an unbiased and thorough assessment, we employed ChatGPT as a Large Language Model (LLM) judge in our evaluation process. This approach leverages ChatGPT's advanced natural language understanding capabilities to analyze and compare OCR outputs effectively. The methodology involved the following steps:

  1. Selection of Test Documents: A diverse set of documents was curated, encompassing various formats such as standard PDFs, handwritten texts, PPT slides, tables, billboards, conversation transcripts, mathematical formulas, rotated/angled texts, and complex headlines.
  2. OCR Processing: Both Shakti 4B and Mistral OCR were applied to each document to extract text and relevant information.
  3. ChatGPT Evaluation: The outputs from both OCR models were presented to ChatGPT, which analyzed them based on accuracy, formatting preservation, contextual understanding, and error rates.
  4. Scoring and Analysis: ChatGPT assigned accuracy percentages and provided observations for each test case, facilitating a detailed comparison between the two OCR models.
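
The four steps above can be sketched in code. The article does not publish the actual prompt or the reply format used with ChatGPT, so the prompt wording, the function names `build_judge_prompt` and `parse_accuracy`, and the `<model>: <accuracy>%` reply convention below are all assumptions made for illustration. The call to the judge itself is left out; only the assembly of the request and the parsing of the reply are shown.

```python
import re

# Criteria named in step 3 of the methodology.
CRITERIA = ["accuracy", "formatting preservation",
            "contextual understanding", "error rates"]

def build_judge_prompt(task, shakti_output, mistral_output):
    """Assemble the text sent to the LLM judge for one test document.

    The wording is hypothetical; the original prompt is not published.
    """
    return (
        f"Task given to both OCR models: {task}\n"
        f"Judge the outputs on: {', '.join(CRITERIA)}.\n"
        "Reply with one line per model in the form '<model>: <accuracy>%', "
        "each followed by a short observation.\n\n"
        f"--- Shakti 4B output ---\n{shakti_output}\n\n"
        f"--- Mistral OCR output ---\n{mistral_output}\n"
    )

def parse_accuracy(reply):
    """Extract '<model>: <number>%' pairs from the judge's reply (assumed format)."""
    pattern = re.compile(
        r"^(?P<model>[^:\n]+):\s*(?P<pct>\d+(?:\.\d+)?)\s*%", re.MULTILINE
    )
    return {m["model"].strip(): float(m["pct"]) for m in pattern.finditer(reply)}

# Example with a made-up reply in the assumed format.
reply = "Shakti 4B: 98%\nMinor text variations.\nMistral OCR: 95%\nSeveral word errors."
scores = parse_accuracy(reply)
```

Asking the judge for a fixed reply format keeps the scoring step mechanical: the percentages can be collected per test case without reading each observation by hand.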

OCR Benchmarking Results Summary:

Shakti 4B vs. Mistral AI OCR (Mistral OCR 2503)

Summary of Performance

Shakti-4B OCR:

  • Excelled in structured extraction tasks (tables, slides), handwritten cursive, and context-rich visual data (billboards, slides).
  • Highly accurate with exceptional robustness, particularly in complex multimodal OCR scenarios.
  • Strengths: Handwriting, structured text, and general multimodal capabilities.

Mistral-OCR:

  • Showed superior handling of conversation transcripts and more detailed accuracy in mathematical extraction, though with minor truncation.
  • Slightly more effective in capturing detailed textual nuances in headlines.
  • Strengths: Timestamping and speaker identification in dialogues, precise detail extraction.


Let's dive deep into the examples

To illustrate the practical implications of these findings, let's walk through the ten test cases, including those where Mistral OCR came out ahead:

1. Standard PDF Image

Input Image and Prompt given: Extract the text from the image.

Shakti-4B Output:

6.1 Results for Shakti-VLM-1B
6.1.1 Performance Highlights
Shakti-VLM-1B demonstrates exceptional performance across multiple benchmarks despite its compact size. The model achieves high performance in several key areas:
o	MMU (Multimodal Massive Multitask Understanding): Shakti-VLM-1B achieves 42.5% on the validation set, surpassing all comparison models of same parameter and competitive to the latest Qwen-2.5VL 3B[36] model.
o	Document and Text Understanding: Strong performance on DocVQA, TextVQA, and OCRBench demonstrates the model's robust text recognition and document understanding capabilities.
o	Chart Understanding: Leading performance on ChartQA indicates superior ability to interpret and reason about visual data representations.
o	General Multimodal Evaluation: Shakti-VLM-1B achieves the highest score on MME with 1910.62 points and MMStar with 50.13%, showcasing its balanced capabilities across diverse multimodal tasks.
o	Mathematical Reasoning: Strongest performance on MathVista among models in its size class, demonstrating advanced visual mathematical reasoning capabilities and competitive to the latest model.        

ChatGPT Observation: Achieved 98% accuracy, with minor text variations.

Mistral-OCR Output:

6.1 Results for Shakti-VLM-1B
6.1.1 Performance Highlights

Shakti-VLM-1B demonstrates exceptional performance across multiple benchmarks despite its compact size. The model achieves high performance in several key areas:

o	MMMU (Multimodal Massive Multitask Understanding): Shakti-VLM-1B achieves $42.5 %$ on the validation set, surpassing all comparison models of same parameter and compitative to the latest Qwen-2.5VL 3B[36] model.

o	Document and Text Understanding: Strong performance on DocVQA, TextVQA, and OCRBench demonstrates the model's robust text recognition and document understanding capabilities.

o	Chart Understanding: Leading performance on ChartQA indicates superior ability to interpret and reason about visual data representations.

o	General Multimodal Evaluation: Shakti-VLM-1B achieves the highest score on MME with 1910.62 points and MMStar with $50.13 %$, showcasing its balanced capabilities across diverse multimodal tasks.

o	Mathematical Reasoning: Strongest performance on MathVista among models in its size class, demonstrating advanced visual mathematical reasoning capabilities and competitive to the latest model.        

ChatGPT Observation: Lower accuracy (95%) due to several word errors (e.g., spelling mistakes and incorrect numbers).

Winner: Shakti-4B, with notably better accuracy and fewer errors.


2. Handwritten Text Extraction (Cursive)

Input Image & Prompt given: Extract the text from the image.


Shakti-4B Output:

I'm curious about what you guys think of my cursive. Started practicing about a month ago. My end goal is having a handwriting similar to that of the people who lived in the 1800s. Feel free to comment any resources or tips you think would make me improve. Thank you for your time. Have a great day/night.        

ChatGPT Observation: Achieved perfect accuracy (100%), extracting the handwritten content without errors.

Mistral-OCR Output:

In curious about what you guys think of my cursive. Started practising about a month ago. My end goal is having a handwriting similar to that of the people who lived in the 1900s. But fice to comment any resources or types you think would make me improve. Thank you for you time. Have I have a great day. I right.        

ChatGPT Observation: Managed only 85% accuracy, with significant errors and incorrect interpretation of handwritten content.

Winner: Shakti-4B, with perfect extraction.
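
The judge's percentages can be complemented by a deterministic check. The sketch below is not part of the original study; it uses Python's standard `difflib.SequenceMatcher` to list the word spans where two OCR outputs disagree. The helper name `word_differences` is hypothetical. It only flags disagreements, so deciding which side is right still requires the source image or a reference transcription.

```python
from difflib import SequenceMatcher

def word_differences(text_a, text_b):
    """Return (words_a, words_b) pairs for every span where the two texts disagree."""
    a, b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep replace, insert, and delete spans only
    ]

# Excerpt from the two outputs above (Shakti-4B first, Mistral-OCR second).
shakti = "similar to that of the people who lived in the 1800s."
mistral = "similar to that of the people who lived in the 1900s."
print(word_differences(shakti, mistral))  # → [('1800s.', '1900s.')]
```

Running the same function on the full outputs lists every disagreement in one pass, which makes it easier to check a judge's verdict against the image.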


3. PPT Slide Extraction

Input Image & Prompt given: Extract the text from the image in proper output format.

Shakti-4B Output:

Only unified agent-based and agentless platform for code-to-cloud protection
Pre-Runtime Protection
o	Shift Left | Image Assessment | iAC and Container Scanning | RBAC | APIs | Automatic Remediations
Agent-Based Runtime Protection
o	VM | Containers | Serverless
o	AWS
o	Hybrid Cloud
o	ML | IOA | NGAV | EDR | Vuln. Mgt. | Managed Hunting
Agentless Cloud Security
o	Visibility
o	Misconfigurations
o	Indicators of Attack
o	Applications
o	CIEM and Compliance
Cloud adversary threat intel powering security & compliance
V24.04.26        

ChatGPT Observation: Achieved 95% accuracy; the text was extracted accurately, with minor formatting and numerical discrepancies.

Mistral-OCR Output:

Only unified agent-based and agentless platform for code-to-cloud protection
Pre-Runtime Protection
o	Shift Left
o	Image Assessment
o	IaC and Container Scanning
o	RBAC
o	APIs
o	Automatic Remediations
Agent-Based Runtime Protection
o	VM
o	Containers
o	Serverless
o	Hybrid Cloud
o	ML
o	IOA
o	NGAV
o	EDR
o	Vulnerability Management
o	Managed Hunting
Agentless Cloud Security
o	Visibility
o	Misconfigurations
o	Indicators of Attack
o	Applications
o	CIEM and Compliance
Additional Notes:
o	CI (Continuous Integration) and CD (Continuous Deployment) are mentioned with "Build time" and "Runtime" respectively.
o	The platform emphasizes cloud adversary threat intelligence powering security and compliance.        

ChatGPT Observation: Slightly lower accuracy (92%); the text was also extracted accurately, but with minor formatting inconsistencies.

Winner: Shakti-4B, marginally better due to higher precision and contextual retention.


4. Table Content Extraction

Input Image & Prompt given: Extract content of the table in proper format

Shakti-4B Output:

| Features                    | Shakti-LLM Specification           |
|-----------------------------|------------------------------------|
| Model Parameters            | 2.5 Billion                        |
| Layers                      | 16                                 |
| Model Dimension             | 4096                               |
| FFN Dimension               | 4096                               |
| Attention Heads             | 32                                 |
| Key/Value Heads             | 8                                  |
| Peak Learning Rate          | 3.6e-5                             |
| Activation Function         | SwiGLU                             |
| Vocabulary Size             | 128256                             |
| Positional Embeddings       | RoPE (θ = 500, 000)                |
| GPU Consumption (Raw)       | 9 GB                               |
| GPU Consumption (Quantized) | 4 GB                               |
 
Table 1: Specifications of Shakti-LLM        

ChatGPT Observation: Achieved a structured table extraction with perfect formatting (100% accuracy).

Mistral-OCR Output:

o	Specifications of Shakti-LLM
o	Model Parameters: 2.5 Billion
o	Layers: 16
o	Model Dimension: 4096
o	FFN Dimension: 4096
o	Attention Heads: 32
o	Key/Value Heads: 8
o	Peak Learning Rate: 3.6e-5
o	Activation Function: SwiGLU
o	Vocabulary Size: 128256
o	Positional Embeddings: RoPE (θ = 500,000)
o	GPU Consumption (Raw): 9 GB
o	GPU Consumption (Quantized): 4 GB
        

ChatGPT Observation: Achieved high accuracy (around 90-93%), but the extracted content was overly segmented and the formatting was less clear.

Winner: Shakti-4B, for better-structured and formatted extraction.


5. Billboard Text (Contextual Extraction)

Input Image & Prompt given: Extract the text and other information from the billboard present in the image in proper format.


Shakti-4B Output:

Text on the Billboard:
o	Brand Name: airtel money
o	Main Message: "Make instant school fees payments With you, always."
o	Call to Action: Dial 1856#
Additional Information:
o	Location: The billboard is located in an urban area, as seen in the background with buildings and a clear blue sky.
o	Image: The image shows a man sitting and using a mobile phone, with a background of a fruit stall.
Other Details:
o	Building in the Background: There is a building with a sign that reads "SSUUBI FASHION DESIGNERS."
o	Vehicle: There is a vehicle (possibly a taxi) in the foreground of the image.        

ChatGPT Observation: Provided extensive contextual details (98% accuracy), with a minor formatting error in the dialing code.

Mistral-OCR Output:

 Billboard Information
o	Company: Airtel Money
o	Message: Make instant school fees payments with you, always.
o	Contact: Dial 1856#
o	Visual Elements:
A man holding a mobile phone.
Background image of a school building with the name "KAMAKWALWA."
A basket of vegetables, possibly indicating agricultural or rural context.        

ChatGPT Observation: Achieved 95% accuracy; the extraction was accurate but had fewer contextual details and some formatting inconsistencies.

Winner: Shakti-4B, for more accurate extraction with richer context.


6. Conversation Transcript Extraction

Input Image & Prompt given: Extract the text from the conversation in appropriate format.


Shakti-4B Output:

User: How to purchase the license?
AI: Please login into our website, and click the 'Renew' button on the top right corner.
User: OK, Let me try.
AI: DONE, Thanks.
AI: You are welcome!        

ChatGPT Observation: Achieved 95% accuracy; the dialogue was correct but lacked timestamps.

Mistral-OCR Output:

Conversation:
User (15:27): How to purchase the license?
Assistant (01:38): Please login into our website and click the 'Renew' button on the top right corner.
User (15:28): OK, Let me try.
Assistant (15:29): DONE, Thanks.
Assistant (01:25): You are welcome!
        

ChatGPT Observation: Higher accuracy (98%), with timestamps and speaker assignment handled better.

Winner: Mistral-OCR, due to superior handling of conversation metadata.


7. Mathematical Formula Extraction

Input Image & Prompt given: Extract formulas present in the image along with their description.


Shakti-4B Output:

1.	State Space Model: \[ x'(t) = Ax(t) + Bu(t) \] \[ y(t) = Cx(t) + Du(t) \] Description: This is the standard state space model equation. It maps a 1-D input signal \( u(t) \) to an \( N \)-D latent state \( x(t) \) before projecting to a 1-D output signal \( y(t) \). The matrices \( A, B, C, \) and \( D \) are parameters learned by gradient descent.

2.	HiPPO Matrix: \[ A_{nk} = \begin{cases} (2n + 1)^{1/2}(2k + 1)^{1/2} & \text{if } n > k \\ n + 1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases} \] Description: The HiPPO matrix is defined to allow the state \( x(t) \) to memorize the history of the input \( u(t) \). This helps in addressing long-range dependencies in the state space model. The matrix \( A \) is used to improve the performance of the state space model, as demonstrated by the LSSL on the sequential MNIST benchmark.

ChatGPT Observation: Delivered correct formulas with detailed context; no accuracy figure was explicitly stated.

Mistral-OCR Output:

Formulas and Descriptions
1.	State Space Model Equations:
a.	Description: These equations define a continuous-time latent state model, mapping a 1-D input signal \( u(t) \) to an \( N \)-D latent state \( x(t) \), which is then projected to a 1-D output signal \( y(t) \).
b.	Formulas: \[ x'(t) = \mathbf{A}x(t) + \mathbf{B}u(t) \] \[ y(t) = \mathbf{C}x(t) + \mathbf{D}u(t) \]

2.	HiPPO Matrix:
a.	Description: The HiPPO matrix is a specific type of matrix used in the HiPPO framework for continuous-time memorization. It allows the state \( x(t) \) to memorize the history of the input \( u(t) \).
b.	Formula: \[ \mathbf{A}_{nk} = \begin{cases} (2n + 1)^{1/2}(2k + 1)^{1/2} & \text{if } n > k \\ n + 1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases} \]

ChatGPT Observation: Very good formula extraction (95% accuracy), with clear and accurate descriptions of the formulas.

Winner: Mistral-OCR, slightly superior in detailed extraction and contextual clarity.


8. Rotated/Angled Text Extraction

Input Image & Prompt given: Extract text present in the image accurately.

Shakti-4B Output:

INTRODUCING THE
 
BUILT FROM PASSION
MADE FROM METAL
 
60/1 SERIES        

ChatGPT Observation: Achieved 98% accuracy with minor numerical misinterpretation.

Mistral-OCR Output:

o	Introducing the
o	Built from Passion
o	Made from Metal
o	6061 Series        

ChatGPT Observation: Lower accuracy (95%) with minor formatting errors.

Winner: Shakti-4B, with slightly higher accuracy.


9. Historical Headlines Extraction

Input Image & Prompt given: Analyze the newspaper and extract important headlines from it.


Shakti-4B Output:

The important headlines from the newspaper are:
 
1. "DEALER FARES SHOCK"
2. "OAS Chief Captured In Oran"
3. "Benny Fights For His Life"
4. "London Palladium"
        

ChatGPT Observation: Achieved 90% accuracy, capturing major headlines but missing some details.

Mistral-OCR Output:

Headlines:
1.	Main Headline:
a.	DEARER FARES SHOCK
b.	Bus, Tube And Rail Up, And 2s. In £ On Your Season

2.	Other Headlines:
a.	OAS Chief Captured In Oran
b.	Benny Fights For His Life
c.	At-A-Glance Guide: Your Rail Fares
d.	The Lucky Day
e.	The New Look
f.	The Queen's Day
g.	The Royal Visit        

ChatGPT Observation: Higher accuracy (95%), with a more comprehensive extraction of headlines.

Winner: Mistral-OCR, superior in comprehensiveness and accuracy.


10. Mathematical Formula Extraction

Input Image & Prompt given: Extract the text from the image.


Shakti-4B Output:

The text from the image is:
"removing distortion and recovering the fronto-parallel view of text with different approaches in the literature, the image rectification is carried out with single image. This work a new technique, which is capable of recovering a new single image."        

ChatGPT Observation: Performed with detailed accuracy but had minor numerical and symbol formatting issues.

Mistral-OCR Output:

Here is the text extracted from the image
---
We propose in this work a new technique, which is capable of removing distortion and recovering the fronto-parallel view of text with single image. Different from reported approaches in the literature, the image rectification is carried out using character boundary an
---        

ChatGPT Observation: Achieved 95% accuracy, extracted formulas clearly and comprehensively, with slight advantages in structured presentation.

Winner: Mistral-OCR, slightly superior in handling mathematical notations and symbols clearly.
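
The per-test verdicts can be tallied in a few lines of Python. The sketch below uses only the winners stated in the ten test cases above; the variable names are illustrative. Accuracy percentages are left out on purpose, because some test cases report no figure or only a range.

```python
from collections import Counter

# (test number, test name, winner) exactly as reported in the test cases above.
RESULTS = [
    (1, "Standard PDF Image", "Shakti-4B"),
    (2, "Handwritten Text Extraction (Cursive)", "Shakti-4B"),
    (3, "PPT Slide Extraction", "Shakti-4B"),
    (4, "Table Content Extraction", "Shakti-4B"),
    (5, "Billboard Text (Contextual Extraction)", "Shakti-4B"),
    (6, "Conversation Transcript Extraction", "Mistral-OCR"),
    (7, "Mathematical Formula Extraction", "Mistral-OCR"),
    (8, "Rotated/Angled Text Extraction", "Shakti-4B"),
    (9, "Historical Headlines Extraction", "Mistral-OCR"),
    (10, "Mathematical Formula Extraction", "Mistral-OCR"),
]

# Count how many test cases each model won.
tally = Counter(winner for _, _, winner in RESULTS)

for model, wins in tally.most_common():
    print(f"{model}: {wins} of {len(RESULTS)}")
# Shakti-4B: 6 of 10
# Mistral-OCR: 4 of 10
```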


Once again, we are sharing the Shakti 4B benchmark from our paper on the Shakti VLMs (1B & 4B), published on arXiv.

The paper is available at https://arxiv.org/abs/2502.17092

While Mistral OCR has made commendable strides in OCR technology, our benchmarking indicates that Shakti 4B demonstrates notable advantages in accuracy, versatility, and contextual understanding. Developed by SandLogic Technologies, Shakti 4B addresses the diverse needs of organizations, providing reliable and comprehensive OCR capabilities tailored to modern document digitization requirements.

Note: The benchmarking data referenced is based on internal studies conducted by SandLogic Technologies.
