Shakti-4B: The Multi-Modal AI Model Powering Vision-Language Intelligence
Kamalakar Devaki
Founder & CEO at SandLogic | AI Innovation Leader | Forbes Select 200 | Asia Business Leader of the Year
At SandLogic, we've consistently pushed the boundaries of AI innovation across diverse modalities. With the successful launch of Shakti-1B and its impressive performance in text-based tasks, we’ve now taken a significant leap into the realm of Vision-Language Models (VLMs) with our latest innovation — Shakti-4B.
Shakti-4B isn't just another multi-modal model; it represents a shift towards truly perceptive AI systems capable of comprehending text, images, charts, videos, and documents with unprecedented precision. Built on the foundation of Shakti's core architecture, Shakti-4B delivers exceptional performance across multiple benchmarks, positioning itself as a robust contender in the VLM space.
In this article, we’ll deep-dive into the architecture, performance, benchmarks, innovations, and real-world applications of Shakti-4B.
Why Multi-Modal AI Matters
The world we live in is inherently multi-modal: we perceive it not just through text but through images, sounds, and video. Traditional language models excel at text comprehension, but the next evolution of AI requires models to "see" and "understand" visual content.
Vision-Language Models (VLMs) bridge this gap, making AI applications more intuitive, capable, and human-like in their understanding. From medical image analysis and autonomous navigation to intelligent document processing, VLMs are redefining possibilities across industries.
Shakti-4B: Architecture and Innovations
Shakti-4B leverages the same underlying principles that define the Shakti LLM family, with critical enhancements that allow it to handle vision-language tasks.
These architectural enhancements empower Shakti-4B to operate efficiently across domains like healthcare, finance, manufacturing, and customer support, while running effectively even on edge devices.
Benchmark Performance: How Shakti-4B Stacks Up
We rigorously tested Shakti-4B against some of the most well-known VLMs, including InternVL2-4B, Phi-3-Vision 4B, MiniCPM-V 2.6 8B, and Qwen2-VL-7B. The detailed results are summarized below.
Key Takeaways from the Benchmark Results
1. Document Intelligence: A Clear Leader
2. Chart and Text Visual QA: A Competitive Edge
3. Vision-Language Fusion: Strong General Performance
4. Hallucination
Comparative Analysis: Shakti-4B vs QWEN2.5 VL-7B - Breaking Down the Performance Metrics
In our ongoing exploration of emerging AI models, we've conducted an in-depth analysis comparing two notable contenders: Shakti-4B and QWEN2.5 VL-7B. Our evaluation, leveraging GPT-4 as an objective judge, reveals fascinating insights into their respective capabilities and potential applications in real-world scenarios.
Evaluation Methodology
Our analysis employed a straightforward but rigorous evaluation approach: both models received identical prompts and images, and GPT-4 scored each response for accuracy against the source content.
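As a minimal sketch of what such a judging loop can look like (assuming the OpenAI Python client; the rubric wording and the 0-100 scale are our illustration, not the exact harness used), consider:

```python
# Minimal sketch of an LLM-as-judge loop, assuming the OpenAI Python
# client. The rubric wording and 0-100 scale are illustrative, not the
# exact harness used in this evaluation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(prompt: str, reference: str, model_answer: str) -> str:
    """Ask GPT-4 to grade a model's answer against the source content."""
    rubric = (
        "You are an impartial judge. Given a task prompt, the ground-truth "
        "content, and a model's answer, score the answer from 0 to 100 for "
        "factual accuracy and completeness, then briefly justify the score."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": (
                f"Prompt: {prompt}\n"
                f"Ground truth: {reference}\n"
                f"Answer: {model_answer}"
            )},
        ],
    )
    return response.choices[0].message.content
```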
Performance Breakdown with Real-World Examples
Let's examine how these models performed across different tasks with specific examples:
1. Image Understanding and Summarization
Input Image
Prompt Given: Summarize the key points from the image in 3-5 sentences
Shakti-4B (80% Accuracy):
The image captures the section of a research paper focusing on the evaluation and results of an experiment related to natural language processing tasks, likely involving models of varying sizes and architectures. The sections are headed by "C. Evaluation Details" and "D. Result Details," detailing methodologies, experiments conducted, results obtained, and discussion of these outcomes. Tables within these sections presumably present comparative metrics and performance analyses of the models evaluated, comparing datasets and training metrics. The document likely provides an analysis of the performance of different models on specific tasks and highlights the insights from the experimentation conducted.
ChatGPT Observation: Captures the general theme of evaluation and results but lacks specific details from the text, such as dataset names and exact evaluation methods.
QWEN2.5 VL-7B (60% Accuracy):
The image appears to be a page from a scientific paper discussing the evaluation details of a machine learning model for generating captions for images. The key points include:
Evaluation Details: The study evaluates the performance of a machine learning model on generating captions for images, using the Flickr30k dataset. The evaluation is conducted through two main tasks: (a) captioning, where the model generates a sentence that describes the image, and (b) caption quality, which assesses the relevance and coherence of the generated captions.
Model Evaluation: The model's performance is evaluated based on the number of correct words in the generated captions compared to the ground truth captions. The evaluation also considers the model's ability to generate captions with high quality, as measured by human feedback.
Results: The results show that the model performs well on both tasks, achieving high accuracy in generating relevant and coherent captions. The study also highlights the importance of using diverse datasets and human feedback for evaluating the quality of generated captions.
ChatGPT Observation: Incorrectly assumes the paper is about image captioning and the Flickr30k dataset, which is not mentioned in the image. Contains fabricated details, reducing accuracy.
2. Technical Documentation Processing
Input Image
Prompt Given: Convert the content in the image to markdown
Shakti-4B (100% Accuracy):
```markdown
# 1. INTRODUCTION
## (1.1) The theta function.
Let \( f : \mathbb{R}^n \to \mathbb{R}_+ \) be a positive definite quadratic form, so
\[
f(x) = \langle Bx, x \rangle \quad \text{for } x \in \mathbb{R}^n,
\]
where \( B \) is an \( n \times n \) positive definite matrix and \(\langle , \rangle\) is the standard scalar product in \(\mathbb{R}^n\). We consider the problem of efficient computing (approximating) the sum
### (1.1.1)
\[
\Theta(B) = \sum_{x \in \mathbb{Z}^n} e^{-f(x)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle Bx, x \rangle},
\]
where \(\mathbb{Z}^n \subseteq \mathbb{R}^n\) is the standard integer lattice. More generally, for a given point \( y \in \mathbb{R}^n \), we want to efficiently compute (approximate) the sum
### (1.1.2)
\[
\Theta(B, y) = \sum_{x \in \mathbb{Z}^n} e^{-f(x-y)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle B(x-y), x-y \rangle}.
\]
```
ChatGPT Observation: Perfect Markdown conversion preserving all mathematical expressions accurately, including complex LaTeX formatting for mathematical equations and proper structural hierarchy.
QWEN2.5 VL-7B (70% Accuracy):
The image you provided is a mathematical text discussing the theta function, which is a sum over the integer lattice in ( \mathbb{R}^n ) of the exponential of a quadratic form. Here's a breakdown of the content:
1. Introduction
(1.1) The theta function
Definition of the Theta Function: The theta function ( \Theta(B) ) is defined as the sum of the exponential of a quadratic form ( f(x) = \langle Bx, x \rangle ) over all points ( x \in \mathbb{Z}^n ), where ( \mathbb{Z}^n ) is the standard integer lattice in ( \mathbb{R}^n ). The quadratic form ( f(x) ) is given by ( f(x) = \langle Bx, x \rangle ), where ( B ) is an ( n \times n ) positive definite matrix and ( \langle \cdot, \cdot \rangle ) denotes the standard scalar product in ( \mathbb{R}^n ).
Generalization: The problem can be generalized to compute (or approximate) the sum ( \Theta(B, y) ) for a given point ( y \in \mathbb{R}^n ), where ( \Theta(B, y) = \sum_{x \in \mathbb{Z}^n} e^{-f(x-y)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle B(x-y), x-y \rangle} ).
This discussion sets up the context for understanding the theta function and its generalizations, which are important in various fields such as number theory, combinatorics, and statistical physics.
ChatGPT Observation: Provided good content structure but lacked proper LaTeX syntax for inline and block equations, affecting the technical accuracy of the mathematical content.
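As an aside, the theta function both models were asked to transcribe is easy to sanity-check numerically: because its terms decay like \(e^{-\langle Bx, x\rangle}\), truncating the lattice sum to a small box already gives a good approximation. A quick illustration (ours, not part of either model's output), assuming NumPy:

```python
# Numerical sanity check of the theta function from the excerpt:
# Theta(B) = sum over x in Z^n of exp(-<Bx, x>), truncated to the box
# [-R, R]^n; the exponential decay makes convergence fast.
import itertools
import numpy as np

def theta(B: np.ndarray, R: int = 10) -> float:
    n = B.shape[0]
    total = 0.0
    for x in itertools.product(range(-R, R + 1), repeat=n):
        v = np.array(x, dtype=float)
        total += np.exp(-(v @ B @ v))
    return total

# For B = I in one dimension this recovers the classical value
# sum over k of exp(-k^2), approximately 1.7726.
print(theta(np.eye(1)))
```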
3. Code Generation and Documentation
Input Image
Prompt Given: Write code for a multiplication table program
Shakti-4B (100% Accuracy):
To create the Python code for printing the multiplication table for a given number \( N \) using a for loop, you can use the following code:
```python
def multiplicationTable(N):
    for i in range(1, 11):  # Generates numbers from 1 to 10
        print(i * N, end=" ")  # Prints the multiplication result spaced

N = int(input("Enter a number: "))  # Input from the user
multiplicationTable(N)
```
Explanation of the code:
1. **`def multiplicationTable(N):`** - This defines a function `multiplicationTable` that takes an integer \( N \) as its input.
2. **`for i in range(1, 11):`** - This loop iterates over the numbers from 1 to 10 (inclusive).
3. **`print(i * N, end=" ")`** - This prints the product of \( N \) and the current number \( i \), separated by a space.
4. **`N = int(input("Enter a number: "))`** - This prompts the user to input the number \( N \).
5. **`multiplicationTable(N)`** - This calls the function `multiplicationTable` with the inputted number \( N \).
This will print the multiplication table for the number \( N \) in a single line with each product separated by a space.
ChatGPT Observation: Provides correct code and explains it step by step.
QWEN2.5 VL-7B (85% Accuracy):
To solve the problem of printing the multiplication table of a given number N using a for loop in Python, you can use the range function to iterate over the numbers from 1 to N. Here's the complete code:
```python
def multiplicationTable(N):
    # Iterate over the range from 1 to N (inclusive)
    for i in range(1, N + 1):
        print(i * N, end=" ")

# Example usage:
multiplicationTable(5)
```
This code defines a function multiplicationTable that takes an integer N as input and prints its multiplication table. The range(1, N + 1) generates numbers from 1 to N, and the print statement multiplies each number by N and prints it followed by a space.
When you run this code with N = 5, it will output: 5 10 15 20 25
ChatGPT Observation: Provides correct code, but does not explain the code step by step
4. Visual Information Processing
Input Image
Prompt Given: Output the color and number of boxes in a visual sequence
Shakti-4B (70% Accuracy):
1. Blue - 9
2. Blue - 7
3. Blue - 8
4. Purple - 4
5. Pink - 5
6. Green - 6
7. Orange - 1
8. Yellow - 2
9. Yellow - 3
10. Red - 0
ChatGPT Observation: Provides a structured and concise list of colors and numbers, though some colors may have been misinterpreted depending on the exact shades visible.
QWEN2.5 VL-7B (70% Accuracy):
The image shows a stack of colorful blocks with numbers on them. Here is the breakdown of the colors and numbers:
Top block: Blue, Number 9
Second block from top: Light blue, Number 7
Third block from top: Teal, Number 8
Fourth block from bottom: Purple, Number 5
Fifth block from bottom: Pink, Number 6
Bottom block: Red, Number 0
Second block from bottom: Orange, Number 1
Third block from bottom: Yellow, Number 2
Fourth block from bottom: Green, Number 3
So, the sequence is:
Blue (9)
Light blue (7)
Teal (8)
Purple (5)
Pink (6)
Red (0)
Orange (1)
Yellow (2)
Green (3)
ChatGPT Observation: Provides a descriptive breakdown, but inconsistencies in color names (e.g., "Light blue," "Teal") and order of numbers could introduce errors.
5. Practical Application Testing
Input Image
Prompt Given: Where can i park my car?
Shakti-4B (100% Accuracy):
Based on the image, you can park your car at slot B1, 47, since it is marked as available in green text. The slot B2, 52 is unavailable in red text.
ChatGPT Observation: Understands the image and provides the correct answer to the prompt.
QWEN2.5 VL-7B (50% Accuracy):
Based on the information provided in the image, you have two available parking slots:
Slot B1 with 47 available spaces.
Slot B2 with 52 available spaces.
You can choose either of these slots to park your car.
ChatGPT Observation: Misinterprets the NOT AVAILABLE slot as available spaces.
6. JSON Generation from an Unstructured Document Image
Input Image
Prompt Given: Extract the details in json format and explain
Shakti-4B (60% Accuracy):
```json
{
  "Training Stages": [
    {
      "Stage": "Stage 1",
      "Number of steps": 1000,
      "Learning rate (max, min)": "(10^-4, 10^-4)",
      "Batch size": null,
      "Sequence length": 364^2,
      "Max image resolution": "364^2→728^2"
    },
    {
      "Stage": "Stage 2",
      "Number of steps": 3000,
      "Learning rate (max, min)": "(10^-4, 10^-4)",
      "Batch size": null,
      "Sequence length": 10K,
      "Max image resolution": "728^2→1092^2"
    },
    {
      "Stage": "Stage 3",
      "Number of steps": 1500,
      "Learning rate (max, min)": "(10^-4, 0)",
      "Batch size": 1024,
      "Sequence length": 10K,
      "Max image resolution": "1092^2→1456^2"
    },
    {
      "Stage": "SFT",
      "Number of steps": 5000,
      "Learning rate (max, min)": "(5x10^-5, 0)",
      "Batch size": 1820^2,
      "Sequence length": 1820^2
    }
  ],
  "Backbones training": [
    "Frozen in Stage 1",
    "LoRA in Stage 2",
    "LoRA in Stage 3"
  ],
  "Data for Training": [
    {
      "Stage 1": "OBELICS, LAION COCO, PDFA"
    },
    {
      "Stage 2": "OBELICS, LAION COCO, PDFA, Docmatix, Websight, LNQA"
    },
    {
      "Stage 3": "PDFA, Docmatix, Websight, LNQA"
    },
    {
      "SFT": "The Cauldron, PixelProse, ChartGemma"
    }
  ]
}
```
**Process Explained:**
- **Stage 1:** Initial training with a learning rate of \(10^{-4}\), a sequence length of 364^2 for image resolution 364^2→728^2, and no batch size specified.
- **Stage 2:** Further training with a higher learning rate (\(10^{-4}\)) and a batch size of 1024, expanding sequence length to 10K with images ranging from 728^2 to 1092^2.
- **Stage 3:** Incremented training steps to 1500, using a minimal learning rate of \(10^{-4}\) and batch size of 1024, maintaining sequence length and max image resolution at 10K.
- **SFT Phase:** Final training with 5000 steps, batch size of 1820^2, and a learning rate of \(5 \times 10^{-5}\) focusing on more detailed backbones with datasets like The Cauldron, PixelProse, and ChartGemma.
ChatGPT Observation: Incorrect batch size for multiple stages, missing details in dataset assignments, and errors in max image resolution transitions.
QWEN2.5 VL-7B (75% Accuracy):
```json
{
  "training_stages": [
    {
      "stage_name": "Stage 1",
      "steps": 1000,
      "learning_rate": "(10^-4, 10^-4)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "364^2",
      "backbones_training": "Frozen",
      "data": ["OBELICS", "LAION COCO"]
    },
    {
      "stage_name": "Pre-training Stage 2",
      "steps": 3000,
      "learning_rate": "(10^-4, 10^-4)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "728^2 -> 1092^2 -> 1456^2 -> 1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA"]
    },
    {
      "stage_name": "Stage 3",
      "steps": 1500,
      "learning_rate": "(10^-4, 0)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA"]
    },
    {
      "stage_name": "SFT",
      "steps": 5000,
      "learning_rate": "(5x10^-5, 0)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA", "Docmatix", "Websight", "LNQA", "PixelProse", "ChartGemma", "The Cauldron"]
    }
  ]
}
```
Explanation:
Training Stages: The table outlines four different stages of training for the model Idefics3.
Stage 1: This stage involves 1000 steps with a learning rate of (10^{-4}) (constant), using a batch size of 1024, and a sequence length of 10K. The maximum image resolution starts at (364^2) and increases stepwise to (1820^2). The backbones are frozen, and the data used includes OBELICS and LAION COCO.
Pre-training Stage 2: This stage has 3000 steps with a learning rate of (10^{-4}) (constant), using the same batch size and sequence length as Stage 1. The maximum image resolution increases from (728^2) to (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, and PDFA.
Stage 3: This stage has 1500 steps with a learning rate of (10^{-4}) (constant) decreasing to 0, using the same batch size and sequence length as previous stages. The maximum image resolution remains at (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, and PDFA.
SFT (Structured Fine-Tuning): This stage has 5000 steps with a learning rate of (5 \times 10^{-5}) (constant) decreasing to 0, using the same batch size and sequence length as previous stages. The maximum image resolution remains at (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, PDFA, Docmatix, Websight, LNQA, PixelProse, ChartGemma, and The Cauldron.
This structure helps in gradually increasing the complexity of the training process, starting from simpler tasks and moving towards more complex ones, ensuring that the model learns progressively and effectively.
ChatGPT Observation: Overestimates dataset usage in later stages and mislabels certain training stages but captures key structural elements better than Shakti.
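One practical lesson from this task: model-emitted JSON should never be trusted blindly. Below is an illustrative guardrail of the kind we would put in any extraction pipeline (the helper is our sketch, not part of either model's tooling); it would correctly reject Shakti-4B's output above, since bare tokens like 10K and 364^2 are not valid JSON.

```python
# Illustrative guardrail for model-generated JSON (our sketch, not part
# of either model's tooling): strip a surrounding code fence, parse
# strictly, and surface failures instead of silently accepting
# malformed output.
import json
import re

def parse_model_json(raw: str) -> dict:
    # Remove a surrounding ```json ... ``` fence if the model added one.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    text = match.group(1) if match else raw
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model emitted invalid JSON: {exc}") from exc
```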
Key Takeaways for Industry Professionals
Implications for Enterprise Applications
These findings have significant implications for organizations considering AI implementation: a smaller, well-tuned 4B model can match or beat a larger 7B model on document-centric and structured-output tasks, which changes the cost calculus for deployment.
Evaluation Insights
The use of GPT-4 as an evaluation tool provided several key insights; most notably, a judge model can catch fabricated details (such as the invented Flickr30k reference above) that simple accuracy metrics would miss.
Real-World Applications: Where Shakti-4B Excels
1. Intelligent Document Processing
Shakti-4B's 92.92 score on DocVQA makes it ideal for automating document-heavy processes in the finance, legal, and healthcare industries; an illustrative sketch of such a pipeline follows below.
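To make that concrete, here is a sketch of how a document-QA call could look through the Hugging Face transformers Vision2Seq API. The model ID is hypothetical (it assumes a published Shakti-4B checkpoint), and the exact prompt format and processor behavior would depend on the actual release:

```python
# Hedged sketch of a document-VQA call via the Hugging Face transformers
# Vision2Seq API. The model ID is hypothetical (it assumes a published
# Shakti-4B checkpoint); the prompt format depends on the actual release.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "SandLogic/Shakti-4B"  # hypothetical Hub ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("invoice.png")  # any scanned document
prompt = "What is the total amount due on this invoice?"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```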
2. Chart and Graph Analysis
With 85.28 on ChartQA, Shakti-4B can assist in deriving insights from complex charts and infographics.
3. Customer Interaction Insights
Its strong performance on RealWorldQA (71.18) and TextVQA (85.56) makes Shakti-4B highly effective in customer-facing applications.
4. Assistive Technologies
The model's proficiency on OCRBench (840) and InfoVQA (77.3) makes it useful for assistive applications like Project Pious, which aids visually impaired individuals in navigating their environments.
5. Retail and Inventory Management
Shakti-4B's ability to analyze product images and textual content positions it as an asset in retail automation.
The Road Ahead: What's Next for Shakti-4B?
While Shakti-4B marks a significant milestone in SandLogic's journey toward advanced multi-modal AI, we remain committed to continuous innovation. The next step on that path is described below.
What’s Next: The Road to Shakti-8B (MoE Architecture)
We're not stopping at Shakti-4B. Our next milestone is Shakti-8B, a Mixture of Experts (MoE) model.
Shakti-8B will bring even greater multi-modal intelligence while maintaining efficiency and enterprise readiness.
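For readers unfamiliar with the approach: in an MoE layer, a small gating network routes each token to a few specialist expert networks, so total parameter count grows without a matching increase in per-token compute. Below is a minimal PyTorch sketch of top-k routing, an illustration of the general technique rather than Shakti-8B's actual design:

```python
# Minimal top-k Mixture-of-Experts layer in PyTorch: an illustration of
# the general routing idea, not Shakti-8B's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Each token is routed to its top-k experts;
        # outputs are combined with softmax-normalized gate weights.
        scores = self.gate(x)                           # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (tokens, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                gate_w = weights[token_ids, slot].unsqueeze(-1)
                out[token_ids] += gate_w * expert(x[token_ids])
        return out
```

Because only the selected experts run for each token, an MoE model can hold far more parameters than it spends compute on per token, which is what makes the design attractive for efficient, enterprise-ready deployment.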
"Impossible is Nothing" Docpreneur II Customer Centricity || Healthcare Strategy || Operations Excellence || Digital Business Transformation || Healthcare Startups || Innovations
1 周Love this
Director at Trainingguru.co.in
1 周Very interesting to learn about this multi-modal AI with Shakti-4B Kamalakar Devaki
Founder & CEO Nummero & Ginee Marktech | IIMB
1 周I agree
Founder & CEO Nummero & Ginee Marktech | IIMB
1 周Exciting to see the advancements in multi-modal AI with Shakti-4B! The potential for transforming industries is immense, especially in enhancing decision-making processes and automating complex tasks. Looking forward to witnessing the impact of this cutting-edge technology across various sectors. #AIInnovation #EnterpriseAI #Shakti4B