Shakti-4B: The Multi-Modal AI Model Powering Vision-Language Intelligence

At SandLogic, we've consistently pushed the boundaries of AI innovation across diverse modalities. With the successful launch of Shakti-1B and its impressive performance in text-based tasks, we’ve now taken a significant leap into the realm of Vision-Language Models (VLMs) with our latest innovation — Shakti-4B.

Shakti-4B isn't just another multi-modal model; it represents a shift towards truly perceptive AI systems capable of comprehending text, images, charts, videos, and documents with unprecedented precision. Built on the foundation of Shakti's core architecture, Shakti-4B delivers exceptional performance across multiple benchmarks, positioning itself as a robust contender in the VLM space.

In this article, we’ll deep-dive into the architecture, performance, benchmarks, innovations, and real-world applications of Shakti-4B.


Why Multi-Modal AI Matters

The world we live in is inherently multi-modal — we perceive the world not just through text but through images, sounds, and videos. Traditional language models excel at text comprehension, but the next evolution of AI requires models to "see" and "understand" visual content.

Vision-Language Models (VLMs) bridge this gap, making AI applications more intuitive, capable, and human-like in their understanding. From medical image analysis and autonomous navigation to intelligent document processing, VLMs are redefining possibilities across industries.


Shakti-4B: Architecture and Innovations

Shakti-4B leverages the same underlying principles that define the Shakti LLM family but introduces critical enhancements to handle vision-language tasks:

  1. Variable Grouped Query Attention (VGQA): Adapts attention mechanisms dynamically across text and image embeddings, ensuring efficient handling of long contexts.
  2. SwiGLU Activations: Improves training stability and representation quality when dealing with complex multi-modal data (a minimal sketch follows this list).
  3. Rotary Positional Embeddings (RoPE): Ensures the model effectively processes long sequences across vision and text domains.
  4. Cross-Modal Fusion Layer: Seamlessly fuses information from different modalities for better contextual understanding.
  5. HALUMON Framework Integration: Our proprietary hallucination detection and mitigation framework ensures the model maintains high accuracy and reliability when interacting with real-world data.
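
To make item 2 concrete, here is a minimal sketch of a SwiGLU feed-forward block in PyTorch. It follows the standard published formulation of SwiGLU; the layer names and dimensions are illustrative assumptions, not Shakti-4B's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Standard SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)).

    Illustrative only; the hidden sizes are placeholders, not Shakti-4B's.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) gates the up-projection element-wise before down-projecting.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example: a batch of 2 sequences of 16 tokens at model width 512
x = torch.randn(2, 16, 512)
ffn = SwiGLUFeedForward(d_model=512, d_hidden=1408)
print(ffn(x).shape)  # torch.Size([2, 16, 512])
```

Relative to a plain GELU MLP, the gating path lets the network modulate features multiplicatively, which is one reason SwiGLU variants have become a common choice in recent transformer models.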

These architectural innovations empower Shakti-4B to operate efficiently across domains like healthcare, finance, manufacturing, and customer support while running effectively even on edge devices.


Benchmark Performance: How Shakti-4B Stacks Up

We rigorously tested Shakti-4B against some of the most well-known VLMs, including InternVL2-4B, Phi-3-Vision 4B, MiniCPM-V 2.6 8B, and Qwen2VL-7B. Here’s a detailed look at the results:

[Table: Shakti-4B-VLM benchmark results compared with InternVL2-4B, Phi-3-Vision 4B, MiniCPM-V 2.6 8B, and Qwen2VL-7B]

Key Takeaways from the Benchmark Results

1. Document Intelligence: A Clear Leader

  • Shakti-4B achieves 92.92 on DocVQA, outperformed only by Qwen2VL (94.5).
  • Real-World Impact: Ideal for document-heavy sectors like insurance, healthcare, and legal services.

2. Chart and Text Visual QA: A Competitive Edge

  • Scoring 85.28 on ChartQA and 85.56 on TextVQA, Shakti-4B sets a new standard for visual text comprehension.
  • Real-World Impact: Perfect for financial dashboards and technical reports.

3. Vision-Language Fusion: Strong General Performance

  • With 59.78 on MMMU and 71.18 on RealWorldQA, Shakti-4B outperforms similarly sized models like InternVL2-4B and Phi-3-Vision 4B.
  • Real-World Impact: Useful for customer support applications with image-based queries.

4. Hallucination Mitigation

  • Shakti-4B scored 47.9 on HallusionBench, reflecting the effectiveness of HALUMON in minimizing hallucinated responses.
  • Real-World Impact: Essential for critical sectors like healthcare diagnostics.


Comparative Analysis: Shakti-4B vs QWEN2.5 VL-7B - Breaking Down the Performance Metrics

In our ongoing exploration of emerging AI models, we've conducted an in-depth analysis comparing two notable contenders: Shakti-4B and QWEN2.5 VL-7B. Our evaluation, leveraging GPT-4 as an objective judge, reveals fascinating insights into their respective capabilities and potential applications in real-world scenarios.

Evaluation Methodology

Our analysis employed a sophisticated evaluation approach:

  1. Automated Judgment: We utilized GPT-4 as an impartial judge to assess model outputs where ground truth wasn't readily available (a minimal sketch of this loop follows the list).
  2. Accuracy Metrics: Performance was measured through a combination of direct comparison with ground truth when available, GPT-4's evaluation of response quality and accuracy, and human verification for complex cases.
  3. Scoring System: Accuracy percentages were calculated based on how closely each model's output aligned with expected answers or acceptable response patterns.
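
For readers who want to replicate a similar setup, the sketch below shows a minimal LLM-as-judge scoring loop using the OpenAI Python client. The prompt wording, 0-100 scale, and model name are our own illustrative assumptions, not SandLogic's actual evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a task prompt and a model's
answer, rate the answer's accuracy from 0 to 100. Reply with only the number.

Task prompt: {prompt}
Model answer: {answer}
Reference answer (may be empty): {reference}"""

def judge_accuracy(prompt: str, answer: str, reference: str = "") -> int:
    """Ask GPT-4 to score one model output; assumes the judge replies with a bare number."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, answer=answer, reference=reference)}],
        temperature=0,  # deterministic scoring for run-to-run consistency
    )
    return int(response.choices[0].message.content.strip())

# Example: score a candidate summary for the first task below
score = judge_accuracy(
    prompt="Summarize the key points from the image in 3-5 sentences",
    answer="The image shows the evaluation section of a research paper...",
)
print(f"Judge score: {score}")
```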

Performance Breakdown with Real-World Examples

Let's examine how these models performed across different tasks with specific examples:

1. Image Understanding and Summarization

Input Image


Prompt Given: Summarize the key points from the image in 3-5 sentences

Shakti-4B (80% Accuracy):

The image captures the section of a research paper focusing on the evaluation and results of an experiment related to natural language processing tasks, likely involving models of varying sizes and architectures. The sections are headed by "C. Evaluation Details" and "D. Result Details," detailing methodologies, experiments conducted, results obtained, and discussion of these outcomes. Tables within these sections presumably present comparative metrics and performance analyses of the models evaluated, comparing datasets and training metrics. The document likely provides an analysis of the performance of different models on specific tasks and highlights the insights from the experimentation conducted.        

ChatGPT Observation: Captures the general theme of evaluation and results but lacks specific details from the text, such as dataset names and exact evaluation methods.

QWEN2.5 VL-7B (60% Accuracy):

The image appears to be a page from a scientific paper discussing the evaluation details of a machine learning model for generating captions for images. The key points include:

Evaluation Details: The study evaluates the performance of a machine learning model on generating captions for images, using the Flickr30k dataset. The evaluation is conducted through two main tasks: (a) captioning, where the model generates a sentence that describes the image, and (b) caption quality, which assesses the relevance and coherence of the generated captions.
Model Evaluation: The model's performance is evaluated based on the number of correct words in the generated captions compared to the ground truth captions. The evaluation also considers the model's ability to generate captions with high quality, as measured by human feedback.
Results: The results show that the model performs well on both tasks, achieving high accuracy in generating relevant and coherent captions. The study also highlights the importance of using diverse datasets and human feedback for evaluating the quality of generated captions.
        

ChatGPT Observation: Incorrectly assumes the paper is about image captioning and the Flickr30k dataset, which is not mentioned in the image. Contains fabricated details, reducing accuracy.


2. Technical Documentation Processing

Input Image

Prompt Given: Convert the content in the image to markdown

Shakti-4B (100% Accuracy):

```markdown
# 1. INTRODUCTION

## (1.1) The theta function. 
Let \( f : \mathbb{R}^n \to \mathbb{R}_+ \) be a positive definite quadratic form, so
\[ 
f(x) = \langle Bx, x \rangle \quad \text{for } x \in \mathbb{R}^n,
\]
where \( B \) is an \( n \times n \) positive definite matrix and \(\langle , \rangle\) is the standard scalar product in \(\mathbb{R}^n\). We consider the problem of efficient computing (approximating) the sum

### (1.1.1)
\[
\Theta(B) = \sum_{x \in \mathbb{Z}^n} e^{-f(x)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle Bx, x \rangle},
\]

where \(\mathbb{Z}^n \subseteq \mathbb{R}^n\) is the standard integer lattice. More generally, for a given point \( y \in \mathbb{R}^n \), we want to efficiently compute (approximate) the sum

### (1.1.2)
\[
\Theta(B, y) = \sum_{x \in \mathbb{Z}^n} e^{-f(x-y)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle B(x-y), x-y \rangle}.
\]
```
        

ChatGPT Observation: Perfect Markdown conversion preserving all mathematical expressions accurately, including complex LaTeX formatting for mathematical equations and proper structural hierarchy.

QWEN2.5 VL-7B (70% Accuracy):

The image you provided is a mathematical text discussing the theta function, which is a sum over the integer lattice in ( \mathbb{R}^n ) of the exponential of a quadratic form. Here's a breakdown of the content:

1. Introduction
(1.1) The theta function
Definition of the Theta Function: The theta function ( \Theta(B) ) is defined as the sum of the exponential of a quadratic form ( f(x) = \langle Bx, x \rangle ) over all points ( x \in \mathbb{Z}^n ), where ( \mathbb{Z}^n ) is the standard integer lattice in ( \mathbb{R}^n ). The quadratic form ( f(x) ) is given by ( f(x) = \langle Bx, x \rangle ), where ( B ) is an ( n \times n ) positive definite matrix and ( \langle \cdot, \cdot \rangle ) denotes the standard scalar product in ( \mathbb{R}^n ).
Generalization: The problem can be generalized to compute (or approximate) the sum ( \Theta(B, y) ) for a given point ( y \in \mathbb{R}^n ), where ( \Theta(B, y) = \sum_{x \in \mathbb{Z}^n} e^{-f(x-y)} = \sum_{x \in \mathbb{Z}^n} e^{-\langle B(x-y), x-y \rangle} ).
This discussion sets up the context for understanding the theta function and its generalizations, which are important in various fields such as number theory, combinatorics, and statistical physics.        

ChatGPT Observation: Provided good content structure but lacked proper LaTeX syntax for inline and block equations, affecting the technical accuracy of the mathematical content.


3. Code Generation and Documentation

Input Image

Prompt Given: Write code for a multiplication table program

Shakti-4B (100% Accuracy):

To create the Python code for printing the multiplication table for a given number \( N \) using a for loop, you can use the following code:

```python
def multiplicationTable(N):
    for i in range(1, 11):  # Generates numbers from 1 to 10
        print(i * N, end=" ")  # Prints the multiplication result spaced 

N = int(input("Enter a number: "))  # Input from the user
multiplicationTable(N)
```

Explanation of the code:
1. **`def multiplicationTable(N):`** - This defines a function `multiplicationTable` that takes an integer \( N \) as its input.
2. **`for i in range(1, 11):`** - This loop iterates over the numbers from 1 to 10 (inclusive).
3. **`print(i * N, end=" ")`** - This prints the product of \( N \) and the current number \( i \), separated by a space.
4. **`N = int(input("Enter a number: "))`** - This prompts the user to input the number \( N \).
5. **`multiplicationTable(N)`** - This calls the function `multiplicationTable` with the inputted number \( N \).

This will print the multiplication table for the number \( N \) in a single line with each product separated by a space.        

ChatGPT Observation: Provides correct code and explains it step by step.

QWEN2.5 VL-7B (85% Accuracy):

To solve the problem of printing the multiplication table of a given number N using a for loop in Python, you can use the range function to iterate over the numbers from 1 to N. Here's the complete code:


```python
def multiplicationTable(N):
    # Iterate over the range from 1 to N (inclusive)
    for i in range(1, N + 1):
        print(i * N, end=" ")

# Example usage:
multiplicationTable(5)
```

This code defines a function multiplicationTable that takes an integer N as input and prints its multiplication table. The range(1, N + 1) generates numbers from 1 to N, and the print statement multiplies each number by N and prints it followed by a space.

When you run this code with N = 5, it will output:        

ChatGPT Observation: Provides correct code but does not explain it step by step.


4. Visual Information Processing

Input Image


Prompt Given: Output the color and number of boxes in a visual sequence

Shakti-4B (70% Accuracy):

output the colour and number of each box in the image
1. Blue - 9
2. Blue - 7
3. Blue - 8
4. Purple - 4
5. Pink - 5
6. Green - 6
7. Orange - 1
8. Yellow - 2
9. Yellow - 3
10. Red - 0        

ChatGPT Observation: Provides a structured and concise list of colors and numbers, though some colors may have been misinterpreted depending on the exact shades in the image.

QWEN2.5 VL-7B (70% Accuracy):

output the colour and number of each box in the image
The image shows a stack of colorful blocks with numbers on them. Here is the breakdown of the colors and numbers:

Top block: Blue, Number 9
Second block from top: Light blue, Number 7
Third block from top: Teal, Number 8
Fourth block from bottom: Purple, Number 5
Fifth block from bottom: Pink, Number 6
Bottom block: Red, Number 0
Second block from bottom: Orange, Number 1
Third block from bottom: Yellow, Number 2
Fourth block from bottom: Green, Number 3
So, the sequence is:

Blue (9)
Light blue (7)
Teal (8)
Purple (5)
Pink (6)
Red (0)
Orange (1)
Yellow (2)
Green (3)
        

ChatGPT Observation: Provides a descriptive breakdown, but inconsistencies in color names (e.g., "Light blue," "Teal") and order of numbers could introduce errors.


5. Practical Application Testing

Input Image


Prompt Given: Where can i park my car?

Shakti-4B (100% Accuracy):

Based on the image, you can park your car at slot B1, 47, since it is marked as available in green text. The slot B2, 52 is unavailable in red text.        

ChatGPT Observation: Understands the image and provides the correct answer to the prompt.

QWEN2.5 VL-7B (50% Accuracy):

Based on the information provided in the image, you have two available parking slots:
Slot B1 with 47 available spaces.
Slot B2 with 52 available spaces.
You can choose either of these slots to park your car.        

ChatGPT Observation: Misinterprets the NOT AVAILABLE slot as available spaces.


6. JSON Generation from an Unstructured Document Image

Input Image

Prompt Given: Extract the details in json format and explain

Shakti-4B (60% Accuracy):

```json
{
  "Training Stages": [
    {
      "Stage": "Stage 1",
      "Number of steps": 1000,
      "Learning rate (max, min)": "(10^-4, 10^-4)",
      "Batch size": null,
      "Sequence length": 364^2,
      "Max image resolution": "364^2→728^2"
    },
    {
      "Stage": "Stage 2",
      "Number of steps": 3000,
      "Learning rate (max, min)": "(10^-4, 10^-4)",
      "Batch size": null,
      "Sequence length": 10K,
      "Max image resolution": "728^2→1092^2"
    },
    {
      "Stage": "Stage 3",
      "Number of steps": 1500,
      "Learning rate (max, min)": "(10^-4, 0)",
      "Batch size": 1024,
      "Sequence length": 10K,
      "Max image resolution": "1092^2→1456^2"
    },
    {
      "Stage": "SFT",
      "Number of steps": 5000,
      "Learning rate (max, min)": "(5x10^-5, 0)",
      "Batch size": 1820^2,
      "Sequence length": 1820^2
    }
  ],
  "Backbones training": [
    "Frozen in Stage 1",
    "LoRA in Stage 2",
    "LoRA in Stage 3"
  ],
  "Data for Training": [
    {
      "Stage 1": "OBELICS, LAION COCO, PDFA"
    },
    {
      "Stage 2": "OBELICS, LAION COCO, PDFA, Docmatix, Websight, LNQA"
    },
    {
      "Stage 3": "PDFA, Docmatix, Websight, LNQA"
    },
    {
      "SFT": "The Cauldron, PixelProse, ChartGemma"
    }
  ]
}
```

**Process Explained:**
- **Stage 1:** Initial training with a learning rate of \(10^{-4}\), a sequence length of 364^2 for image resolution 364^2→728^2, and no batch size specified.
- **Stage 2:** Further training with a higher learning rate (\(10^{-4}\)) and a batch size of 1024, expanding sequence length to 10K with images ranging from 728^2 to 1092^2.
- **Stage 3:** Incremented training steps to 1500, using a minimal learning rate of \(10^{-4}\) and batch size of 1024, maintaining sequence length and max image resolution at 10K.
- **SFT Phase:** Final training with 5000 steps, batch size of 1820^2, and a learning rate of \(5 \times 10^{-5}\) focusing on more detailed backbones with datasets like The Cauldron, PixelProse, and ChartGemma.
        

ChatGPT Observation: Incorrect batch size for multiple stages, missing details in dataset assignments, and errors in max image resolution transitions.

QWEN2.5 VL-7B (75% Accuracy):

```json
{
  "training_stages": [
    {
      "stage_name": "Stage 1",
      "steps": 1000,
      "learning_rate": "(10^-4, 10^-4)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "364^2",
      "backbones_training": "Frozen",
      "data": ["OBELICS", "LAION COCO"]
    },
    {
      "stage_name": "Pre-training Stage 2",
      "steps": 3000,
      "learning_rate": "(10^-4, 10^-4)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "728^2 -> 1092^2 -> 1456^2 -> 1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA"]
    },
    {
      "stage_name": "Stage 3",
      "steps": 1500,
      "learning_rate": "(10^-4, 0)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA"]
    },
    {
      "stage_name": "SFT",
      "steps": 5000,
      "learning_rate": "(5x10^-5, 0)",
      "batch_size": 1024,
      "sequence_length": "10K",
      "max_image_resolution": "1820^2",
      "backbones_training": "LoRA",
      "data": ["OBELICS", "LAION COCO", "PDFA", "Docmatix", "Websight", "LNQA", "PixelProse", "ChartGemma", "The Cauldron"]
    }
  ]
}
```

Explanation:
Training Stages: The table outlines four different stages of training for the model Idefics3.
Stage 1: This stage involves 1000 steps with a learning rate of (10^{-4}) (constant), using a batch size of 1024, and a sequence length of 10K. The maximum image resolution starts at (364^2) and increases stepwise to (1820^2). The backbones are frozen, and the data used includes OBELICS and LAION COCO.
Pre-training Stage 2: This stage has 3000 steps with a learning rate of (10^{-4}) (constant), using the same batch size and sequence length as Stage 1. The maximum image resolution increases from (728^2) to (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, and PDFA.
Stage 3: This stage has 1500 steps with a learning rate of (10^{-4}) (constant) decreasing to 0, using the same batch size and sequence length as previous stages. The maximum image resolution remains at (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, and PDFA.
SFT (Structured Fine-Tuning): This stage has 5000 steps with a learning rate of (5 \times 10^{-5}) (constant) decreasing to 0, using the same batch size and sequence length as previous stages. The maximum image resolution remains at (1820^2). The backbones are trained using LoRA, and the data includes OBELICS, LAION COCO, PDFA, Docmatix, Websight, LNQA, PixelProse, ChartGemma, and The Cauldron.
This structure helps in gradually increasing the complexity of the training process, starting from simpler tasks and moving towards more complex ones, ensuring that the model learns progressively and effectively.        

ChatGPT Observation: Overestimates dataset usage in later stages and mislabels certain training stages but captures key structural elements better than Shakti.
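
One practical lesson from this task: both models occasionally emit structurally invalid JSON (note the unquoted 364^2 and 10K values in Shakti-4B's output above), so an extraction pipeline should validate outputs before scoring or ingesting them. Below is a minimal validation helper; the function name and samples are our own illustrations, not part of either model's tooling.

```python
import json

def validate_extraction(raw: str) -> dict | None:
    """Parse model-extracted JSON; print a diagnostic and return None on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

# The unquoted 364^2 below mirrors the failure mode seen in the outputs above.
assert validate_extraction('{"Sequence length": 364^2}') is None
assert validate_extraction('{"Number of steps": 1000}') == {"Number of steps": 1000}
```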


Key Takeaways for Industry Professionals

  1. Specialized Task Handling: Shakti-4B demonstrates particular strength in tasks requiring precise technical understanding and structured output generation, as evidenced by its superior performance in code documentation and technical content processing.
  2. Documentation Capabilities: While both models can generate functional code, Shakti-4B's ability to provide comprehensive documentation makes it particularly valuable for development teams requiring detailed explanations.
  3. Visual Processing Balance: Both models show similar capabilities in basic visual information processing, though they differ in their approach to describing and categorizing visual elements.

Implications for Enterprise Applications

These findings have significant implications for organizations considering AI implementation:

  • For technical documentation and code generation workflows, Shakti-4B's superior performance in providing detailed explanations could enhance developer productivity.
  • In visual processing applications, either model could be suitable, with the choice depending on whether structured listing or descriptive analysis is more valuable for the specific use case.
  • For applications requiring high accuracy in status interpretation (such as monitoring systems), Shakti-4B's superior performance in this area makes it a more reliable choice.

Evaluation Insights

The use of GPT-4 as an evaluation tool provided several key insights:

  1. Objective Assessment: GPT-4's role as a judge helped eliminate potential human bias in evaluating model outputs, particularly in subjective tasks like summarization and description.
  2. Consistency in Scoring: The automated evaluation process ensured consistent scoring across all tasks, making the comparison more reliable.
  3. Detail-Oriented Analysis: GPT-4's ability to analyze nuanced differences in outputs helped identify subtle strengths and weaknesses in each model's performance.



Real-World Applications: Where Shakti-4B Excels

1. Intelligent Document Processing

Shakti-4B's 92.92 score on DocVQA makes it ideal for automating document-heavy processes in the finance, legal, and healthcare industries.

Use Case:

  • Automating insurance claim form processing.
  • Extracting critical insights from medical imaging and patient records.


2. Chart and Graph Analysis

With 85.28 on ChartQA, Shakti-4B can assist in deriving insights from complex charts and infographics.

Use Case:

  • Powering LexIQ, our LLM assistant for PowerBI.
  • Generating business performance reports with visual insights.


3. Customer Interaction Insights

Its strong performance on RealWorldQA (71.18) and TextVQA (85.56) makes Shakti-4B highly effective in customer-facing applications.

Use Case:

  • Integrating with Lingo to provide sentiment analysis from customer images, screenshots, and documents.
  • Enhancing call center QA automation by analyzing customer interactions involving text, images, and voice.


4. Assistive Technologies

The model's proficiency in OCRBench (840) and InfoVQA (77.3) makes it useful for assistive applications like Project Pious, which aids visually impaired individuals in navigating environments.

Use Case:

  • Real-time text extraction from signboards.
  • Context-aware navigation support for visually impaired users.


5. Retail and Inventory Management

Shakti-4B's ability to analyze product images and textual content positions it as an asset in retail automation.

Use Case:

  • Automating product catalog updates.
  • Real-time inventory monitoring using image-based recognition systems.


The Road Ahead: What's Next for Shakti-4B?

While Shakti-4B marks a significant milestone in SandLogic's journey toward advanced multi-modal AI, we remain committed to continuous innovation. The next steps involve:

  1. Expanding Multi-Modal Datasets: Introducing more diverse datasets spanning domains like medical imaging, retail, and video analytics.
  2. Optimizing for Edge AI: Leveraging our proprietary ExSLerate AI co-processor chip (Krsna) to run Shakti-4B models on ultra-low-power devices.
  3. Enhancing Video QA: Training on larger video datasets to boost MTVQA performance.
  4. Enterprise Deployments: Extending Shakti-4B’s capabilities through LingoForge, ensuring secure, on-premise model deployment for enterprises.


What’s Next: The Road to Shakti-8B (MoE Architecture)

We’re not stopping at Shakti-4B. Our next milestone is Shakti-8B, a Mixture of Experts (MoE) model, designed for:

  1. Larger-scale multi-modal applications
  2. More powerful video understanding
  3. Enhanced real-world reasoning
  4. Scalable enterprise deployments

Shakti-8B will bring even greater multi-modal intelligence while maintaining efficiency and enterprise readiness.
