Multimodal Prompting with Llama 3.2

Introduction to Multimodal Prompting

In the world of advanced AI, multimodal prompting is gaining prominence. It involves processing different types of input, such as text, images, or both together, and generating responses based on them. The ability to handle multimodal data expands the functionality of models beyond simple text generation, enabling them to respond intelligently to visual inputs or to a combination of text and visuals.

In this blog, we’ll dive into implementing multimodal prompts using Llama 3.2 and explore a step-by-step approach with Python code to demonstrate how text and image inputs can be processed together.

Getting Started with Environment Setup

Before starting, we need to configure our environment. The load_env function below loads environment variables, such as the API base URL and API key, from a .env file.

from dotenv import load_dotenv, find_dotenv

def load_env():
    _ = load_dotenv(find_dotenv())
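
If you have not used python-dotenv before: it reads key-value pairs from a .env file in your project directory and exposes them through os.getenv. The variable names below, DLAI_TOGETHER_API_BASE and Api_Key, simply mirror what the request code later in this post expects; they are assumptions about your own .env contents. A quick sanity check after loading might look like this:

import os

# Load variables from the .env file into the process environment
load_env()

# Confirm that the variables the request code relies on are available
print("API base:", os.getenv("DLAI_TOGETHER_API_BASE", "https://api.together.xyz"))
print("API key set:", os.getenv("Api_Key") is not None)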
        

Implementing Llama 3.2 Multimodal Prompting

We define a function, llama3, which handles requests to the Llama 3.2 Vision model API. It sends a list of messages to the model, specifying options such as the maximum number of tokens, the temperature, and the stop sequences.

import os
import json
import requests

def llama3(messages, model_size=11):
    model = f"meta-llama/Llama-3.2-{model_size}B-Vision-Instruct-Turbo"
    url = f"{os.getenv('DLAI_TOGETHER_API_BASE', 'https://api.together.xyz')}/v1/chat/completions"
    
    payload = {
        "model": model,
        "max_tokens": 4096,
        "temperature": 0.0,
        "stop": ["<|eot_id|>", "<|eom_id|>"],
        "messages": messages
    }
    
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.getenv('Api_Key')}"
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    res = response.json()

    if 'error' in res:
        raise Exception(res['error'])
    
    return res['choices'][0]['message']['content']
        

This function accepts a list of messages containing text or image-based prompts, sends a POST request to the model API, and returns the content of the AI's response.

Displaying Images with the disp_image Function

Next, let’s add functionality for displaying images directly in our code. The disp_image function takes a URL or a local file path and displays the image using matplotlib.

from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

def disp_image(address):
    if address.startswith("http://") or address.startswith("https://"):
        response = requests.get(address)
        img = Image.open(BytesIO(response.content))
    else:
        img = Image.open(address)
        
    plt.imshow(img)
    plt.axis('off')
    plt.show()
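
For example, once disp_image is defined, either form works; both the URL and the file name below are placeholders:

# Display an image fetched from the web
disp_image("https://example.com/photo.jpeg")

# Display an image stored locally next to the notebook
disp_image("photo.jpeg")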
        

Text Input-Only Prompting

Let’s begin with a simple text-only prompt where the model is asked a question:

# Example prompt
messages = [
  {"role": "user", "content": "How far is the moon?"}
]

# Invoke the 90B vision model (omitting the second argument uses the 11B default)
response = llama3(messages, 90)
print(response)

Image-Based Prompting

To take it a step further, let’s introduce an image as part of the input. Below is a simple example of describing an image using a URL.

# Image URL (placeholder; replace with a publicly accessible image URL)
image_url = "https://raw.githubusercontent.com/image.jpeg"

# Create prompt with text and image
messages = [
  {"role": "user", 
   "content": [
      {"type": "text", "text": "Describe the image "}, 
      {"type": "image_url", "image_url": {"url": image_url}}
   ]
  }
]

# Display the image and get the model’s response
disp_image(image_url)
result = llama3(messages, 90)
print(result)        

This example shows how the model can analyze a given image URL and return a text description.
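
The example above assumes the image is reachable at a public URL. If it lives on disk instead, one common pattern with OpenAI-compatible chat APIs such as Together's is to base64-encode the file into a data URL and pass it through the same image_url field. The sketch below follows that assumption and is not part of the original notebook; the file name is a placeholder.

import base64

def local_image_to_data_url(path, mime="image/jpeg"):
    # Read the file and embed its bytes as a base64-encoded data URL
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

local_messages = [
  {"role": "user",
   "content": [
      {"type": "text", "text": "Describe the image"},
      {"type": "image_url", "image_url": {"url": local_image_to_data_url("photo.jpeg")}}
   ]
  }
]

local_result = llama3(local_messages, 90)
print(local_result)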

Follow-up Questions About an Image

Multimodal prompts can also handle follow-up queries based on an image. Let's look at an example where the model first describes the image and we then follow up with a question about it.


messages = [
  {"role": "user", 
   "content": [
      {"type": "text", "text": "Describe the image"}, 
      {"type": "image_url", "image_url": {"url": image_url}}]
  },
  {"role": "assistant", "content": result},  
  {"role": "user", "content": "How many of them are having hat?"}
]

# Invoke the model with a follow-up question
result = llama3(messages, 90)
print(result)        

This setup first asks the model to describe the image and then follows up with a specific question about details in it, showing how multimodal prompts allow for iterative, context-aware querying.
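
If you plan to ask several follow-up questions, it helps to accumulate the conversation instead of rebuilding the messages list by hand each time. The helper below is a minimal sketch along those lines, not part of the original notebook; it reuses the image_url variable from the earlier example.

def ask_followup(history, question, model_size=90):
    # Append the new user question, call the model, and record its answer
    history.append({"role": "user", "content": question})
    answer = llama3(history, model_size)
    history.append({"role": "assistant", "content": answer})
    return answer

# Seed the conversation with the image and an initial description request
history = [
  {"role": "user",
   "content": [
      {"type": "text", "text": "Describe the image"},
      {"type": "image_url", "image_url": {"url": image_url}}
   ]
  }
]
description = llama3(history, 90)
history.append({"role": "assistant", "content": description})

# Ask follow-ups that depend on the earlier turns
print(ask_followup(history, "How many of them are wearing hats?"))
print(ask_followup(history, "What else stands out in the image?"))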

Conclusion

Multimodal prompting is an essential tool in the AI toolkit: it allows models to work with both text and images, making them far more flexible. Whether you are analyzing documents and images or handling complex queries, multimodal models like Llama 3.2 can handle these challenges efficiently.

Through this blog, we’ve covered how to set up and invoke multimodal prompts, from simple text queries to more complex image-based interactions. With a bit of practice, you can leverage these powerful tools for your projects.

Colab notebook: https://colab.research.google.com/drive/1e263LHXUGXZcB-K9R8ppcs4xjMqlXUlG?usp=sharing
