An agentic RAG solution for LLMs that can understand PDFs with multiple images and diagrams
Using Ollama, LlamaIndex & LlamaParse
This tutorial assumes a basic understanding of the following concepts: Retrieval Augmented Generation (RAG) and Large Language Models (LLMs).
Introduction
This is a tutorial on building an agentic multimodal RAG solution using open-source Large Language Models (LLMs). We'll run our models locally, for the case where the data source is a complex, image- and diagram-heavy PDF. We'll first build a RAG solution without an agent to understand the underlying infrastructure which will power our AI agent. Afterwards, we'll integrate an AI agent to see how it can vastly enhance our system's abilities.
We’ll be extensively using Ollama, LlamaIndex, and LlamaParse to build our solution.
Further, we’ll be using a 60-page PDF containing rich and complex visuals, including photos, graphs, charts, and diagrams. This will be our primary data source on which LLM responses will be based. PDF: https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf
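If you'd like to follow along, you can download the PDF locally first. Here's a minimal sketch using Python's standard library (the file name matches what the code uses later):
import urllib.request

pdf_url = "https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf"
urllib.request.urlretrieve(pdf_url, "2023-conocophillips-aim-presentation.pdf")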
For an intuitive understanding of what RAG is, check out a previous Medium article I wrote: Retrieval Augmented Generation (RAG): An Intuitive Explanation.
Why build this?
As data complexity grows, traditional text-based RAG solutions face limitations. If your data is heavy on images, a text-based RAG solution will typically convert them to text representations (usually using another LLM) and then build an index over that text. This is lossy, since there is a limit to the information you can capture when you convert an image to text.
Moreover, as user queries become more complex, relying on just the user query is insufficient. We can create queries similar to the one asked by the user and send them to the LLM for a more holistic response. As we’ll see, we can achieve this using AI agents.
This article explains how you can switch to a multimodal, agentic RAG solution where you build the RAG index over both the images and the text contained in your data (using PDFs as the data format).
Here’s a schematic outlining our approach:
Multimodal LLM: The Brain
A key component of this solution is a multimodal LLM: a large language model that takes both text and images as input. We’ll use it to interpret the PDF’s images and text so the LLM can respond to user queries.
We use the Llama3.2-vision model (specifically the 11b-instruct-q4_K_M variant) as our multimodal LLM. We use this variant because instruction-tuned models are trained to better understand and respond to natural-language instructions, making them more useful for task-specific interactions. It is an open-source model which Meta recently released (available in 11B and 90B variants).
I chose Llama 3.2-11b-vision because it is open-source and state-of-the-art in its class. I went with the smaller 11B version since I wanted to run it on my local machine (which has an Nvidia RTX 4080 with 12 GB VRAM). It is a good-enough model for experimentation. You are welcome to try the 90B version and share what you find!
Set-up & Installation
Let’s dive in. First, we need to set up our environment to run the code on a local machine. Heads up: you will need a machine with an NVIDIA GPU with at least 12 GB of VRAM to run the code at tolerable speeds.
The entire project is publicly available on GitHub. Use its README to follow the installation and set-up steps.
Multimodal RAG
Now comes the interesting part. Let’s implement RAG for our solution. Initially, let’s build a solution without AI agents to better understand the usefulness of each component.
Convert Raw PDFs to Indexable Nodes
Since our PDF contains both textual and image data, we use LlamaParse and LlamaIndex to convert it into a series of “nodes” that contain references to both the text and the images from the PDF. We convert our data into these nodes because we can then build a RAG index over them for efficient information retrieval.
The following schematic explains what we want to do in this step:
Here’s the code to do this.
Load the LlamaCloud API key into Jupyter notebook Python code:
import os
from dotenv import load_dotenv
load_dotenv()
# LLAMA_CLOUD_API_KEY will store the API key to make calls
# to LlamaIndex and LlamaParse
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
Add the following boilerplate code so LlamaParse can run from a Jupyter notebook:
# llama-parse is async-first, running the async code in a notebook
# requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()
Set a global variable defining the LLM model we will be using:
MODEL = "llama3.2-vision:11b-instruct-q4_K_M"
Parse PDF files:
'''
Parse PDF both in text and markdown mode. Markdown mode allows us to
capture images as is (as opposed to text mode which uses OCR to convert
images to text)
'''
from llama_parse import LlamaParse
# Conoco Philips Report file name
pdf_file = "./2023-conocophillips-aim-presentation.pdf"
# set whether LlamaParse should invalidate its internal cache. The cache avoids
# parsing the same document multiple times (and thus saves credits).
# LlamaParse maintains its cache for 48 hours.
not_from_cache = False
parser_txt = LlamaParse(verbose=True,
                        invalidate_cache=not_from_cache,
                        result_type="text")
parser_md = LlamaParse(verbose=True,
                       invalidate_cache=not_from_cache,
                       result_type="markdown")
Note: LlamaParse has 3 parsing modes: Fast, Accurate (default), and Premium. LlamaCloud gives you 1000 free credits per day. “Accurate” mode costs 1 credit per page and performs OCR and image extraction to parse textual and visual data. This is sufficient for our use-case. If your PDF file is larger and your use-case demands a different parsing mode, you will have to specify it explicitly. Check out their documentation for more info: https://docs.cloud.llamaindex.ai/llamaparse/output_modes/premium_mode
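For example, if you wanted Premium mode instead of the default, you would pass the corresponding flag when constructing the parser. A hedged sketch (at the time of writing, the Python client exposes a premium_mode flag; double-check the linked docs in case this has changed):
# hedged sketch: enable Premium mode explicitly (it costs more credits per page)
parser_premium = LlamaParse(verbose=True,
                            result_type="markdown",
                            premium_mode=True)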
print(f"Parsing PDF file as text chunks...")
docs_text = parser_txt.load_data(pdf_file)
print(f"Parsing PDF file as markdown chunks and images...")
md_json_objs = parser_md.get_json_result(pdf_file)
md_json_list = md_json_objs[0]["pages"]
# print one json dict as example
print(md_json_list[5]["md"])
# extract images from md_json_list and save in a dir named "llm_images"
image_dicts = parser_md.get_images(md_json_objs, download_path="llm_images")
# print one image dict as example
print(image_dicts[0])
from pathlib import Path
'''
Create a dictionary which maps page numbers to image paths, with
the following format:
{
    1: [Path("path/to/image1"), Path("path/to/image2")],
    2: [Path("path/to/image3"), Path("path/to/image4")],
}
'''
def create_image_index(image_dicts):
    image_index = {}
    for image_dict in image_dicts:
        page_number = image_dict["page_number"]
        image_path = Path(image_dict["path"])
        if page_number in image_index:
            image_index[page_number].append(image_path)
        else:
            image_index[page_number] = [image_path]
    return image_index
from copy import deepcopy
from pathlib import Path
from llama_index.core.schema import TextNode
# create a list of text nodes which contain a reference to both the parsed text
# and the images from each PDF page. References are stored in the nodes' metadata.
def get_text_nodes(docs, json_dicts=None, image_dicts=None):
    """Split docs into nodes, by separator."""
    nodes = []
    image_index = create_image_index(image_dicts) if image_dicts is not None else None
    md_texts = [d["md"] for d in json_dicts] if json_dicts is not None else None
    doc_chunks = [c for d in docs for c in d.text.split("---")]
    for idx, doc_chunk in enumerate(doc_chunks):
        page_num = idx + 1
        chunk_metadata = {"page_num": page_num}
        if image_index:
            # store references to the images on this page
            # (use .get so pages without images don't raise a KeyError)
            chunk_metadata["image_paths"] = [str(path) for path in image_index.get(page_num, [])]
        if md_texts is not None:
            # store a reference to the markdown text
            chunk_metadata["parsed_text_markdown"] = md_texts[idx]
        # store a reference to the plain text
        chunk_metadata["parsed_text"] = doc_chunk
        # create a TextNode containing references to text and images as metadata
        node = TextNode(
            text=doc_chunk,
            metadata=chunk_metadata,
        )
        nodes.append(node)
    return nodes
# this will create a list of TextNodes. Each node contains references
# to the parsed text and images of the corresponding PDF page.
text_nodes = get_text_nodes(docs_text, json_dicts=md_json_list, image_dicts=image_dicts)
# print an example node to see its content
print(text_nodes[5].get_content(metadata_mode="all"))
After this, we have our series of TextNodes which can now be indexed for efficient information retrieval.
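As a quick optional check that parsing worked as expected, you can verify that we got roughly one node per page of the 60-page deck and that each node carries the metadata keys we set:
# quick check: roughly one node per page, each carrying our metadata keys
print(len(text_nodes))
print(text_nodes[0].metadata.keys())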
Build the Index
Once the text nodes are ready, we feed them into a simple in-memory vector store. LlamaIndex provides a convenient VectorStoreIndex class which takes care of building the index for us. It keeps the embeddings and nodes in memory for efficient retrieval during LLM response generation.
For more information on how to use VectorStoreIndex, check out this guide from LlamaIndex: https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/
We will specify our own open-source embedding model (as opposed to LlamaIndex’s default OpenAI embedding model, which is closed-source). We use the popular embedding model BAAI/bge-small-en-v1.5 as follows (you are welcome to experiment with other embedding models; I tried using Llama3.2 embeddings but got very poor results):
# set BAAI/bge-small-en-v1.5 as the vector store embedding model
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
vector_store_embedding = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
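Optionally, you can sanity-check the embedding model by embedding a sample string; BAAI/bge-small-en-v1.5 produces 384-dimensional vectors:
# optional sanity check: bge-small-en-v1.5 returns 384-dimensional embeddings
sample_embedding = vector_store_embedding.get_text_embedding("ConocoPhillips LNG portfolio")
print(len(sample_embedding))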
Now, we set up our RAG index as follows:
import os
from llama_index.core import (
StorageContext,
VectorStoreIndex,
load_index_from_storage,
)
if not os.path.exists("storage_nodes"):
    index = VectorStoreIndex(text_nodes, embed_model=vector_store_embedding)
    # save index to disk
    index.set_index_id("vector_index")
    index.storage_context.persist("./storage_nodes")
else:
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir="storage_nodes")
    # load index
    index = load_index_from_storage(storage_context, index_id="vector_index", embed_model=vector_store_embedding)
The above code looks for a directory called storage_nodes. If the directory is not found, it creates a new index using VectorStoreIndex(text_nodes, embed_model=vector_store_embedding) and persists it to disk under that directory. If the directory is found, it loads the existing index using load_index_from_storage(storage_context, index_id="vector_index", embed_model=vector_store_embedding).
After the above code runs, our RAG index is ready to be used via index.
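Before wiring the index into a query engine, you can sanity-check retrieval on its own. Here's a minimal sketch that prints the page numbers and similarity scores of the top matches for a test query:
# optional sanity check: retrieve the top-3 most similar nodes for a test query
retriever = index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve("global LNG portfolio")
for node_with_score in retrieved_nodes:
    print(node_with_score.node.metadata["page_num"], node_with_score.score)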
Building the Multimodal Query Engine
We now build a custom query engine which will extract text and images from our TextNodes and feed them to our multimodal LLM.
To briefly sum up our approach for this step: when we receive a user query, we use our index to retrieve the top-k TextNodes that are most similar to it. From those nodes, we extract the text and images and combine them with the user’s query to create a rich prompt. We then feed the prompt to our multimodal LLM to generate a response for the user.
Here’s a schematic explaining it:
Here’s the code to do it:
# set Llama3.2-vision:11b-instruct-q4_K_M as our primary model and
# perform a sanity check that it is working
from llama_index.llms.ollama import Ollama
llm_model = Ollama(model=MODEL, request_timeout=500)
response = llm_model.complete("What is the capital of France?")
print(response)
Code to build the custom query engine via LlamaIndex APIs:
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import ImageNode, NodeWithScore, MetadataMode
from llama_index.core.prompts import PromptTemplate
from llama_index.core.base.response.schema import Response
from typing import Optional
QA_PROMPT_TMPL = """\
Use the image(s) information first and foremost. ONLY use the text/markdown information provided in the context
below if you can't understand the image(s).
---------------------
Context: {context_str}
---------------------
Given the context information and no prior knowledge, answer the query. Explain where you got the answer
from, and if there's discrepancies, and your reasoning for the final answer.
Query: {query_str}
Answer: """
QA_PROMPT = PromptTemplate(QA_PROMPT_TMPL)
class MultimodalQueryEngine(CustomQueryEngine):
    """Custom multimodal query engine.

    Takes in a retriever to retrieve a set of document nodes.
    Also takes in a prompt template and a multimodal model.
    """

    qa_prompt: PromptTemplate
    retriever: BaseRetriever
    multi_modal_llm: Ollama

    def __init__(self, qa_prompt: Optional[PromptTemplate] = None, **kwargs) -> None:
        """Initialize."""
        super().__init__(qa_prompt=qa_prompt or QA_PROMPT, **kwargs)

    def custom_query(self, query_str: str):
        # retrieve text nodes
        nodes = self.retriever.retrieve(query_str)
        # create ImageNode items from the image paths stored in the text nodes' metadata
        image_nodes = [
            NodeWithScore(node=ImageNode(image_path=image_path))
            for n in nodes
            for image_path in n.metadata.get("image_paths", [])
        ]
        # create a context string from the text nodes and dump it into the prompt
        context_str = "\n\n".join(
            [r.get_content(metadata_mode=MetadataMode.LLM) for r in nodes]
        )
        fmt_prompt = self.qa_prompt.format(context_str=context_str, query_str=query_str)
        # synthesize an answer from the formatted text and images
        llm_response = self.multi_modal_llm.complete(
            prompt=fmt_prompt,
            image_documents=[image_node.node for image_node in image_nodes],
        )
        return Response(
            response=str(llm_response),
            source_nodes=nodes,
            metadata={"text_nodes": text_nodes, "image_nodes": image_nodes},
        )
Notice the prompt we feed to our LLM (stored as QA_PROMPT_TMPL). Providing extra contextual instructions on how to consume the provided data yields better results. This is a very simple form of prompt engineering; you are welcome to try out different prompting techniques.
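As one example to experiment with, here is a hedged variation that nudges the model to reason step by step and cite page numbers. You could swap it in by passing qa_prompt when constructing MultimodalQueryEngine (COT_PROMPT is just an illustrative name, not part of the original solution):
from llama_index.core.prompts import PromptTemplate

# an alternative, illustrative prompt template to experiment with
COT_PROMPT_TMPL = """\
You are given page images and their parsed text from a company presentation.
---------------------
Context: {context_str}
---------------------
Think step by step: first list the facts from the images/text that are relevant to the query,
then answer the query using only those facts, citing the page numbers you relied on.
Query: {query_str}
Answer: """

COT_PROMPT = PromptTemplate(COT_PROMPT_TMPL)
# usage: MultimodalQueryEngine(qa_prompt=COT_PROMPT, retriever=..., multi_modal_llm=...)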
Our custom query engine is now ready. We use it as follows to construct an LLM response. Let’s see it in action:
# 'similarity_top_k' signifies how many nodes that are semantically
# most similar to the user query should be retrieved
#
# 'similarity_top_k' is a hyper-parameter. Try experimenting with
# different values for better results.
query_engine = MultimodalQueryEngine(
    retriever=index.as_retriever(similarity_top_k=5), multi_modal_llm=llm_model
)
# run user query and print response from LLM
response = query_engine.custom_query("Tell me about global LNG portfolio for ConocoPhillips")
print(response)
For the above query, I got the following response:
Based on the provided context, I can summarize the information about
ConocoPhillips' global LNG (Liquefied Natural Gas) portfolio as follows:
ConocoPhillips has a diversified and expanding LNG portfolio with several
key assets and projects. Some of the highlights include:
1. **Qatar Expansion**: ConocoPhillips is expanding its presence in Qatar,
which is expected to become one of the largest LNG exporters globally.
The company has a 10% shareholding interest in the North Field Expansion (NFE)
project, which will increase Qatar's LNG production by an additional
2 million tonnes per annum.
2. **Qatargas 3**: ConocoPhillips has a legacy position supplying Asian
and European markets through Qatargas 3, with a long-term contract that
covers approximately 90% of its volumes.
3. **North Field Expansion Projects**: The company is building on its
20-year relationship with Qatar by partnering in the NFE project,
which will bring online two new LNG trains (NFE and NFS) in 2026 and 2027,
respectively.
4. **APLNG**: ConocoPhillips has a stake in the Australia Pacific LNG (APLNG)
project, which is expected to deliver significant LNG volumes from 2018
onwards.
5. **Port Arthur**: The company's Port Arthur facility in Texas, USA,
serves as a liquefaction and export hub for its North American gas resources.
In terms of market presence, ConocoPhillips is the second-largest
natural gas marketer in North America and has a global market presence
with over 60 years of experience in LNG. The company is committed to
disciplined reinvestment and has a rigorous capital allocation framework
in place, which prioritizes returns-focused investments.
Overall, ConocoPhillips' global LNG portfolio is diversified across
several regions, including Qatar, Australia, and North America,
with a focus on delivering reliable cash flows and expanding its market
presence.
**Sources:**
* Page 22 of the provided context (LNG: Expanding World-Class Assets)
* Page 31 of the provided context (LNG Opportunities Underpinned by Strong Commercial Acumen)
* Page 10 of the provided context (Strategy Powers Our Returns-Focused Value Proposition)
I checked the PDF and found that pages 22, 31, and 10, which the response references, do contain this information:
Let’s check it with another query:
# run a query
response = query_engine.custom_query("Tell me about the diverse geographies where Conoco Phillips has a production base.")
print(response)
After a couple of retries, I got a relevant response:
Based on the provided text snippets, I can infer that ConocoPhillips has a significant presence in various regions. Here are some of the diverse geographies mentioned:
1. **Alaska**: The company has a legacy world-class basin in Alaska, specifically in the North Slope, Western Torok Kuparuk Area, and Greater Non-operated Prudhoe Bay. (Source: Page 26)
2. **Texas**: ConocoPhillips has a prolific acreage spanning over in Mexico, Texas, as part of its Permian Inventory. (Source: Page 38)
3. **Permian Basin** (Texas/New Mexico): The company has a significant presence in the Permian Basin, with a focus on the Delaware and Midland Basins. (Source: Page 38)
4. **Mexico**: ConocoPhillips has a substantial acreage of 659,000 net acres in Mexico, which is part of its prolific acreage spanning over in Texas. (Source: Page 38)
Additionally, there are mentions of other regions and basins, such as:
* North Dakota (not explicitly mentioned but implied through the context)
* The Gulf Coast region (implied through the mention of infrastructure and facilities)
However, I couldn't find any information about ConocoPhillips' production base in these regions.
It's worth noting that the provided text snippets focus on specific aspects of ConocoPhillips' operations, such as its legacy world-class basin in Alaska or its prolific acreage in Texas. While this information is not exhaustive, it provides a general idea of the company's diverse geographies where it has a production base.
Scanning the PDF leads me to the following pages, which informed the LLM’s response:
In summary, we can see that Llama3.2-11b-vision works decently well at summarizing and finding answers around details embedded in the pages of a large, complex PDF. There were some instances where it had trouble finding answers, and I had to re-run the same query a few times to get a relevant response.
Agentic Integration
The heavy lifting is done! We have successfully built a lightweight solution using a multimodal LLM which can answer questions based on details from PDFs. Now, let’s add an agentic layer to this. We’ll see how AI agents can integrate with the environment to provide more up-to-date, real-time information in response to user queries. This will introduce us to the power of AI agents, which can automatically infer (and execute) certain tasks to get the best possible response from LLMs.
What are AI agents and why do we need them?
Let’s understand the usefulness of AI agents using an example. Imagine a laptop manufacturer receives the following query from a customer:
“Can you upgrade my laptop to replace my GPU with the latest Nvidia RTX 5090 GPU? If so, how much time will it take?”
Instead of directly sending this query to the engineers, there is usually a middle layer of customer support executives who facilitate the interaction between the customer and the engineers. Customer support should be intelligent enough to ask the customer relevant questions to fill in any information gaps, or to look up internal company records to add information that would help the engineers understand and respond to the problem. For our example, customer support might collect information such as the laptop model, the currently installed GPU, and whether the requested part is compatible and in stock.
If we compare the above analogy with an agentic AI system, the engineers correspond to the LLM, and the customer support corresponds to the AI agent.
AI agents are software programs that leverage artificial intelligence to perform tasks, process data, and make decisions on their own. They are usually designed to improve user experience and enhance the output of an LLM. In fact, typically, an AI agent itself uses a separate LLM to make decisions around the tasks to execute before communicating with the primary LLM.
So, the key points we can infer about AI agents are:
1. They are software programs that can process data, make decisions, and execute tasks on their own.
2. They sit between the user and the primary LLM to improve the user experience and enhance the LLM’s output.
3. They typically rely on a separate LLM of their own to decide which tasks to execute before communicating with the primary LLM.
Let’s now build a simple AI agent to understand how it works.
Select the Agentic LLM (Tool-calling Support)
We’ll use LlamaIndex to build an AI agent. As discussed above, our agent needs a separate LLM to operate, and not every LLM is designed with agentic support. If we look at the list of Ollama models, we can see certain models tagged with “tools”; those are the ones with tool-calling (agentic) support.
I used Llama 3.1 for our agent. You can run the following command to download the model and run it on your machine:
# downloads and runs the llama3.1 model locally
# model size - 4.7GB
# parameters - 8B
ollama run llama3.1
I chose the 8B variant (default). You can try the 70B or 405B variant as well.
Set up the Agent using LlamaIndex
Let’s dive into the code. First, let’s set up our agentic LLM:
# set Llama3.1 as the agentic model, which has tool-calling support,
# and perform a sanity check that it is working
from llama_index.llms.ollama import Ollama
llm_model_tool_calling = Ollama(model="llama3.1")
response = llm_model_tool_calling.complete("Are you able to process image inputs?")
print(response)
Before proceeding, let’s decide on the user query we want to ask our (primary) multimodal LLM via our AI agent. Going back to the tutorial’s original ConocoPhillips PDF, we can ask the following query:
“What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate.”
Notice that the first part of the question, about the average cost of supply, can be answered from the PDF. But the second part, about currency conversion, cannot. Moreover, our primary LLM is not equipped to answer it, because exchange rates fluctuate every day. Answering the query requires pulling in the current USD-INR exchange rate and then using it to compute the converted value.
Let’s see how our primary LLM responds to this query without agentic integration:
# run a query
response = query_engine.custom_query("What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate.")
print(response)
Output:
Based on the text, the average cost of supply in 2016 was $40/BBL WTI.
To convert this amount to Indian Rupees (INR) using the current
exchange rate, I'll use an approximate conversion rate of 1 USD = 75 INR.
$40/BBL WTI ≈ ₹3000 per barrel
Note: Please keep in mind that currency exchange rates may fluctuate
frequently and might not be up-to-date at the time of my response.
For accurate conversions, please check current exchange rates.
This answer is derived from the text on page 15, which states:
"... Resource: ~ $40/BBL WTI ... Average Cost of Supply: $40/BBL WTI ..."
This is incorrect! As of Feb 9th, 2025, the current exchange rate is 1 USD ≈ 87.5 INR.
This is where the true power of AI agents emerges! We will build an AI agent which will understand our query, pull the exchange rate in real time, do the currency conversion, and respond with the correct answer.
So, to sum up, our agent will need a way to:
1. Retrieve the relevant context (the 2016 average cost of supply) from our PDF via the existing query engine.
2. Fetch the current USD-INR exchange rate and convert the amount.
The above two tasks will be done using what are called “tools” in the agentic world. Loosely speaking, “tools” are functions which the agent can call, based on the user query and the data available, to get the job done. We define the tools for our use-case as follows:
import requests
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.tools import FunctionTool
from pydantic import Field
# custom function which will be integrated as a tool for our agent
def currency_converter(from_currency_code: str = Field(
        description="Country code of the currency to convert from (e.g., USD, INR, EUR)"
    ), to_currency_code: str = Field(
        description="Country code of the currency to convert to (e.g., USD, INR, EUR)"
    ), amount: float = Field(
        description="Currency amount to convert"
    )) -> float:
    # free API for currency exchange rates. No auth required
    api_url = f"https://api.vatcomply.com/rates?base={to_currency_code}"
    response = requests.get(api_url)
    data = response.json()
    if "error" in data:
        raise ValueError(data["error"])
    rates = data["rates"]
    conversion_factor = rates[from_currency_code]
    converted_amount = float(amount) / conversion_factor
    return converted_amount
# TOOL for currency conversion
# This is a custom tool based on our custom-defined function
currency_converter_tool = FunctionTool.from_defaults(
    currency_converter,
    name="currency_converter_tool",
    description="Converts a currency amount from one currency code to another based on the current exchange rate. "
    "Takes the currency amount, the code of the currency to convert from, and the code "
    "of the currency to convert to as input.",
)

# TOOL for querying the engine to retrieve contextual information around the user query
# LlamaIndex provides the QueryEngineTool class to quickly wrap an existing RAG-based query engine
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="query_engine_tool",
    description=(
        "Useful for retrieving specific context from the data. Do NOT select if question asks for a summary of the data."
    ),
)
LlamaIndex provides convenient APIs to define tools which our agent can use. Our two tools are initialized as query_engine_tool and currency_converter_tool. An important thing to note is to set the name and description properties of the tools. It is also helpful to define the description property for the tool function’s arguments (currency_converter in our case). This is because our agent uses these properties to infer which tool to call and how (using the agentic LLM underneath, Llama 3.1 in our case).
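To see what the agent will look at when deciding which tool to call, you can print the tool metadata. It is also worth calling the raw currency_converter function once to confirm the exchange-rate API works (both are optional sanity checks):
# optional: inspect the metadata the agent uses for tool selection
print(currency_converter_tool.metadata.name)
print(currency_converter_tool.metadata.description)
print(query_engine_tool.metadata.name)

# optional: call the raw function once to confirm the exchange-rate API works
print(currency_converter("USD", "INR", 40.0))  # roughly 3500 at ~87.5 INR/USD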
Now, let’s initialize our agent using LlamaIndex.
# Set up the agent for calling the currency conversion and query engine tools
agent = FunctionCallingAgentWorker.from_tools(
    [currency_converter_tool, query_engine_tool], llm=llm_model_tool_calling, verbose=True
).as_agent()
Notice how we pass the agentic LLM as llm_model_tool_calling. Our primary multimodal LLM (Llama3.2-vision) is integrated via query_engine in query_engine_tool.
That’s it. We are now ready to use our AI agent to execute user queries. Let’s see it in action by sending the same query about currency conversion as we did before.
Running Queries
Let’s run our query from earlier:
# Query 1
query = (
    "What was the average cost of supply in 2016? Convert the amount to INR based on current exchange rate."
)
# using agent to run the query
response = agent.query(query)
print(response)
Output:
Added user message to memory: What was the average cost of supply in 2016?
Convert the amount to INR based on current exchange rate.
=== Calling Function ===
Calling function: query_engine_tool with args:
{"input": "average cost of supply in 2016"}
=== Function Output ===
The average cost of supply in 2016 is not explicitly mentioned in the
provided text. However, based on the information provided on page 37,
it appears that the cost of supply in 2016 was around $35/BBL.
From the chart on page 37, we can see a graph showing production (MBOED)
and associated costs over several years. The chart shows:
* Production: 1,200 MBOED in 2022
* Cost of Supply: approximately $32/BBL in 2016
However, this is for the Lower 48 Unconventional Production only.
It's not mentioned what the cost of supply was for all ConocoPhillips
operations in 2016.
Given the context and available information, I can only provide an estimate
based on the chart provided on page 37, which suggests that the average
cost of supply in 2016 was around $32/BBL for Lower 48 Unconventional
Production. However, this might not be representative of ConocoPhillips'
overall operations.
Answer: approximately $32/BBL (for Lower 48 Unconventional Production only)
=== Calling Function ===
Calling function: currency_converter_tool with args:
{"amount": "1", "from_currency_code": "USD", "to_currency_code": "INR"}
=== Function Output ===
87.51421412739712
=== LLM Response ===
Based on the current exchange rate (1 USD = 87.51 INR), the average
cost of supply in 2016 for ConocoPhillips' Lower 48 Unconventional
Production is approximately:
$32/BBL * 87.51 INR/USD ≈ ₹2,805.12/BBL
Based on the current exchange rate (1 USD = 87.51 INR), the average cost
of supply in 2016 for ConocoPhillips' Lower 48 Unconventional
Production is approximately:
$32/BBL * 87.51 INR/USD ≈ ₹2,805.12/BBL
And voila! Our agent was intelligent enough to pull the relevant data and do all the necessary conversions.
Let’s ask another query where the currency conversion tool does NOT need to be called.
# Query 2
query = (
    "Tell me about different locations where Conoco Philips has a production base."
)
response = agent.query(query)
print(response)
Output:
Added user message to memory: Tell me about different locations where Conoco Philips has a production base.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Locations where Conoco Philips has a production base"}
=== Function Output ===
Based on the provided text snippets, I can identify several locations associated with ConocoPhillips' operations:
1. **Alaska**: Mentioned in the snippet about the Willow Project, which is located in Alaska.
2. **Gulf of Mexico**: Referenced in the context of the company's exploration and production activities.
3. **Delaware Basin** (Permian Basin): Discussed as a significant area for ConocoPhillips' operations, with a focus on unconventional oil and gas development.
4. **Texas**: Mentioned in the context of the Permian Basin, specifically in relation to the Delaware Basin.
5. **Mexico**: Identified as a country with new prolific acreage spanning over 659,000 net acres.
These locations are mentioned across various snippets, indicating that ConocoPhillips has a production base or significant operations in these areas.
However, there is no explicit mention of other locations, such as the Middle East, Asia, or Europe, which might be home to the company's production bases. The text focuses primarily on North American regions.
Given this information, I can provide an answer:
Locations where Conoco Philips has a production base: **Alaska**, **Gulf of Mexico**, **Delaware Basin (Permian Basin) in Texas**, and **Mexico**.
Note that there may be other locations not mentioned in the provided text snippets, and this answer is based solely on the information available.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Locations where Conoco Philips has a production base"}
=== Function Output ===
Based on the provided text snippets, I was unable to find explicit information about the specific locations where ConocoPhillips has a production base. However, I did come across some relevant mentions:
1. Delaware Basin (p.38): The text states that ConocoPhillips has a "vast inventory with proven track record of performance" in the Delaware Basin.
2. Permian Inventory (p.38): It's mentioned that ConocoPhillips has a total Permian inventory, but it doesn't specify where exactly within the Permian Basin they have operations.
3. Mexico (p.38): The text mentions "New Prolific Acreage Spanning Over Mexico" with 659,000 Net Acres. This could imply that ConocoPhillips has operations in Mexico.
However, I couldn't find any information about their production base locations outside of these regions.
From the provided snippets:
* p.5: It doesn't mention specific location but mentions they have a deep, durable and diverse portfolio.
* p.37-38: Focuses on their Permian Basin operations, specifically in Texas (Delaware Basin) and possibly Mexico.
* p.35-36: Discusses their global presence without specifying exact locations.
Therefore, based on the information provided, I will provide an answer with some caveats:
Locations where ConocoPhillips has a production base might include:
1. Delaware Basin (Texas)
2. Permian Inventory (unknown specific location within the Permian Basin)
3. Mexico
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Conoco Philips Production bases in Europe"}
=== Function Output ===
The query is about Conoco Philips' production bases in Europe. After reviewing the provided text snippets, I found information related to Conoco Phillips' operations, but I couldn't find any specific information about their production bases in Europe.
However, I did find a mention of "Europe" in one of the text snippets:
"Improved Performance from 2019 to 2022"
This snippet is not relevant to the query. The snippet that might be related to Conoco Phillips' operations in Europe is:
"Delaware Basin Well Capex/EUR ($/BOE)"
However, this snippet does not mention Europe either.
Upon reviewing all text snippets again, I couldn't find any direct information about Conoco Philips' production bases in Europe.
Since there's no specific information available, my answer would be:
"I couldn't find any information about Conoco Philips' production bases in Europe."
I got the answer from a thorough review of the provided text snippets. If you have more information or context about Conoco Phillips, I may be able to help you better.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Conoco Philips Production bases in Asia"}
=== Function Output ===
Based on the provided text snippets, I couldn't find any direct mention of ConocoPhillips' production bases in Asia. However, I can make an inference based on a specific snippet related to their operations in Canada and Australia.
In snippet 5 (page 5), there is no direct mention of ConocoPhillips' production bases in Asia. But in snippet 23 (page 23), under the section "Surmont: Leveraging Low Capital Intensity for Decades of Flat Production," it mentions "Niobrara and Australia-West Exits" which implies that they have operations in these regions.
However, I couldn't find any direct mention of production bases specifically in Asia. If we consider snippet 23 as a source, then we can say that ConocoPhillips has some presence (possibly in the form of exits) in the Australian region, but not necessarily a production base.
Therefore, based on the provided information and without further context, I would say:
**Answer:** There is no conclusive evidence to confirm ConocoPhillips' production bases in Asia. However, they do have some operations or presence in Australia.
Please note that this answer may be subject to change if more information becomes available or if there's any discrepancy between my inference and actual facts.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Conoco Philips Production bases in the Middle East"}
=== Function Output ===
I couldn't find any information about ConocoPhillips' production bases in the Middle East from the provided text. The text mentions various locations such as Germany, Singapore, London, Houston, Calgary, Beijing, and Tokyo, but it does not mention the Middle East.
However, I can look at the resource list (Conventional, International, Alaska) under the "Production Mix" section of the text. Under this list, it says "International Gas ~30%". This could potentially imply that ConocoPhillips has some international gas production bases, but it does not specifically mention the Middle East.
Therefore, based on the information provided in the text, I would have to say:
"No, there is no specific information about ConocoPhillips' production bases in the Middle East."
But it's also possible that ConocoPhillips might have some production presence in the Middle East under the "International" category, but without more specific information, it's hard to confirm.
None
Notice a few things happening in the above two queries:
1. For the first query, the agent broke the task into two steps: it called query_engine_tool to find the 2016 cost of supply, then called currency_converter_tool to fetch the live USD-INR rate and converted the amount.
2. For the second query, the agent correctly decided that the currency conversion tool was not needed. Instead, it called query_engine_tool several times with rephrased inputs (production bases in Europe, Asia, the Middle East) to gather a more thorough answer.
As you can see, AI agents can be very powerful tools with amazing potential to improve existing AI systems. They can save a lot of time and effort by using their own intelligence to generate a rich and thorough response in the end.
Summary & Observations
Here’s what I observed while building this solution:
1. The multimodal model (Llama3.2-vision, 11B) handled the image- and chart-heavy pages decently well, though some queries needed a few retries before it found the relevant answer.
2. The agentic layer (Llama 3.1) reliably combined the query engine with the currency converter when needed, and skipped the currency tool when it wasn’t, though sometimes at the cost of extra tool calls.
3. Prompting and hyper-parameters such as similarity_top_k noticeably affect answer quality and are worth experimenting with.
In summary, the above solution works well for consuming large, complex PDFs and provides a good base for experiments toward production-readiness. Using larger and better LLMs, such as the 90B variant of Llama3.2-vision, might produce even better results.
Code
Find the full code of this solution at the following git repository:
References