From Reasoning to Action: Understanding AI Agents With a Simple Program

Artificial Intelligence (AI) continues to evolve, and one of the most exciting developments is the concept of AI Agents. These agents operate autonomously to perform tasks, leveraging the power of tools, memory, and reasoning abilities to achieve their goals. In this article, we’ll dive deep into what AI Agents are, their critical components, and how they work, with practical examples to demonstrate their applications.

1. What are AI Agents?

An AI agent is an autonomous system that uses AI to interact with its environment, perceive information, and make decisions to achieve specific tasks. Unlike traditional models, these agents can access tools, use memory, plan their actions, and reflect on their decisions in real time. They go beyond mere question-answering, aiming to execute complex tasks by reasoning, fetching data, and dynamically adjusting based on new inputs.

2. Building Blocks of AI Agents

AI agents consist of several key components that work together to enable their autonomy and decision-making. Let’s explore each of these in detail:

2.1 Memory

AI agents rely on memory to store and retrieve information. This can be divided into:

  • Short-term memory: Stores immediate context and temporary information during a conversation or task.
  • Long-term memory: Helps the agent recall previously encountered information across sessions, improving continuity.

Using Retrieval-Augmented Generation (RAG), AI agents can retrieve relevant external knowledge from large databases or documents, enabling them to access data that may not be part of their initial training.
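
As a minimal illustration, LlamaIndex (the framework used in the program in section 7) provides a ChatMemoryBuffer that can serve as an agent’s short-term memory. The snippet below is a sketch; the example messages are invented for illustration:

# Short-term memory sketch using LlamaIndex's ChatMemoryBuffer
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.llms import ChatMessage

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)  # keeps recent turns within a token budget
memory.put(ChatMessage(role="user", content="My order id is 4711."))
memory.put(ChatMessage(role="assistant", content="Thanks, noted."))

# Later in the same session, the agent can read the stored turns back as context
print(memory.get())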

2.2 Tools

AI agents are equipped with various tools to interact with the environment and perform actions. Some of the most common tools include:

  • API Calls: Allow the agent to query third-party services to retrieve data, such as weather, financial information, or customer databases.
  • Utility Calls: The agent can use utilities such as calendars, file systems, or other software tools to execute specific tasks.
  • Code Interpreter: The agent can execute code to perform computations, analyze data, or manipulate information dynamically.
  • Search Functions: The agent can perform web searches or query specific databases, enabling it to fetch up-to-date information that is not part of its training set.

These tools provide AI agents with the ability to perform real-world tasks, making them more dynamic and capable of handling complex scenarios.
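
For example, wrapping a plain Python function as a tool takes only a few lines with LlamaIndex’s FunctionTool (the full search tool used by the agents appears in section 7); the multiply function here is just an illustrative placeholder:

# Minimal tool sketch: expose a Python function to the agent
from llama_index.core.tools import FunctionTool

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
print(multiply_tool.metadata.name)  # the tool name the agent will see: "multiply"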

2.3 Planning

AI agents utilize planning mechanisms to break down complex problems into manageable tasks. Key components of planning include:

  • Reflection: The ability to reflect on past actions and adjust strategies based on success or failure.
  • Self-Critique: Agents evaluate their own decisions and performance, improving over time by learning from mistakes.
  • Chain of Thoughts: AI agents can reason step-by-step, allowing them to handle multi-step problems.
  • Subgoal Decomposition: Breaking down a larger task into smaller, more achievable goals, ensuring better task management and execution (a minimal sketch follows this list).
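
The sketch below illustrates subgoal decomposition with a plain LLM call; the task string and prompt wording are invented for illustration, and a real agent would feed each subgoal back into its tool-calling loop:

# Subgoal decomposition sketch: ask the LLM to plan before acting
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")  # assumes OPENAI_API_KEY is set in the environment
task = "Prepare a short market report on electric vehicles"

plan = llm.complete(f"Break the following task into 3-5 ordered subgoals, one per line:\n{task}")
subgoals = [line.strip() for line in str(plan).splitlines() if line.strip()]
for i, subgoal in enumerate(subgoals, 1):
    print(i, subgoal)  # each subgoal can then be handled by its own tool call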

2.4 Action

Finally, the Action component involves executing the planned steps, whether that’s calling an API, performing a search, or running computations. AI agents can dynamically decide when and how to act, adjusting based on the outcomes of previous actions.


3. Why AI Agents?

While Large Language Models (LLMs) like GPT-3 and GPT-4 are powerful, they have limitations in reasoning, accessing real-time information, and performing complex tasks that require decision-making and planning. LLMs are restricted to the data they were trained on, lacking access to live data, databases, and domain-specific knowledge.

AI agents address these limitations by combining LLMs with external tools and reasoning capabilities, enabling them to handle more complex tasks dynamically.

Limitations of LLMs:

  • Static knowledge: Only trained on past data, often outdated.
  • No real-time data: Cannot access live or proprietary information.
  • Limited reasoning: Struggle with multi-step reasoning and complex decision-making.
  • No autonomous action: Provide responses but cannot perform actions like querying databases or executing tasks.

How AI Agents Overcome These Limitations:

  • Complex Reasoning: AI agents implement multi-step reasoning, maintaining context across interactions and breaking down tasks into manageable steps. This allows agents to perform detailed analyses like summarizing sections, retrieving technical details, and reflecting on insights.
  • Observation and Action (ReAct Agents): AI agents observe their environment (e.g., external tools and APIs) and take actions based on those observations. This continuous observation-action loop allows the agent to adjust its strategy dynamically, refining results or initiating further actions like API calls or document retrieval.
  • Retrieval-Augmented Generation (RAG): AI agents use RAG to pull real-time or domain-specific data from external sources and fold it into their responses (covered in detail in section 5).
  • Autonomous Action and Planning: AI agents can autonomously plan and execute complex tasks by decomposing them into sub-goals and determining the best route to achieve each goal. They act proactively, calling APIs, retrieving documents, or running computations without waiting for user prompts.

In short, AI agents extend the capabilities of LLMs by enabling real-time data retrieval, autonomous reasoning, and dynamic actions, making them ideal for complex workflows like research, automation, and decision-making.


4. Four Core Components of an AI Agent

Let’s explore the core components of an AI agent using LlamaIndex, where each element adds distinct capabilities (refer to the program in section 7 to see each component in use):

4.1 FunctionTool

FunctionTool enables AI agents to execute dynamic functions (e.g., database searches, API calls) within their workflows, extending the agent beyond simple text responses.

4.2 FunctionCallingAgent

FunctionCallingAgent allows agents to invoke multiple tools dynamically during task execution, solving complex problems through interactions like API calls and searches.

4.3 AgentRunner

AgentRunner orchestrates the execution of agents, ensuring tasks are managed efficiently by coordinating multiple tools or agents in parallel or sequence.

4.4 FunctionCallingAgentWorker

FunctionCallingAgentWorker is responsible for executing individual tools in an agent’s workflow, handling specific tasks like API calls or data retrieval, enabling efficient task execution.
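
The compact sketch below shows how these pieces compose (the complete, runnable program follows in section 7; the echo function and model name are placeholders). Roughly speaking, FunctionCallingAgent offers the same behavior behind a single from_tools call:

# How the components fit together (placeholder tool and model)
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import FunctionCallingAgentWorker, AgentRunner
from llama_index.llms.openai import OpenAI

def echo(text: str) -> str:
    """Return the input text unchanged."""
    return text

tool = FunctionTool.from_defaults(fn=echo)       # 4.1 wrap a function as a tool
worker = FunctionCallingAgentWorker.from_tools(  # 4.4 worker that executes tool calls
    [tool], llm=OpenAI(model="gpt-3.5-turbo")
)
agent = AgentRunner(worker)                      # 4.3 orchestrates the task
print(agent.chat("Echo back the word hello"))    # the agent decides when to call the tool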


5. Agentic RAG: AI Agents with Retrieval-Augmented Generation (RAG)

5.1 What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines the generative capabilities of large language models (LLMs) with the retrieval of relevant information from external knowledge sources (e.g., documents, databases, or APIs). In RAG, instead of relying solely on pre-trained knowledge, the model queries external data to augment its responses, making the results more accurate, up-to-date, and context-specific.

5.2 What is the RAG Pipeline?

The RAG pipeline consists of a structured process where external data is ingested, processed, and then used in combination with an LLM for generating accurate responses. The key steps, illustrated by the sketch after this list, include:

  1. Query Generation: The agent generates a query based on the user's input.
  2. Retrieval: Relevant information is fetched from external knowledge sources (e.g., vector databases or document stores).
  3. Augmentation: The retrieved information is passed to the LLM to augment its understanding and help generate a more informed response.
  4. Generation: The LLM uses the augmented data to produce a more accurate and contextually relevant answer.
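
The sketch below makes these four steps explicit; it assumes the vector_index and llm objects built in the section 7 program, and the query string is only an example:

# RAG pipeline sketch: query -> retrieval -> augmentation -> generation
query = "What does the document say about knowledge transition?"        # 1. query generation
retriever = vector_index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve(query)                              # 2. retrieval
context = "\n".join(n.node.get_content() for n in retrieved_nodes)       # 3. augmentation
answer = llm.complete(f"Context:\n{context}\n\nQuestion: {query}")       # 4. generation
print(answer)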

5.3 What is Ingestion in the RAG Pipeline?

Ingestion refers to the process of converting external documents or data into a format that can be queried by the agent. Typically, this involves:

  • Document Parsing: Breaking down documents or databases into chunks that are meaningful and retrievable.
  • Vectorization: Converting parsed information into vector embeddings, which allow for efficient similarity searches.
  • Indexing: Storing these embeddings in a vector index or document store so that they can be easily queried when needed.

Once the ingestion is complete, the agent can retrieve the relevant pieces of information and use them in conjunction with the LLM for response generation.
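
The minimal sketch below mirrors the section 7 program and maps each step above to a line of code (the file path matches the Colab example in section 7, and the embedding model is configured there via Settings):

# Ingestion sketch: parse -> chunk -> embed -> index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader(input_files=["/content/kt.pdf"]).load_data()  # document parsing
nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)   # chunking
index = VectorStoreIndex(nodes)         # vectorization + indexing (embeddings are computed here)
query_engine = index.as_query_engine()  # ready to be queried during response generation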

6. What is the ReAct Agent?

The ReAct Agent (short for Reasoning and Acting) represents a distinct approach to building AI agents that can dynamically reason through complex tasks by engaging in a loop of Observation and Action.

6.1 Observation & Action in ReAct Agent

  • Observation: The agent observes the outcome of its actions, whether it’s a response from a tool, new information from an API, or a failed attempt at solving a problem. These observations are used to refine its understanding of the task at hand.
  • Action: Based on its observations, the agent determines the next step. The action could involve retrieving more information, performing a search, querying APIs, or executing code. Each action feeds back into the cycle, creating a closed loop of continuous learning and task execution.

This observation-action loop enables the ReAct Agent to adjust its course mid-execution, allowing it to handle complex tasks with multiple decision points. It’s ideal for scenarios where the agent needs to gather information incrementally and act based on intermediate results.

6.2 What is the ReAct Prompt?

The ReAct Prompt is a structured way to guide the agent’s decision-making process by combining natural language instructions with reasoning steps. It serves as both an input format and a strategy for breaking down the agent’s interactions into a logical flow of observation and action. Various open-source frameworks, such as Langchain or CrewAI, provide their own prompt templates.
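
As a simplified illustration (the exact templates shipped by these frameworks differ in wording), a ReAct-style prompt interleaves Thought, Action, and Observation steps:

# Illustrative ReAct-style prompt skeleton (simplified)
REACT_PROMPT = """Answer the question using the tools below.
Tools: {tool_names}

Use this format:
Thought: reason about what to do next
Action: the tool to use, one of [{tool_names}]
Action Input: the input to the tool
Observation: the result returned by the tool
... (Thought/Action/Observation can repeat) ...
Thought: I now know the final answer
Answer: the final answer to the original question

Question: {question}
"""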

6.3 Input and Output in ReAct Agent

  • Input: The input to the ReAct Agent typically consists of a combination of user queries and environmental data. The user may provide a task or question, and the agent uses this input along with its internal and external tools (APIs, search engines, memory, etc.) to generate a plan of action. In addition, ReAct agents can dynamically adjust their input by interpreting intermediate results (observations) and refining their subsequent actions based on this data.
  • Output: The output of a ReAct Agent is often multi-layered. It includes not only the final response or solution to the user's query but also intermediate outputs such as actions taken, search results fetched, code executed, and any observations made along the way. This makes the ReAct approach highly suitable for complex workflows where understanding the reasoning process is as important as the final outcome.

7. Agentic RAG in Action (you will need Google Colab and an OpenAI API key to run this program)

#install all dependencies

!pip install llama-index
!pip install duckduckgo-search
!pip install openai transformers        
# Import necessary libraries

from llama_index.core.tools import FunctionTool
from duckduckgo_search import DDGS
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.agent import FunctionCallingAgent, FunctionCallingAgentWorker, AgentRunner
from llama_index.core import SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.agent import ReActAgent

# -----Define the Search Tool -----
def search(query: str) -> str:
    """
    Search the web with DuckDuckGo and return the combined result snippets.

    Args:
        query: User prompt
    Returns:
        context (str): Search results based on the user query
    """
    req = DDGS()  # Use DuckDuckGo Search for results
    response = req.text(query, max_results=4)
    context = ""
    for result in response:
        context += result['body'] + "\n"
    return context

# Create a FunctionTool from the search function
search_tool = FunctionTool.from_defaults(fn=search)

# ----- Set up OpenAI LLM and OpenAI Embedding Model -----
openai_api_key = ""  # Set your OpenAI API key here

# Use OpenAI for the LLM (GPT-3.5-turbo or GPT-4, based on your model access)
llm = OpenAI(api_key=openai_api_key, model="gpt-3.5-turbo")  # Replace with your model name

# Use OpenAI for embeddings
embed_model = OpenAIEmbedding(api_key=openai_api_key)

# Set the LLM and embedding model globally via Settings
Settings.llm = llm
Settings.embed_model = embed_model

# ----- Function Calling Agent Example -----
query = "Who won the FIFA World Cup 2022 in Qatar Saudi Arabia?"

# Function calling directly with the tools
function_call = llm.predict_and_call(
    [search_tool],
    user_msg=query,
    allow_parallel_tool_calls=True
)
print("Function Call Response:", function_call.response)

# ----- Function Calling Agent and AgentRunner -----

# FunctionCallingAgent Example
agent = FunctionCallingAgent.from_tools(
    [search_tool],
    llm=llm,
    verbose=True,
    allow_parallel_tool_calls=True
)

response = agent.chat(query)
print("FunctionCallingAgent Response:", response)

# AgentRunner Example
agent_worker = FunctionCallingAgentWorker.from_tools(
    [search_tool],
    llm=llm,
    verbose=True,
    allow_parallel_tool_calls=True
)

agent_runner = AgentRunner(agent_worker)
runner_response = agent_runner.query(query)
print("AgentRunner Response:", runner_response.response)

# ----- Agentic RAG -----

# Load your documents (example with a PDF)
documents = SimpleDirectoryReader(input_files=["/content/kt.pdf"]).load_data()

# Split the document into nodes for processing
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

# Build summary and vector indices
summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)

# Set up query engines for summary and vector retrieval
summary_query_engine = summary_index.as_query_engine(response_mode="tree_summarize", use_async=True)
vector_query_engine = vector_index.as_query_engine()

# Create tools from the query engines
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions about the uploaded document."
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Useful for retrieving specific facts and details from the uploaded document."
)

# Set up RouterQueryEngine with an LLM-based selector that routes each query to the most suitable tool
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
    verbose=True
)

# Example queries with Agentic RAG, retrieving data from the uploaded PDF
response_summary = query_engine.query("Summarize the document's discussion of knowledge transition.")
print("Summary Tool Response:", response_summary)

response_vector = query_engine.query("What technical details does the knowledge article cover?")
print("Vector Tool Response:", response_vector)

# ----- ReAct Agent Example -----

# Define a ReAct agent that uses the search tool and LLM
react_agent = ReActAgent.from_tools([search_tool], llm=llm, verbose=True)  # ReAct reasons and acts sequentially, so parallel tool calls are not needed

# Example ReAct agent query
react_response = react_agent.chat("Tell me about Generative AI.")
print("ReAct Agent Response:", react_response)
        
