Summary of Agent Whitepaper by Google
I used Google Notebook to summarize this 42-page whitepaper by Google on agents.
1. Introduction
- Core Idea: The paper introduces the concept of Generative AI agents as applications that go beyond the capabilities of standalone models. Agents are designed to mimic human problem-solving by combining reasoning, logic and access to external tools.
- Human Analogy: Just as humans use tools to supplement knowledge, agents use tools to access real-time data and perform actions.
- Quote: "This combination of reasoning, logic, and access to external information that are all connected to a Generative AI model invokes the concept of an agent"
2. What is an Agent?
- Definition: A Generative AI agent is an application that attempts to achieve a goal by observing the world and acting upon it using the available tools.
- Autonomy: Agents are autonomous, acting independently to achieve goals without constant human intervention.
- Proactive: Agents can reason about next steps to achieve goals, even without explicit instructions.
- Scope: The whitepaper focuses specifically on agents that Generative AI models are capable of building.
- Cognitive Architecture: Agents are driven by a cognitive architecture made up of several components; the three key ones are the model, the tools, and the orchestration layer.
3. Foundational Components of an Agent's Cognitive Architecture
- 3.1 The Model
- Role: The language model (LM) acts as the central decision maker for the agent.
- Variety: Can be one or multiple LMs of different sizes and capabilities (e.g. general purpose, multimodal, fine-tuned).
- Reasoning: Models leverage reasoning frameworks such as ReAct, Chain-of-Thought, and Tree-of-Thoughts.
- Training: While models are not trained on the specific agent configuration, they can be refined with examples showcasing agent capabilities.
- 3.2 The Tools
- Purpose: Tools bridge the gap between language models and the outside world, allowing for interaction with external data and systems.
- Functionality: Enables agents to perform actions beyond the base model, like retrieving real-time data, updating databases, and making API calls.
- Examples: Include updating customer info in databases, fetching weather data, adjusting smart home settings and sending emails.
- Quote: "Foundational models, despite their impressive text and image generation, remain constrained by their inability to interact with the outside world. Tools bridge this gap, empowering agents to interact with external data and services while unlocking a wider range of actions beyond that of the underlying model alone."
- Tool Types: Focus is on three types of tools that Google models can interact with at the time of publishing: Extensions, Functions, and Data Stores.
- 3.2.1 Extensions
- Function: Bridge APIs and agents in a standardized way, enabling agents to execute APIs seamlessly.
- How They Work: They provide agents with examples and details on API parameters, allowing them to select the right extension dynamically.
- Example: A user query like “I want to book a flight from Austin to Zurich” is handled through an extension that can interface with an API, retrieving the required flight information.
- Flexibility: Extensions are configured independently of the agent, but form part of its overall configuration.
- Quote: “Extensions can be crafted independently of the agent, but should be provided as part of the agent’s configuration. The agent uses the model and examples at run time to decide which Extension, if any, would be suitable for solving the user’s query.”
- Example: The "Code Interpreter" extension allows the agent to generate and execute Python code based on natural language descriptions.
- 3.2.2 Functions
- Analogy: Similar to software development functions, they are self-contained code modules performing specific tasks.
- Model Decision: A model decides when to use a function and provides arguments based on the function’s specification.
- Execution: Functions are executed client-side, whereas Extensions are executed agent-side.
- Advantages: Developers get granular control, which suits scenarios where API calls must occur outside the agent’s direct flow because of security restrictions or timing constraints.
- Use Cases: API calls made from middleware or a front end, APIs behind authentication restrictions, asynchronous operations, and stubbing of APIs during development.
- Example: In a travel planning agent, function calling turns a list of city names into JSON, which is then sent to the client side to pull images.
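The client-side split described for Functions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the whitepaper: `display_cities`, the layout of the spec dictionary, the `example.com` URLs, and the hard-coded model output are all hypothetical stand-ins. The point is that the model only produces a function name and JSON arguments, while the client decides whether and how to execute the call.

```python
import json

# Hypothetical function specification that would be shown to the model.
DISPLAY_CITIES_SPEC = {
    "name": "display_cities",
    "description": "Show images for a list of cities.",
    "parameters": {"cities": "list of city names"},
}

def display_cities(cities):
    """Client-side implementation; a real client would call an image API here.

    The URLs are placeholders built from the city names.
    """
    return [
        f"https://example.com/images/{city.lower().replace(' ', '-')}.jpg"
        for city in cities
    ]

# Registry of functions the client is willing to execute.
CLIENT_FUNCTIONS = {"display_cities": display_cities}

def handle_model_output(raw):
    """Parse the model's structured output and run the named function client-side."""
    call = json.loads(raw)
    func = CLIENT_FUNCTIONS.get(call["name"])
    if func is None:
        raise ValueError(f"Model requested unknown function: {call['name']}")
    return func(**call["args"])

# Stand-in for what a model might return; no model is called in this sketch.
model_output = json.dumps(
    {"name": "display_cities", "args": {"cities": ["Austin", "Zurich"]}}
)

image_urls = handle_model_output(model_output)
print(image_urls)
```

Because execution stays on the client, the registry lookup is also a natural place to enforce authentication or to swap in a stub while the real API is not yet available.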
- 3.2.3 Data Stores
- Purpose: Address the limitation of models being constrained by their training data, providing access to dynamic, up-to-date information.
- Functionality: Enable agents to access and integrate external data, like spreadsheets, PDFs, and other documents.
- Implementation: Typically implemented as a vector database. Incoming documents are converted into vector embeddings and matched with user queries.
- Usage: Often used in Retrieval Augmented Generation (RAG) applications for extending model knowledge.
- Examples: Website content, structured and unstructured data in different formats.
- Process: A user query is converted into embeddings and matched against vector database content, and the retrieved content is sent back to the agent.
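The embed-and-match process described for Data Stores can be illustrated with a toy example. This is a sketch only: the bag-of-words `embed` function stands in for a learned embedding model, `ToyDataStore` stands in for a real vector database, and the documents are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyDataStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self._entries = []  # list of (document text, embedding) pairs

    def add(self, document):
        # Incoming documents are embedded once, at indexing time.
        self._entries.append((document, embed(document)))

    def search(self, query, top_k=1):
        # The query is embedded and matched against the stored vectors.
        query_vec = embed(query)
        ranked = sorted(
            self._entries, key=lambda entry: cosine(query_vec, entry[1]), reverse=True
        )
        return [document for document, _ in ranked[:top_k]]

store = ToyDataStore()
store.add("refund policy: refunds are issued within 14 days of purchase")
store.add("shipping policy: orders ship within 2 business days")
store.add("parental leave policy: employees receive 16 weeks of leave")

question = "how many days until orders ship"
retrieved = store.search(question, top_k=1)

# The retrieved content is handed back to the agent as extra prompt context.
prompt = f"Context: {retrieved[0]}\nQuestion: {question}"
print(prompt)
```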
- 3.3 The Orchestration Layer
- Function: Governs how the agent processes information, reasons, and makes decisions.
- Process: The orchestration layer manages a cyclical process that continues until the agent achieves its goal or reaches a stopping point.
- Complexity: Can range from simple calculations to complex chained logic and probabilistic reasoning techniques.
- Role: Responsible for maintaining memory, state, reasoning and planning.
- Frameworks: Uses prompt engineering frameworks to guide reasoning and planning (e.g., ReAct, Chain-of-Thought, Tree-of-Thoughts).
- Quote: “At the core of agent cognitive architectures lies the orchestration layer, responsible for maintaining memory, state, reasoning and planning.”
- ReAct Framework Example:
- Process: User query --> Thought --> Action (tool choice) --> Action Input --> Observation --> Repeat until Final Answer.
- Example: The whitepaper provides an example of using ReAct to choose the correct tools for a user query about flights.
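The Thought, Action, Action Input, Observation cycle above can be sketched as a small loop. This is an illustrative sketch, not the whitepaper's implementation: `scripted_model` is a hard-coded stand-in for a language model, `flight_search` is a hypothetical tool that returns canned text, and the `max_steps` cap is an assumed stopping point.

```python
def flight_search(query):
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return "Found 2 flights from Austin to Zurich"

TOOLS = {"flights": flight_search}

def scripted_model(scratchpad):
    """Stand-in for a language model: picks the next step from what it has seen so far."""
    if "Observation:" not in scratchpad:
        return {
            "thought": "I should search for flights.",
            "action": "flights",
            "action_input": "Austin to Zurich",
        }
    return {
        "thought": "I have what I need.",
        "final_answer": "There are 2 flights from Austin to Zurich.",
    }

def react_loop(question, model, tools, max_steps=5):
    """Repeat Thought -> Action -> Observation until a final answer or the step limit."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(scratchpad)
        scratchpad += f"Thought: {step['thought']}\n"
        if "final_answer" in step:
            scratchpad += f"Final Answer: {step['final_answer']}\n"
            return step["final_answer"], scratchpad
        # Run the chosen tool and feed its result back as an observation.
        observation = tools[step["action"]](step["action_input"])
        scratchpad += (
            f"Action: {step['action']}\n"
            f"Action Input: {step['action_input']}\n"
            f"Observation: {observation}\n"
        )
    return None, scratchpad  # stopping point reached without a final answer

answer, trace = react_loop(
    "I want to book a flight from Austin to Zurich", scripted_model, TOOLS
)
print(trace)
```

The scratchpad plays the role of the orchestration layer's short-term memory: each observation is appended so the next model call can reason over everything gathered so far.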
4. Agents vs. Models
- Key Differences:
- Knowledge: Models have limited knowledge based on training data, while agents extend knowledge through external systems.
- Inference: Models perform single inference/prediction without context management. Agents manage session history for multi-turn inference.
- Tools: Agents natively implement tools; models do not.
- Architecture: Agents have a native cognitive architecture; models need prompts to guide predictions.
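The inference difference above (single prediction versus managed session history) can be made concrete with a small sketch. Everything here is hypothetical: `echo_model` is a stand-in that only reports how much context it received, and `Session` is an assumed name for the history-keeping wrapper, not an API from the whitepaper.

```python
def echo_model(prompt):
    """Stand-in for a model call: reports how many lines of context it received."""
    return f"(model saw {len(prompt.splitlines())} line(s) of context)"

def single_inference(model, user_message):
    """Model-style call: each request is independent, with no memory of earlier turns."""
    return model(f"User: {user_message}")

class Session:
    """Agent-style wrapper that keeps session history across turns."""

    def __init__(self, model):
        self.model = model
        self.history = []

    def send(self, user_message):
        self.history.append(f"User: {user_message}")
        # The whole history is passed along, so later turns can refer to earlier ones.
        reply = self.model("\n".join(self.history))
        self.history.append(f"Agent: {reply}")
        return reply

session = Session(echo_model)
first = session.send("Find flights from Austin to Zurich")
second = session.send("Which one is cheapest?")
standalone = single_inference(echo_model, "Which one is cheapest?")

print(first)
print(second)
print(standalone)
```

In the second turn the session passes three lines of context, while the standalone call passes only the follow-up question, which on its own gives a model nothing to resolve "which one" against.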
5. Cognitive Architectures: How Agents Operate
- Human Analogy: Agents are likened to a chef, who plans, executes, and adjusts based on available information.
- Core Function: Agents use cognitive architectures to process information, make decisions, and refine actions.
- Orchestration Layer: The core of cognitive architecture, handling memory, state, reasoning and planning.
- Popular Frameworks:
- ReAct: A framework for language models to reason and act on queries.
- Chain-of-Thought (CoT): Enables reasoning through intermediate steps.
- Tree-of-Thoughts (ToT): Suited for exploration and strategic lookahead tasks.
6. Enhancing Model Performance with Targeted Learning
- Problem: General model training may not be sufficient for complex, real-world tool use.
- Approaches:
- In-context learning: Model learns on the fly with prompts, tools, and examples at inference time.
- Retrieval-based in-context learning: Dynamically populates the prompt with relevant info from external memory.
- Fine-tuning: Training the model on a dataset of specific examples before inference.
- Cooking Analogy:
- In-context learning: A chef learning to cook a new dish from a recipe, a few ingredients, and an example.
- Retrieval-based learning: A chef using a pantry of ingredients and cookbooks to enhance the dish.
- Fine-tuning: Sending a chef back to school to learn a specific cuisine.
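Retrieval-based in-context learning, as described above, can be sketched as a prompt builder that pulls the most relevant stored examples into the prompt at inference time. This is a rough sketch under assumptions: `EXAMPLE_STORE`, the tool-call strings, and the word-overlap score are all made up for illustration, and a real system would more likely rank examples with embeddings.

```python
# Hypothetical external memory of past queries and the tool calls that answered them.
EXAMPLE_STORE = [
    {"query": "book a flight to paris", "tool_call": "flights(destination='Paris')"},
    {"query": "what is the weather in tokyo", "tool_call": "weather(city='Tokyo')"},
    {"query": "send an email to my manager", "tool_call": "email(recipient='manager')"},
]

def overlap(a, b):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_prompt(user_query, store, k=1):
    """Populate the prompt with the k stored examples most similar to the query."""
    ranked = sorted(store, key=lambda ex: overlap(user_query, ex["query"]), reverse=True)
    shots = "\n".join(
        f"Query: {ex['query']}\nTool call: {ex['tool_call']}" for ex in ranked[:k]
    )
    return f"{shots}\nQuery: {user_query}\nTool call:"

prompt = build_prompt("book a flight to Zurich", EXAMPLE_STORE, k=1)
print(prompt)
```

Plain in-context learning would use a fixed set of examples in every prompt; the only change here is that the examples are selected per query.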
7. Agent Quick Start with LangChain
- Purpose: Showcases a practical agent example using the LangChain and LangGraph libraries.
- Example: Building an agent using Gemini 1.5 Flash and tools like SerpAPI (Google Search) and Google Places API.
- Functionality: The agent is able to answer multi-stage queries, including searching for information and looking up addresses.
- Components: Model, orchestration, and tools, all working together.
8. Production Applications with Vertex AI Agents
- Focus: Integration of agents with user interfaces, evaluation frameworks, and continuous improvement mechanisms.
- Vertex AI Platform: Offers a fully managed environment for building production-grade agents.
- Features: Natural language interface, tool definitions, sub-agent task delegation, testing and evaluation tools, managed infrastructure.
- Key takeaway: Simplifies agent development, allowing developers to focus on functionality.
9. Summary & Key Takeaways
- Agents as Extensions: Agents extend LLMs by enabling them to leverage tools for real-time access, actions and autonomous task planning.
- Orchestration Layer: The key driver for agent operation, structuring reasoning and decision making with frameworks such as ReAct, CoT and ToT.
- Tools: Extensions, Functions and Data Stores provide the means for agents to interact with the real world. Extensions bridge the gap with external APIs. Functions offer client-side control. Data Stores provide access to dynamic data.
- Future Direction: "Agent chaining," which combines specialised agents, is a promising approach for solving complex problems in different domains.
- Iterative Development: Building complex agents requires experimentation and refinement.
In Conclusion: This whitepaper provides a good overview of how Generative AI agents work, their core components and how they interact to execute specific tasks. The distinction between models and agents is clear, and the examples for how tools can be implemented are valuable. It highlights the importance of the orchestration layer, cognitive architectures and targeted learning in enhancing model performance and how agents are evolving to address complex problem-solving scenarios.