Summary of Agent Whitepaper by Google

Neil King

Published: January 8, 2025

I used Google Notebook to summarize this 42-page whitepaper by Google on agents.

1. Introduction

Core Idea: The paper introduces the concept of Generative AI agents as applications that go beyond the capabilities of standalone models. Agents are designed to mimic human problem-solving by combining reasoning, logic and access to external tools.
Human Analogy: Just as humans use tools to supplement knowledge, agents use tools to access real-time data and perform actions.
Quote: "This combination of reasoning, logic, and access to external information that are all connected to a Generative AI model invokes the concept of an agent"

2. What is an Agent?

Definition: A Generative AI agent is an application that attempts to achieve a goal by observing the world and acting upon it using the available tools.
Autonomy: Agents are autonomous, acting independently to achieve goals without constant human intervention.
Proactive: Agents can reason about next steps to achieve goals, even without explicit instructions.
Scope: The whitepaper focuses specifically on agents that Generative AI models are capable of building.
Cognitive Architecture: Agents are driven by a cognitive architecture made up of several components, the three key ones being the model, the tools, and the orchestration layer.

3. Foundational Components of an Agent's Cognitive Architecture

3.1 The Model
Role: The language model (LM) acts as the central decision maker for the agent.
Variety: Can be one or multiple LMs of different sizes and capabilities (e.g. general purpose, multimodal, fine-tuned).
Reasoning: Models leverage reasoning frameworks such as ReAct, Chain-of-Thought, and Tree-of-Thoughts.
Training: While models are not trained on the specific agent configuration, they can be refined with examples showcasing agent capabilities.
3.2 The Tools
Purpose: Tools bridge the gap between language models and the outside world, allowing for interaction with external data and systems.
Functionality: Enables agents to perform actions beyond the base model, like retrieving real-time data, updating databases, and making API calls.
Examples: Include updating customer info in databases, fetching weather data, adjusting smart home settings and sending emails.
Quote: "Foundational models, despite their impressive text and image generation, remain constrained by their inability to interact with the outside world. Tools bridge this gap, empowering agents to interact with external data and services while unlocking a wider range of actions beyond that of the underlying model alone."
Tool Types: Focus is on three types of tools that Google models can interact with at the time of publishing: Extensions, Functions, and Data Stores.
3.2.1 Extensions
Function: Bridge APIs and agents in a standardized way, enabling agents to execute APIs seamlessly.
How They Work: They provide agents with examples and details on API parameters, allowing them to select the right extension dynamically.
Example: A user query like "I want to book a flight from Austin to Zurich" is handled through an extension that can interface with an API, retrieving the required flight information.
Flexibility: Extensions are configured independently of the agent, but form part of its overall configuration.
Quote: "Extensions can be crafted independently of the agent, but should be provided as part of the agent's configuration. The agent uses the model and examples at run time to decide which Extension, if any, would be suitable for solving the user's query."
Example: The "Code Interpreter" extension allows the agent to generate and execute Python code based on natural language descriptions.
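The selection step described above can be sketched roughly as follows. This is a minimal illustration, not Google's Extensions API: the `Extension` class, `select_extension`, and the two stub extensions are all hypothetical, and simple word overlap stands in for the model's judgment about which extension's examples best match the query.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Extension:
    """Hypothetical extension: an API wrapper plus example queries."""
    name: str
    examples: list[str]
    call: Callable[[str], str]


def _words(text: str) -> set[str]:
    return set(text.lower().split())


def select_extension(query: str, extensions: list[Extension]) -> Optional[Extension]:
    """Pick the extension whose examples share the most words with the query.

    Word overlap is a stand-in for the model's decision. Returns None when
    nothing overlaps, mirroring "which Extension, if any".
    """
    best, best_score = None, 0
    for ext in extensions:
        score = max((len(_words(query) & _words(ex)) for ex in ext.examples), default=0)
        if score > best_score:
            best, best_score = ext, score
    return best


# Stub extensions; a real one would wrap a live API.
flights = Extension(
    name="flights",
    examples=["book a flight from Austin to Zurich", "find flights to Paris"],
    call=lambda q: f"[stub flights API] searching for: {q}",
)
weather = Extension(
    name="weather",
    examples=["what is the weather in Boston", "will it rain tomorrow"],
    call=lambda q: f"[stub weather API] forecast for: {q}",
)

chosen = select_extension("I want to book a flight from Austin to Zurich", [flights, weather])
print(chosen.name if chosen else "no extension")
```

Note that the extensions are defined on their own and only handed to the selector as a list, which mirrors the point that extensions are crafted independently but supplied as part of the agent's configuration.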
3.2.2 Functions
Analogy: Similar to software development functions, they are self-contained code modules performing specific tasks.
Model Decision: A model decides when to use a function and provides arguments based on the function's specification.
Execution: Functions are executed client-side, whereas Extensions are executed agent-side.
Advantages: Developers have granular control, which suits scenarios where API calls must occur outside the agent's direct flow because of security restrictions or timing constraints.
Use Cases: API calls made in middleware or the front end, calls that require authentication, asynchronous operations, and stubbing of APIs.
Example: Function calling is used in a travel-planning agent: the model converts a list of city names into JSON, which is then sent to the client-side server to pull images.
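The division of labor between model and client can be sketched like this. It is an illustration under stated assumptions: `fake_model` is a scripted stand-in for a real model call, and the `display_cities` function, its specification, and the city names are hypothetical. The point is only that the model emits a function name and arguments as JSON, and the client decides how and when to execute it.

```python
import json

# Hypothetical function specification the model would be shown.
DISPLAY_CITIES_SPEC = {
    "name": "display_cities",
    "description": "Show a list of cities matching the user's preference.",
    "parameters": {"cities": "list of city names", "preferences": "search keyword"},
}


def fake_model(user_query: str) -> str:
    """Stand-in for the model: returns a function call as JSON and executes nothing."""
    return json.dumps({
        "name": "display_cities",
        "args": {"cities": ["Crested Butte", "Whistler", "Zermatt"], "preferences": "skiing"},
    })


def display_cities(cities: list[str], preferences: str) -> list[str]:
    """Client-side implementation; a real one might call an images API here."""
    return [f"{city} ({preferences})" for city in cities]


# Client-side registry mapping function names to implementations.
CLIENT_FUNCTIONS = {"display_cities": display_cities}


def handle_model_output(raw: str) -> list[str]:
    """Parse the model's JSON and run the named function on the client."""
    call = json.loads(raw)
    func = CLIENT_FUNCTIONS[call["name"]]
    return func(**call["args"])


result = handle_model_output(fake_model("Where can I go skiing with my family?"))
print(result)
```

Because execution happens in `handle_model_output`, the client is free to add authentication, defer the call, or swap in a stub without changing anything on the model side.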
3.2.3 Data Stores
Purpose: Address the limitation of models being constrained by their training data, providing access to dynamic, up-to-date information.
Functionality: Enable agents to access and integrate external data, like spreadsheets, PDFs, and other documents.
Implementation: Typically implemented as a vector database. Incoming documents are converted into vector embeddings and matched with user queries.
Usage: Often used in Retrieval Augmented Generation (RAG) applications for extending model knowledge.
Examples: Website content, structured and unstructured data in different formats.
Process: A user query is converted into embeddings and matched against vector database content, and the retrieved content is sent back to the agent.
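The embed-and-match process can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: `embed` uses bag-of-words counts in place of a real embedding model, `ToyVectorStore` stands in for a vector database, and the sample documents are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts, a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[Counter, str]] = []

    def add(self, document: str) -> None:
        # Documents are embedded once, at ingestion time.
        self._items.append((embed(document), document))

    def search(self, query: str, k: int = 1) -> list[str]:
        # The query is embedded the same way, then matched by similarity.
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


store = ToyVectorStore()
store.add("parental leave policy grants sixteen weeks")
store.add("office wifi password resets every month")
store.add("expense reports are due on friday")

# Retrieved content is handed back to the agent as context for its answer.
retrieved = store.search("what is the parental leave policy")
prompt = f"Context: {retrieved[0]}\nQuestion: what is the parental leave policy"
print(prompt)
```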
3.3 The Orchestration Layer
Function: Governs how the agent processes information, reasons, and makes decisions.
Process: The orchestration layer manages a cyclical process that continues until the agent achieves its goal or reaches a stopping point.
Complexity: Can range from simple calculations to complex chained logic and probabilistic reasoning techniques.
Role: Responsible for maintaining memory, state, reasoning and planning.
Frameworks: Uses prompt engineering frameworks to guide reasoning and planning (e.g., ReAct, Chain-of-Thought, Tree-of-Thoughts).
Quote: "At the core of agent cognitive architectures lies the orchestration layer, responsible for maintaining memory, state, reasoning and planning."
ReAct Framework Example:
Process: User query --> Thought --> Action (tool choice) --> Action Input --> Observation --> Repeat until Final Answer.
Example: The whitepaper provides an example of using ReAct to choose the correct tools for a user query about flights.
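The ReAct cycle above can be sketched as a small loop. This is not the whitepaper's code: the model is replaced by a scripted stub (`SCRIPT` and `scripted_model` are hypothetical) so the control flow can be followed without any API access, and the single `flights` tool is a placeholder.

```python
from typing import Callable, Optional

# Hypothetical tool; a real agent would call a live API here.
TOOLS: dict[str, Callable[[str], str]] = {
    "flights": lambda q: f"3 flights found for '{q}'",
}

# Scripted stand-in for the model: one ReAct-formatted reply per turn.
SCRIPT = [
    "Thought: I should search for flights\nAction: flights\nAction Input: Austin to Zurich",
    "Thought: I have what I need\nFinal Answer: There are 3 flights from Austin to Zurich.",
]


def scripted_model(prompt: str, turn: int) -> str:
    """Ignores the prompt and replays the script; a real model would read the prompt."""
    return SCRIPT[turn]


def parse_field(reply: str, field: str) -> Optional[str]:
    """Return the text after 'field:' on the line that starts with it, if any."""
    prefix = field + ":"
    for line in reply.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def run_react(query: str, max_steps: int = 5) -> str:
    """Loop Thought -> Action -> Action Input -> Observation until a Final Answer."""
    prompt = f"Question: {query}"
    for turn in range(max_steps):
        reply = scripted_model(prompt, turn)
        final = parse_field(reply, "Final Answer")
        if final is not None:
            return final
        tool = TOOLS[parse_field(reply, "Action")]
        observation = tool(parse_field(reply, "Action Input"))
        # The observation is appended so the next turn can reason over it.
        prompt += f"\n{reply}\nObservation: {observation}"
    return "Stopped: step limit reached"


answer = run_react("Find me flights from Austin to Zurich")
print(answer)
```

The `max_steps` cap corresponds to the "stopping point" mentioned earlier: the loop ends either with a final answer or when the step budget runs out.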

4. Agents vs. Models

Key Differences:
Knowledge: Models have limited knowledge based on training data, while agents extend knowledge through external systems.
Inference: Models perform single inference/prediction without context management. Agents manage session history for multi-turn inference.
Tools: Agents natively implement tools; models do not.
Architecture: Agents have a native cognitive architecture; models need prompts to guide predictions.
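The inference difference can be made concrete with a small sketch. The names are hypothetical: `stub_model` plays the role of a stateless model call that sees only the prompt it is given, and `AgentSession` is an illustrative wrapper that keeps the session history and replays it on every turn.

```python
def stub_model(prompt: str) -> str:
    """Stand-in for a stateless model call: it sees only the prompt it is given."""
    return f"(reply based on {prompt.count('User:')} user turn(s))"


class AgentSession:
    """Keeps the conversation history and replays it on every call."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def send(self, user_message: str) -> str:
        self.history.append(f"User: {user_message}")
        # The full history, not just the latest message, becomes the prompt.
        reply = stub_model("\n".join(self.history))
        self.history.append(f"Agent: {reply}")
        return reply


session = AgentSession()
first = session.send("Find flights to Zurich")
second = session.send("What about next week?")
print(first)
print(second)

# Calling the model directly carries no memory of earlier turns.
print(stub_model("User: What about next week?"))
```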

5. Cognitive Architectures: How Agents Operate

Human Analogy: Agents are likened to a chef, who plans, executes, and adjusts based on available information.
Core Function: Agents use cognitive architectures to process information, make decisions, and refine actions.
Orchestration Layer: The core of cognitive architecture, handling memory, state, reasoning and planning.
Popular Frameworks:
ReAct: A framework for language models to reason and act on queries.
Chain-of-Thought (CoT): Enables reasoning through intermediate steps.
Tree-of-Thoughts (ToT): Suited for exploration and strategic lookahead tasks.

6. Enhancing Model Performance with Targeted Learning

Problem: General model training may not be sufficient for complex, real-world tool use.
Approaches:
In-context learning: Model learns on the fly with prompts, tools, and examples at inference time.
Retrieval-based in-context learning: Dynamically populates the prompt with relevant info from external memory.
Fine-tuning: Training the model on a dataset of specific examples before inference.
Cooking Analogy:
In-context learning: A chef learning to cook a new dish from a recipe, a few ingredients, and an example.
Retrieval-based learning: A chef using a pantry of ingredients and cookbooks to enhance the dish.
Fine-tuning: Sending a chef back to school to learn a specific cuisine.
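Retrieval-based in-context learning, the second approach, can be sketched as a prompt builder that pulls the most relevant worked example from an external memory. This is a rough illustration: `EXAMPLE_MEMORY`, `overlap`, and `build_prompt` are hypothetical, and word overlap stands in for a real similarity search.

```python
# Hypothetical external memory of worked tool-use examples.
EXAMPLE_MEMORY = [
    ("What is the weather in Paris?", "Action: weather\nAction Input: Paris"),
    ("Book a flight to Rome", "Action: flights\nAction Input: Rome"),
    ("Send an email to Sam", "Action: email\nAction Input: Sam"),
]


def overlap(a: str, b: str) -> int:
    """Count shared words; a stand-in for embedding similarity."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def build_prompt(query: str, k: int = 1) -> str:
    """Populate the prompt with the k stored examples most similar to the query."""
    ranked = sorted(EXAMPLE_MEMORY, key=lambda ex: overlap(query, ex[0]), reverse=True)
    shots = "\n\n".join(f"Q: {q}\n{a}" for q, a in ranked[:k])
    return f"{shots}\n\nQ: {query}\n"


prompt = build_prompt("Book a flight to Zurich")
print(prompt)
```

Plain in-context learning would use a fixed set of examples in every prompt, while fine-tuning would move the examples out of the prompt and into a training dataset.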

7. Agent Quick Start with LangChain

Purpose: Showcases a practical agent example using the LangChain and LangGraph libraries.
Example: Building an agent using Gemini 1.5 Flash and tools like SerpAPI (Google Search) and Google Places API.
Functionality: The agent is able to answer multi-stage queries, including searching for information and looking up addresses.
Components: Model, orchestration, and tools, all working together.

8. Production Applications with Vertex AI Agents

Focus: Integration of agents with user interfaces, evaluation frameworks, and continuous improvement mechanisms.
Vertex AI Platform: Offers a fully managed environment for building production-grade agents.
Features: Natural language interface, tool definitions, sub-agent task delegation, testing and evaluation tools, managed infrastructure.
Key takeaway: Simplifies agent development, allowing developers to focus on functionality.

9. Summary & Key Takeaways

Agents as Extensions: Agents extend LLMs by enabling them to leverage tools for real-time access, actions and autonomous task planning.
Orchestration Layer: The key driver for agent operation, structuring reasoning and decision making with frameworks such as ReAct, CoT and ToT.
Tools: Extensions, Functions and Data Stores provide the means for agents to interact with the real world. Extensions bridge the gap with external APIs. Functions offer client-side control. Data Stores provide access to dynamic data.
Future Direction: "Agent chaining," which combines specialised agents, is a promising approach for solving complex problems in different domains.
Iterative Development: Building complex agents requires experimentation and refinement.

In Conclusion: This whitepaper provides a good overview of how Generative AI agents work, what their core components are, and how those components interact to execute specific tasks. The distinction between models and agents is clear, and the examples of how tools can be implemented are valuable. It highlights the importance of the orchestration layer, cognitive architectures, and targeted learning in enhancing model performance, and it shows how agents are evolving to address complex problem-solving scenarios.
