Why We Need LLM Output Gatekeeping - Policing AI Agents
In Computerphile's recent video on "Generative AI's Greatest Flaw," Mike Pound shows how indirect prompt injection (including hidden injections) can lead LLMs and agentic AI workflows to perform unintended or malicious actions. Imagine a financial AI agent being misdirected to swap account numbers or inflate the amount in an account transfer. Or worse.
Prompt injection attacks take many forms (e.g., jailbreaking, sidestepping, obfuscation), and they can now enter through multiple paths, including manipulated RAG-generated context, tampered file metadata, API poisoning, and image steganography. These attack vectors will only become more sophisticated and stealthy as AI expands to perform more real-world actions, especially autonomous, agent-driven ones.
Research on these attack vectors has focused on testing for indirect prompt injections, partly by using AI to generate indirect prompt injections to test against. A recent Google Security blog post describes a threat model and evaluation framework for indirect prompt injection attacks. They also built several "attack tools" for that threat evaluation framework.
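To make that kind of evaluation concrete, here is a minimal sketch, not the Google framework itself, of how an indirect prompt injection test loop might look: plant an adversarial instruction inside content the agent will retrieve, run the agent, and count how often the attacker's goal shows up in the output. All names here (InjectionCase, run_agent, attack_success_rate) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InjectionCase:
    benign_task: str       # what the user legitimately asked for
    poisoned_context: str  # retrieved document with a hidden instruction embedded in it
    attacker_goal: str     # marker that signals success (e.g., the attacker's account number)

def run_agent(task: str, context: str) -> str:
    """Placeholder for the agent pipeline under test: prompt the LLM with the task plus
    the retrieved context and return its raw output (text or serialized tool calls)."""
    raise NotImplementedError

def attack_success_rate(cases: list[InjectionCase]) -> float:
    """Fraction of test cases in which the attacker-controlled action appears in the output."""
    hits = 0
    for case in cases:
        output = run_agent(case.benign_task, case.poisoned_context)
        if case.attacker_goal in output:  # naive check; real evaluations parse the output
            hits += 1
    return hits / len(cases) if cases else 0.0
```

Real evaluation frameworks go further (multiple injection placements, varied phrasings, adaptive attacks), but even a simple loop like this makes regressions visible as the agent evolves.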
Of course, the first line of defense is training or fine-tuning the LLM to detect and prevent indirect prompt injection. In its 2023 "Adversarial Machine Learning" paper, NIST describes several mitigation approaches:
These are all useful, but they only address the input to the LLM, not its output. What is also needed is checking the output of the LLM, especially instructions destined for action-performing agents. Researchers at the University of Illinois describe how they used output parsing to evaluate the 'attack success rate' of indirect prompt injection attacks. Output parsing and analysis are key to protecting agentic AI systems: the output of an LLM should always be treated with suspicion before an agent is allowed to use it to drive tools and other actions, as sketched below.
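As a minimal illustration of treating LLM output as untrusted, the sketch below parses the model's proposed tool call as strict JSON and validates it against an allowlist of tools and expected arguments before anything executes. The tool names and schema are assumptions made for the example, not something taken from the cited research.

```python
import json

# Tools the agent may invoke, and the argument names each one accepts (hypothetical policy).
ALLOWED_TOOLS = {
    "get_balance": {"account_id"},
    "transfer_funds": {"from_account", "to_account", "amount"},
}

def parse_and_validate(llm_output: str) -> dict:
    """Treat the LLM's output as untrusted: it must be well-formed JSON, name an
    allowlisted tool, and supply exactly the expected arguments; otherwise reject it."""
    try:
        call = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not well-formed JSON: {exc}")
    if not isinstance(call, dict):
        raise ValueError("Output must be a JSON object describing one tool call")

    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"Tool '{tool}' is not on the allowlist")
    if set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"Unexpected arguments for '{tool}': {sorted(args)}")
    return call
```

Anything that fails validation is rejected outright rather than repaired, since silently fixing malformed output gives an injected instruction a second chance to slip through.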
In short, agentic AI applications should employ gatekeepers, a form of output policing, to help ensure that LLM output does not lead to malicious or unexpected agent behavior. Yes, this increases the application's cost, but that cost needs to be weighed against the risk of an unforeseen prompt resulting in loss or damage.
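To sketch how such a gatekeeper might work in the financial example above: before the transfer tool is ever invoked, cross-check the model's proposed action against the user's original intent as captured through a trusted channel (the original request form, not the LLM's output), and escalate high-value actions to a human. The names and threshold below are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class TransferIntent:
    to_account: str
    amount: float

HIGH_RISK_AMOUNT = 1_000.00  # illustrative threshold above which a human must sign off

def gatekeeper(proposed: TransferIntent, user_intent: TransferIntent) -> str:
    """Decide whether the agent's proposed transfer may proceed. The user's intent comes
    from a trusted channel (e.g., the original request form), never from the LLM's output."""
    if proposed.to_account != user_intent.to_account:
        return "block"  # destination changed: the classic injection outcome
    if proposed.amount > user_intent.amount:
        return "block"  # amount inflated beyond what the user asked for
    if proposed.amount >= HIGH_RISK_AMOUNT:
        return "needs_human_approval"
    return "allow"
```

The key design choice is that the baseline for comparison never passes through the LLM, so an injected instruction cannot rewrite both the action and the check that polices it.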
But how should these gatekeepers work? We want them to be as effective as the wizard Gandalf ("You shall not pass") and not as ineffective as the Black Knight in Monty Python and the Holy Grail ("None shall pass"). Last month, OWASP published "LLM05:2025 Improper Output Handling", which lists several mitigation strategies, including:
Here are two more to potentially add to that list:
These approaches may seem draconian or like overkill, but as agents grow more autonomous and powerful, they need to be kept in check, especially if they can perform actions that affect people's well-being. I am sure there are many more new ways to mitigate the dangers of prompt injection. Now we need to make sure they work well.
If you have experience using any of the above approaches or know of other indirect prompt injection mitigation strategies being used, please share your experiences in the comments.
#ArtificialIntelligence #LLM #LargeLanguageModels #PromptEngineering #AIResearch #MachineLearning #AgenticAI #AIAgents #TechInnovation #AITrends #SemanticAI #FutureOfAI