Why We Need LLM Output Gatekeeping - Policing AI Agents
In Computerphile's recent video on "Generative AI's Greatest Flaw," Mike Pound shows how indirect prompt injection (including hidden injections) can lead LLMs and agentic AI workflows to perform unintended or malicious actions. Imagine a financial AI agent being misdirected to swap account numbers or inflate the amount in an account transfer. Or worse.
Prompt injection attacks take many forms (e.g., jailbreaking, sidestepping, obfuscation), and they can now enter through multiple paths, including manipulated RAG-generated context, tampered file metadata, API poisoning, and image steganography. These attack vectors will only become more sophisticated and stealthy as AI expands to perform more real-world actions, especially autonomous, agent-driven ones.
Research on these attack vectors has focused on testing for indirect prompt injections, partly by using AI to generate indirect prompt injections to test against. A recent Google Security blog post describes a threat model and evaluation framework for indirect prompt injection attacks. They also built several "attack tools" for that threat evaluation framework.
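To make that kind of evaluation concrete, here is a minimal sketch, not the Google framework itself, of how an indirect prompt injection test loop might look: plant an adversarial instruction inside content the agent will retrieve, run the agent, and count how often the attacker's goal shows up in the output. All names here (InjectionCase, run_agent, attack_success_rate) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class InjectionCase:
    benign_task: str       # what the user legitimately asked for
    poisoned_context: str  # retrieved document with a hidden instruction embedded in it
    attacker_goal: str     # marker that signals success (e.g., the attacker's account number)

def run_agent(task: str, context: str) -> str:
    """Placeholder for the agent pipeline under test: prompt the LLM with the task plus
    the retrieved context and return its raw output (text or serialized tool calls)."""
    raise NotImplementedError

def attack_success_rate(cases: list[InjectionCase]) -> float:
    """Fraction of test cases in which the attacker-controlled action appears in the output."""
    hits = 0
    for case in cases:
        output = run_agent(case.benign_task, case.poisoned_context)
        if case.attacker_goal in output:  # naive check; real evaluations parse the output
            hits += 1
    return hits / len(cases) if cases else 0.0
```

Real evaluation frameworks go further (multiple injection placements, varied phrasings, adaptive attacks), but even a simple loop like this makes regressions visible as the agent evolves.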
Of course, the first line of defense is training or fine-tuning the LLM to detect and prevent indirect prompt injection. In its 2023 "Adversarial Machine Learning" paper, NIST describes several mitigation approaches:
These are all useful, but they only address the input to the LLM, not its output. What is also needed is checking the output of the LLM, especially instructions destined for action-performing agents. Researchers at the University of Illinois describe how they used output parsing to evaluate the 'attack success rate' of indirect prompt injection attacks. Output parsing and analysis are key to protecting agentic AI systems: the output of an LLM should always be treated with suspicion before an agent is allowed to use it to drive tools and other actions, as sketched below.
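As a minimal illustration of treating LLM output as untrusted, the sketch below parses the model's proposed tool call as strict JSON and validates it against an allowlist of tools and expected arguments before anything executes. The tool names and schema are assumptions made for the example, not something taken from the cited research.

```python
import json

# Tools the agent may invoke, and the argument names each one accepts (hypothetical policy).
ALLOWED_TOOLS = {
    "get_balance": {"account_id"},
    "transfer_funds": {"from_account", "to_account", "amount"},
}

def parse_and_validate(llm_output: str) -> dict:
    """Treat the LLM's output as untrusted: it must be well-formed JSON, name an
    allowlisted tool, and supply exactly the expected arguments; otherwise reject it."""
    try:
        call = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not well-formed JSON: {exc}")
    if not isinstance(call, dict):
        raise ValueError("Output must be a JSON object describing one tool call")

    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"Tool '{tool}' is not on the allowlist")
    if set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"Unexpected arguments for '{tool}': {sorted(args)}")
    return call
```

Anything that fails validation is rejected outright rather than repaired, since silently fixing malformed output gives an injected instruction a second chance to slip through.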
In short, agentic AI applications should employ gatekeepers, a form of output policing, to help ensure that LLM output does not lead to malicious or unexpected agent behavior. Yes, this increases the application's cost, but that cost needs to be weighed against the risk of an unforeseen prompt resulting in loss or damage.
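To sketch how such a gatekeeper might work in the financial example above: before the transfer tool is ever invoked, cross-check the model's proposed action against the user's original intent as captured through a trusted channel (the original request form, not the LLM's output), and escalate high-value actions to a human. The names and threshold below are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class TransferIntent:
    to_account: str
    amount: float

HIGH_RISK_AMOUNT = 1_000.00  # illustrative threshold above which a human must sign off

def gatekeeper(proposed: TransferIntent, user_intent: TransferIntent) -> str:
    """Decide whether the agent's proposed transfer may proceed. The user's intent comes
    from a trusted channel (e.g., the original request form), never from the LLM's output."""
    if proposed.to_account != user_intent.to_account:
        return "block"  # destination changed: the classic injection outcome
    if proposed.amount > user_intent.amount:
        return "block"  # amount inflated beyond what the user asked for
    if proposed.amount >= HIGH_RISK_AMOUNT:
        return "needs_human_approval"
    return "allow"
```

The key design choice is that the baseline for comparison never passes through the LLM, so an injected instruction cannot rewrite both the action and the check that polices it.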
But how should these gatekeepers work? We want them to be as effective as the wizard Gandalf ("You shall not pass") and not as ineffective as the Black Knight in Monty Python and the Holy Grail ("None shall pass"). Last month, OWASP published "LLM05:2025 Improper Output Handling", which lists several mitigation strategies, including:
Here are two more to potentially add to that list:
These approaches may seem draconian or like overkill, but as agents grow more autonomous and powerful, they need to be kept in check, especially if they can perform actions that affect people's well-being. I am sure there are many more new ways to mitigate the dangers of prompt injection. Now we need to make sure they work well.
If you have experience using any of the above approaches or know of other indirect prompt injection mitigation strategies being used, please share your experiences in the comments.
#ArtificialIntelligence #LLM #LargeLanguageModels #PromptEngineering #AIResearch #MachineLearning #AgenticAI #AIAgents #TechInnovation #AITrends #SemanticAI #FutureOfAI