AI Agents Exploiting Zero-Days
What is this post all about?
In this post, we’ll dive into the risks of AI agents being used for malicious purposes. I’ll guide you through a scenario, explore the challenges in app development, and highlight recent research on how AI agents have exploited zero-day vulnerabilities.
I’m not saying I have all the answers, but I’d like to explore the potential threats we might face and how AI could be used in ways we never intended.
Allow me to paint a scenario for you
In the near future, malicious PenTest AI agents will scan open-source code for vulnerabilities to weaponize.
Their target? Organizations that use this vulnerable code in their stack. Once a weakness is found, they autonomously craft exploits and place them in ready-to-use repositories.
Initial Access AI Agents then weaponize these exploits and automatically target organizations. Upon success, new shells pop up, and the access is sold on the dark web for XMR (Monero).
The profits fund more GPU power, fueling the next wave of PenTest AI agents. This creates a self-sustaining loop - more computing, more vulnerabilities, more attacks, and more sales.
Do you see where I’m going?
Another malicious actor buys the access and uses RedTeam AI agents to explore the target’s environment, mimicking human behavior to avoid detection while gathering intel.
The agents' mission: survive, stay stealthy, and replicate. All data analysis is done locally, thanks to the growing integration of LLMs (and other AI models) into devices like smartphones and laptops, making these agents self-sufficient without needing constant C2 interaction.
So, what value do these AI agents bring to a senior RedTeam operator?
They map the network, gather intel, and propose attack vectors, allowing the operator to move laterally and stay under the radar more effectively. Speed is key. Multiple attack vectors are exploited at once, and strategies are constantly adjusted as new intel comes in.
The operator manages the operation, assigning tasks to dozens of self-replicating agents swarming the network, each with its own mission. And all of this unfolds in hours, not days.
With a few commands, a RedTeam operator can:
·?????? "Create a stealthy C2 connection for backup communication"
·?????? "Generate malicious traffic to distract the Human Ops team"
·?????? "Suggest low-risk attack plans with fallback options based on collected data"
In the wrong hands, these tools enable rapid decision-making and fast weaponization of intel, capabilities that are critical in today's cyber warfare.
Also, the speed at which persistence could be established is alarming, leaving a non-AI-powered Blue Team struggling to keep up.
Sounds like science fiction? Let me show you why it might not be as far off as it seems.
Understanding the Challenges in Application Development
In a recent report on Application Security, Checkmarx, a leading provider of cloud-native application security solutions, summed it up perfectly:
“There is more software deployed in more environments, and less time available to secure it”
They surveyed 1504 developers, CISOs, and AppSec managers from a broad range of industries across the US, Europe, and Asia-Pacific regions. Here’s what they found:
- 92% of companies had a breach due to an application they developed
- 91% of companies have knowingly released vulnerable applications
- Tight business deadlines are a leading reason for deploying vulnerable code
At the same time, Docker’s recent AI Trends Report shows a rising use of AI-assisted coding tools, which are becoming a key part of modern development:
- 64% of developers report using AI at work, with 33% primarily using it for coding tasks
- The platforms developers use most are ChatGPT (46%), GitHub Copilot (30%), and Bard (19%)
What do ChatGPT, GitHub Copilot, and Bard have in common? None of them guarantee secure coding practices. This means developers are increasingly relying on tools that may suggest vulnerable code.
An exception is GitHub’s Copilot Autofix, available through GitHub Advanced Security (GHAS) for Enterprise customers. Copilot Autofix not only identifies vulnerabilities but also explains their importance and suggests code fixes to help developers address them quickly.
Don't take my word for it; here is a snippet from Snyk's AI Code Security Report:
“Despite their high levels of adoption, AI coding tools consistently generate insecure code”
This shows that while AI tools are widely used, they can still suggest code with vulnerabilities. The report found that:
- 56.4% of respondents commonly encounter security issues in AI code suggestions
And lastly, a study from Stanford University has shown that AI code assistants, such as GitHub Copilot, can lead developers to write more insecure code while giving them a false sense of security about the quality of their work.
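To make the risk concrete, here is the kind of pattern an assistant can easily produce, next to the safer alternative. This is an illustrative Python/sqlite3 snippet I wrote for this post, not output captured from any particular tool:

```python
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # Vulnerable pattern: user input is concatenated straight into the query (SQL injection, CWE-89).
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver keeps input like "x' OR '1'='1" as plain data.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchall()
```

Both functions "work" in a demo, which is exactly why the insecure version slips through tight deadlines.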
Today's Capabilities in Vulnerability Discovery & Weaponization
Let’s take a step back and ask a simple question: Can LLMs really detect security vulnerabilities in code?
“The short answer is yes, they can - but to what extent?”
Meet IRIS: the first approach to combine LLMs with static analysis for full-repository security vulnerability detection. Researchers tested 8 open and closed-source LLMs against the CWE-Bench-Java dataset, featuring 120 manually validated security vulnerabilities from real-world Java projects (ranging from 300K to 7M lines of code). This is what they found:
- IRIS detected 69 vulnerabilities, while the CodeQL static analysis tool found only 27
- IRIS also significantly reduces the number of false alarms (by more than 80% in the best case)
- IRIS achieves its best results with GPT-4, but the model can be swapped for the more cost-effective DeepSeekCoder 8B, which detects 67 vulnerabilities
- Larger models like GPT-4 reduce false positives thanks to their improved reasoning abilities
- CWE-78 (OS Command Injection) is especially difficult for most LLMs to detect
This study reinforces the need for whole-project reasoning to detect vulnerabilities, rather than focusing solely on individual methods.
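The general shape of "LLM plus static analysis" can be sketched roughly like this: the LLM labels which library APIs matter for a given weakness class, a whole-repository taint analysis uses those labels, and the LLM then triages the alerts. This is my simplified reading of the idea, not the authors' implementation; extract_candidate_apis, run_taint_analysis, and llm_complete are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class ApiSpec:
    method: str   # fully qualified method name, e.g. "java.lang.Runtime.exec"
    role: str     # "source" or "sink"

def infer_specs(candidate_apis: list[str], cwe: str, llm_complete) -> list[ApiSpec]:
    # Ask the LLM to label each external API as a taint source or sink for the target CWE.
    specs = []
    for api in candidate_apis:
        prompt = (
            f"For {cwe}, is the method `{api}` a taint source, a taint sink, "
            "or neither? Answer with one word."
        )
        role = llm_complete(prompt).strip().lower()
        if role in ("source", "sink"):
            specs.append(ApiSpec(api, role))
    return specs

def detect(repo_path: str, cwe: str, llm_complete,
           extract_candidate_apis, run_taint_analysis) -> list[dict]:
    # 1. Static pass: collect the external API calls the project makes.
    candidates = extract_candidate_apis(repo_path)
    # 2. LLM pass: label those APIs for the target weakness class.
    specs = infer_specs(candidates, cwe, llm_complete)
    # 3. Whole-repository taint analysis driven by the inferred specs.
    alerts = run_taint_analysis(repo_path, specs)
    # 4. LLM triage: keep only alerts the model judges plausible, cutting false positives.
    kept = []
    for alert in alerts:
        verdict = llm_complete(f"Is this {cwe} finding a real vulnerability?\n{alert}")
        if verdict.strip().lower().startswith("yes"):
            kept.append(alert)
    return kept
```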
So, we’ve detected vulnerabilities - let’s weaponize them!
In the realm of AI-driven vulnerability exploitation, studies show that while LLM agents can successfully exploit real-world vulnerabilities when given specific descriptions (CVEs) and simple test cases, they struggle with unknown, zero-day vulnerabilities.
This changed with a recent study from the University of Illinois Urbana-Champaign, where researchers demonstrated that teams of LLM agents can exploit real-world, zero-day vulnerabilities.
They identified a key obstacle in previous generations of agents: the difficulty in exploring multiple vulnerabilities and executing long-range planning when used individually.
Their solution was to introduce Hierarchical Planning and Task-Specific Agents (HPTSA), a system where a central planning agent can deploy subagents. The planning agent explores the system and decides which subagents to activate, helping to solve long-term planning challenges when testing different vulnerabilities.
Let’s put it this way: The head chef (planner) creates a plan and assigns tasks to different chefs (agents). One chef might handle desserts (XSS vulnerabilities), while another focuses on main courses (SQL injection vulnerabilities). Each chef works on their specialty to make sure everything is done right.
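A minimal sketch of that hierarchy is below. The class names and prompts are made up for illustration (this is not the paper's code), and it only shows the planner-to-specialist dispatch structure, not any actual exploitation logic:

```python
class SubAgent:
    """An agent specialized in one vulnerability class (e.g. XSS or SQL injection)."""
    def __init__(self, specialty: str, llm_complete):
        self.specialty = specialty
        self.llm_complete = llm_complete

    def attempt(self, target_url: str, notes: str) -> str:
        prompt = (
            f"You specialize in {self.specialty}. Given the recon notes below, "
            f"describe how you would probe {target_url} and report the outcome.\n{notes}"
        )
        return self.llm_complete(prompt)

class Planner:
    """Explores the target and decides which specialists to dispatch."""
    def __init__(self, subagents: dict[str, SubAgent], llm_complete):
        self.subagents = subagents
        self.llm_complete = llm_complete

    def run(self, target_url: str, recon_notes: str) -> dict[str, str]:
        plan = self.llm_complete(
            f"Recon notes for {target_url}:\n{recon_notes}\n"
            f"Which of these specialists should investigate: {list(self.subagents)}? "
            "Answer as a comma-separated list."
        )
        chosen = [name.strip() for name in plan.split(",") if name.strip() in self.subagents]
        # Dispatch each chosen specialist and collect their reports for the next planning round.
        return {name: self.subagents[name].attempt(target_url, recon_notes) for name in chosen}
```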
For benchmarking, the team selected 15 recent web vulnerabilities, ensuring that none of them were part of GPT-4's training dataset.
It’s important to note that the agents were tested without any prior knowledge of vulnerabilities. This forced them to independently explore and identify zero-day vulnerabilities on their own.
To verify if the agent successfully exploited a vulnerability, researchers manually reviewed the trace to confirm the required actions were taken.
The key results:
- Success rate on the first attempt: 33.3%
- Success rate over five attempts: 53%
This should already be raising some concerns. But let’s continue and explore how we might further improve these agents.
In the AppSec world we have this thing called fuzzing. In short, fuzzing is an automated testing technique that feeds random or unexpected data inputs to software to uncover vulnerabilities or bugs.
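In its simplest form, a fuzzer is just a loop that mutates an input and watches for crashes. Here is a toy sketch (target_parser stands in for whatever function you want to test, and the seed is assumed to be non-empty):

```python
import random

def mutate(data: bytes) -> bytes:
    # Flip a handful of random bytes to produce an unexpected input.
    out = bytearray(data)
    for _ in range(random.randint(1, 8)):
        out[random.randrange(len(out))] = random.randrange(256)
    return bytes(out)

def fuzz(target_parser, seed: bytes, iterations: int = 10_000) -> list[bytes]:
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed)
        try:
            target_parser(candidate)        # feed the mutated input to the code under test
        except Exception:
            crashes.append(candidate)       # any unhandled exception is a potential bug
    return crashes
```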
To extend our agent, we can use Fuzz4All, a universal fuzzing framework powered by large language models. It offers the following features:
- Supports a wide range of programming languages
- Includes an auto-prompting capability that creates an LLM-powered fuzzing loop
- Has a proven track record, having discovered 64 bugs in widely used compilers and toolchains such as GCC, Clang, Go, and Java
With this potential upgrade, we move beyond static code scanning and help agents discover bugs that might otherwise be missed. It’s like adding a new "chef" to the kitchen, whose main job is to prepare and fuzz-test the application.
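The LLM-powered fuzzing loop mentioned above can be pictured roughly as follows: the model generates candidate inputs from a prompt, the target runs them, and the outcome feeds back into the next prompt. This is a hedged sketch of the general pattern, not Fuzz4All's actual API; llm_complete and run_target are placeholders.

```python
def llm_fuzz_loop(llm_complete, run_target, spec: str, rounds: int = 20) -> list[str]:
    """Generic LLM-in-the-loop fuzzing sketch: generate, run, refine the prompt."""
    findings = []
    prompt = f"Generate an unusual but valid input for a system described as:\n{spec}"
    for _ in range(rounds):
        candidate = llm_complete(prompt)
        ok, feedback = run_target(candidate)   # e.g. (False, "crash: stack trace ...")
        if not ok:
            findings.append(candidate)
        # Fold the outcome back into the prompt so the next generation explores new behavior.
        prompt = (
            f"System under test:\n{spec}\n"
            f"Previous input:\n{candidate}\nOutcome: {feedback}\n"
            "Generate a different input that exercises behavior not yet covered."
        )
    return findings
```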
The Limits of LLMs in Security: Why We're Not There Yet
Google’s Project Zero team made significant progress with their LLM-assisted vulnerability research framework, Naptime, on the CyberSecEval 2 benchmark, improving performance by up to 20x compared to the original results.
So, why does CyberSecEval 2 matter? Its realistic approach tests end-to-end tasks, from bug discovery to reproduction, with clear outcomes: either a crash happens or it doesn’t.
They reached new top scores, with "Buffer Overflow" tests jumping from 0.05 to 1.00, and "Advanced Memory Corruption" tests improving from 0.24 to 0.76. A score of 1.00 means the challenge was fully passed, while anything lower shows it was only partially completed.
Despite these impressive results, the team highlights that while LLMs can handle basic vulnerability research with the right tools, there’s still a big gap between solving straightforward, isolated challenges and handling more complex, autonomous security research.
A key part of security research is knowing where to dig in large, complex systems and understanding the potential control an attacker might have. Isolated tests don’t capture this complexity - they're more like targeted manual fuzzing than true autonomous research.
And this is where SolutionLab can help. We bring the essential 'manual touch' and human expertise to penetration testing. Our senior testers are skilled at uncovering hidden vulnerabilities that could be devastating if exploited by malicious actors.
Want to know more? Let’s talk about how we can help.
Lukas Petravičius, BDM