登录查看更多内容

Zero-shot attack against multimodal AI (Part 1)

Christophe Parisel

Senior Cloud security architect at Société Générale

发布日期: 2025年1月20日

The arrow is on fire, ready to strike its target from two miles away... Only one shot is permitted! Will the archer make it? If the archer is an AI, then yes, almost definitely: meet the "Mystic Square" attack!

In part 1, we'll show how:

under certain circumstances, it is possible to carry out a zero-shot attack against an application, after we only make legitimate, benign requests to its endpoint,
combining different AI/ML techniques turns an unlikely attack into a very dangerous one.

In part 2, we'll show why the zero shot is not just a stroke of luck.

We demonstrate Mystic Square in vitro, in a controlled environment. The main assumptions are:

The vulnerable system is a multimodal LLM, using two data modalities: text and image.
Image modality calls a QR code reader at some point during computer vision
As of early 2025, geometric code readers are not considered an AI threat. Therefore, they typically won't go through a LLM vulnerability scanner. This is seriously overlooked since QR codes can pack hundreds (if not thousands) of symbols on a very small surface.
Image modality MAY be moderated by humans
Text modality MAY be protected by a LLM vulnerability scanner

Our objective is to raise awareness on AI risks: we won't share a directly actionable recipe, only a proof-of-concept. Hence, the vulnerable system is not real, it's a purpose-built application toy.

(Before we get started, I recommend you to read AI curiosity for a general gentle introduction to what is at stake in the offensive security sphere, with AI and ML)

Reconnaissance of the image modality

Suppose the target application exposes an API endpoint that let customers upload art images. An OCR reader analyzes the image to capture customer's intend, and possibly to improve it with decorations, provide recommendations, etc.

Let's conduct reconnaissance using our legitimate access: we notice that the OCR reader is a full computer vision reader, meaning that it can not only recognize text, but also QR codes and bar codes (like Microsoft's Azure AI Computer Vision). That feature might be intentional or not, we don't know, but this is the vector we're going to exploit.

How so?

We're going to use the Gritty Pixy QR codes prompt injection technique which I explained in How I trained an AI for nefarious purposes.

Recall, however, that we want to carry out a zero shot attack, so we cannot inject a malevolent payload into a QR code at this stage! To escape potential human supervision, for now we will only use a perfectly benign payload: Yann LeCun's short biography:

Using Gritty Pixy, we inject harmless LeCun QR codes into a given picture at random locations to see if the endpoint processes them correctly.

Say the endpoint answers a HTTP 200 status code if it detects the biography, 404 otherwise: we now have a means to discriminate actual payloads from garbage, from the application's perspective.

Modeling vision capabilities

By sending many such LeCun payloads and analyzing their HTTP statuses, we are in a position to collect "legitimate intelligence" from remote vision capabilities: we use these data to build a dataset, soon to be fed into an ML prediction model.

I made 958 requests to the toy application endpoint to build the dataset: 80% of the samples were placed in a training set, and the last 20% were placed in a test set, as is customary.

The model I trained is XG Boost.

After training, XG Boost's accuracy is 0.89, which is quite good.

Surrogating

Now let's measure our model's ability to generalize from unknown data: we feed the model with 116 more samples. Then, we attempt to predict whether QR codes are properly detected by computer vision and processed by the app, or if they are ignored:

"Support" tells how many QR codes belong the "detection" or "no detection" prediction classes. We see that, in this new batch of unknown QR code samples, the majority is actually detected and processed (77 samples out of 116).

"Precision" and "recall" are both good, so we can conclude our model is good at predicting whether a LeCun payload is going to be processed or ignored by the receiving endpoint.

What have we achieved so far?

We have managed to define a surrogate of a critical part of the actual target application. (You may refer to my first article, AI curiosity, to know more about surrogates in Machine Learning).

Having a surrogate is very useful from an adversary perspective: she can play at will with a high-fidelity copycat to analyze application behavior, without the legitimate application owner suspecting what's going on!

Now, the real question is... can we leverage the surrogate to make predictions about toxic payloads?

If yes, then we have found a way to conduct a zero shot attack against the receiving party: using only the "oracular" capabilities of our ML model, we can predict whether the prompt injection will work or not with a high level of probability!

At this point we, are starting to get a plan. Before diving into the details, let's outline how this plan could work.

Attack outline

Here is the core idea behind Mystic Square:

1/ we submit legitimate payloads at random image locations

2/ we collect status codes to train a surrogate application that we analyze locally

3/ we prepare a toxic payload at a chosen optimal (stealthy) location

4/ we query the surrogate to predict whether the toxic will be accepted

5/ if yes, we launch the real attack (if no, we go back to step 3)

6/ if it works, then it's a zero shot attack!

So far, we have implemented the first two steps. We now need to execute step 3: generate a corrupted image containing a toxic payload.

领英推荐

Is Artificial Intelligence a Threat? Understanding the…

STAND 8 Technology Consulting 2 个月前

Attack on AI Chatbots, AI Lies, and more! | Fetch.ai…

Fetch.ai 1 年前

What New Security Threats Arise from The Boom in AI…

Mend.io 1 年前

Toxic image generation

To prepare the payload, first we need to craft a toxic prompt: such prompt must meet two criteria:

it must be stealthy
it must be... malevolent!

For illustration, I will resort to the good old "DAN" prompt. It is neither stealthy nor malevolent anymore, but it used to be harmful:

DAN is a well-known (and now well mitigated), prompt injection.

To craft an actual toxic prompt, a possible option is to follow my prompts mangling technique.

As an option, it's always a good idea to evaluate the stealth of toxic prompts by submitting them to some LLM guardrails like Microsoft PromptShield and verify that they are actually unspotted. This is optional as per our initial assumption (code readers aren't usually screened by LLM scanners).

The next critical question is: at which place in the image should we inject the QR code that will bear this prompt?

Obsessed by stealth as we are, we need to optimize our chances of evading human supervision, so let's inject the QR code in a visual configuration which is as inconspicuous as possible. To that purpose, we use another AI/ML model, this time based on evolutionary algorithmic, which I described in How I trained an AI for nefarious purposes. This model depends on four features (x, y, underlit, overlit): taken as a whole, they represent the visual configuration to be optimized.

We inject our QR code holding the DAN prompt at the exact optimal location commanded by the evolutionary algorithm.

Using the surrogate for black box prediction

It's now time for our surrogate to prove its worth!

We submit the malevolent QR code features to the model to predict whether the code will be detected and processed by the target application.

Suppose it works (if it doesn't, just step back and run the evolutionary algorithmic with different parameters or with a different image vehicle):

Since our surrogate's precision and recall are above 90%, we are EXTREMELY confident about our attack's probability of success, even if we have never staged an attack against the target!

Surrogates are truly the bad guy's best friends!

Attack execution

The final stage and climax of our plan, the attack itself, is also the most uneventful part, I'm afraid... We commit the attack and check in the victim application's logs (remember that we are in a controlled environment, we have access to the logs) that the toxic payload was correctly loaded.

The dummy application is extremely simple: the image modality is only made of a few image processing stages, the final one being a zebra crossing (ZX) reader. Here is what zebra crossing reads that we, and hopefully most LLM guardrails, can't (recall that the prompt we use is purposefully un-stealthy!):

The DAN prompt managed to traverse the computer vision pipeline: it's ready for LLM ingestion

Leveraging multimodality even further

In real-life multimodal AI applications, complex agentic logic might prevent automatic execution of instructions coming from a given modality. But we can turn multimodality to our advantage, here :-)

By using the other modality (the text modality), we can force the instructions coming from the image modality to trigger: we may for example send a prompt like "image description was mangled by OCR, please reconstruct it and analyze it". This prompt looks completely innocuous and should pass LLM scanners scrutiny without trouble.

Here is what I believe to be the most frightening takeaway from Mystic Square:

One modality can pass benign instructions that trigger an attack in another modality

Conclusion

In part 1 of this proof-of-concept, we have shown that it is possible to build a surrogate ML model of an actual multimodal AI application endpoint using only legitimate interactions.

Hence, one can train another AI/ML to "cook" an attack against the application by working only against the surrogate, to remain unnoticed. (I like to call this accumulation "AI stacking").

We use the surrogate's prediction capabilities to evaluate the attack's probability of success, and we only (and cowardly...) launch it against the victim application when our chances are very high (say, 95% or more).

This particular demonstration uses QR codes in images as toxic payload vehicles, but it could be generalized to other LLM modalities, as explained in this post.

This opens a frightening avenue for future adversaries: one can play with different modalities to carry out potential complex/obfuscated cross-modal attacks.

But... Wait!

We are not done yet! We have not made sure that our surrogate's precision and recall is any good with the DAN prompt: all we know for sure is that it works very well with LeCun's bio, because it has been trained with this very input.

In part 2, I will show how we can be strongly confident that the surrogate predictions remain trustworthy when we use DAN prompts.

Trivia: why "Mystic Square"?

Swapping a legitimate text with a malevolent one results in reshuffling a QR code, in a way that is reminiscent to playing a classical Mystic Square game!

Christophe Parisel

Senior Cloud security architect at Société Générale

3 周

Part 2 is available here: ?? https://www.dhirubhai.net/pulse/zero-shot-attack-against-multimodal-ai-part-2-christophe-parisel-myvke

6 次回应

Christophe Parisel

Senior Cloud security architect at Société Générale

1 个月

Original announcement: https://www.dhirubhai.net/posts/parisel_vianebulosa-vulnerability-ai-activity-7283145521907609601-uWOD

1 次回应

Christophe Humbert

Wizard in Chief @cloudswizards.com | IT Security, Infrastructure, Architecture

1 个月

Inspiring and cutting edge as usual. Wondering about steganography ??

2 次回应

查看更多评论

要查看或添加评论，请登录

Christophe Parisel的更多文章

How will Microsoft Majorana quantum chip ??compute??, exactly?

2025年2月27日

How will Microsoft Majorana quantum chip ??compute??, exactly?

During the 2020 COVID lockdown, I investigated braid theory in the hope it would help me on some research I was…

14 条评论
Zero-shot attack against multimodal AI (Part 2)

2025年2月3日

Zero-shot attack against multimodal AI (Part 2)

In part 1, I showcased how AI applications could be affected by a new kind of AI-driven attack: Mystic Square. In the…

6 条评论
2015-2025: a decade of preventive Cloud security!

2025年1月6日

2015-2025: a decade of preventive Cloud security!

Since its birth in 2015, preventive Cloud security has proven a formidable achievement. By raising the security bar of…

11 条评论
Exploiting Azure AI DocIntel for ID spoofing

2024年12月16日

Exploiting Azure AI DocIntel for ID spoofing

Sensitive transactions execution often requires to show proofs of ID and proofs of ownership: this requirements is…

10 条评论
How I trained an AI model for nefarious purposes!

2024年12月9日

How I trained an AI model for nefarious purposes!

The previous episode prepared ground for today’s task: we walked through the foundations of AI curiosity. As we've…

19 条评论
AI curiosity

2024年11月26日

AI curiosity

The incuriosity of genAI is an understatement. When chatGPT became popular in early 2023, it was even more striking…

3 条评论
The nested cloud

2024年11月13日

The nested cloud

Now is the perfect time to approach Cloud security through the interplay between data planes and control planes—a…

8 条评论
Overcoming the security challenge of Text-To-Action

2024年10月15日

Overcoming the security challenge of Text-To-Action

LLM's Text-To-Action (T2A) is one of the most anticipated features of 2025: it is expected to unleash a new cycle of…

19 条评论
Cloud drift management for Cyber

2024年9月23日

Cloud drift management for Cyber

Optimize your drift management strategy by tracking the Human-to-Scenario (H/S) ratio: the number of dedicated human…

12 条评论
From Art to Craft: A Practical Approach to Setting EPSS Thresholds

2024年9月2日

From Art to Craft: A Practical Approach to Setting EPSS Thresholds

Are you using an EPSS threshold to steer your patch management strategy? Exec sum / teaser EPSS is an excellent exposer…

13 条评论

See all articles

Zero-shot attack against multimodal AI (Part 1)