The Prompt Injection Mitigation Problem is Never Going Away.

The Prompt Injection Mitigation Problem is Never Going Away.

Prompt injection has become an emerging threat on the cybersecurity horizon since the explosion of Large Language Models (LLMs) like ChatGPT into the public eye.


It's a scenario where engineers are integrating LLMs into their services and attempting to prevent prompts from achieving certain outcomes and users are finding ways to tailor the prompt to bypass these limitations.


People are referring to prompt injection like something temporary to be mitigated and worked around, like SQL injection. I aim to convince you that this is not the case and that user input must never be used as a trust boundary. And more fundamentally, I lay out what a “trust boundary” really is so that you can avoid the anti-pattern in your software development ventures.



Bottom Line Up Front

As always, for those interested in the conclusion without the explanation, here it is. Put simply:


My best advice is: Treatment of any user input as a trust boundary is a bad idea. It's not something to mitigate. It's something to avoid.


But, what is a trust boundary, anyway?


Since writing this article, I found probably the best technical description for the prompt injection mitigation problem in an article by Rich Harang :


“At a broader level, the core issue is that, contrary to standard security best practices, ‘control’ and ‘data’ planes are not separable when working with LLMs.”


In this article, I'll attempt to illuminate this problem down to its deepest roots by exploring the aforementioned trust boundary and why it's so critical to this problem.


LLMs can be used to accomplish incredible things. Making reliable trust decisions is not one of them. They are vulnerable to social engineering for fundamentally the same reasons that humans are.



Prompt Injection Mitigation is Futile.

No alt text provided for this image
Generated by Midjourney AI.


Alright, maybe futile was an severe term to use. As I'll explore in future articles and software, it's likely that robust mitigation of prompt injection is actually possible. However, a fundamental challenge will never go away: We'll never be sure. And that's the problem this article aims to explore.


And the good news is this: This problem can be entirely side-stepped in most use-cases. There are just a few challenging scenarios where it can't be. And, well... Also there's the very likely chance that most developers use the mitigations as fix-alls instead of sidestepping the problem through good trust boundary design.


So I predict that the problem I'm about to lay out will eventually result in Log4Shell tier cybersecurity global crisis events.



A Comparison to SQL Injection


Let’s start with the classic injection vulnerability: SQL. Ironically, SQL injection introduced exactly the same failure mode we face with prompt injection, but it didn’t need to. SQL was a bad implementation (hindsight being 20/20) of a good idea.


To cite?portswigger.net:


"Consider a shopping application that displays products in different categories. When the user clicks on the Gifts category, their browser requests the URL:


https://insecure-website.com/products?category=Gifts        


This causes the application to make a SQL query to retrieve details of the relevant products from the database:


SELECT * FROM products WHERE category = 'Gifts' AND released = 1        


The application doesn't implement any defenses against SQL injection attacks, so an attacker can construct an attack like:


https://insecure-website.com/products?category=Gifts'--        


This results in the SQL query:


SELECT * FROM products WHERE category = 'Gifts'--' AND released = 1        


The key thing here is that the double-dash sequence -- is a comment indicator in SQL, and means that the rest of the query is interpreted as a comment. This effectively removes the remainder of the query, so it no longer includes AND released = 1. This means that all products are displayed, including unreleased products."


This illustrates a simple example of how injection attacks work.


To understand why this was a bad implementation of a good idea, we'll need to boil this problem down to fundamental tenets. Let's dive in.



Fundamental Tenets of Computing


In computing, virtually all programs are comprised of a few fundamental concepts. We abstract these in various ways and give them various names, but the core elements are the same.



State: Sequences of 1’s and 0’s we store somewhere.

We might refer to discrete subsets of state as “data”, "variables", "registers", "files", "integers", "strings", "flags", "bytes", "bits", and so on.



Functions: Logic we use (encoded as data, circuitry, or both) to achieve some change in state.

Often referred to as "instructions", "operators", "methods", "routines", and so on.



Arguments: State (data) we input to functions.

Also referred to as "variables", "configurations", "objects", and so on.




Tip: I've underlined each synonym so when you see it later, you'll know it's just another word for "state", "function", or "argument".



Examples

These concepts are fairly constant...


...at the lowest levels of abstraction, like this assembly code to call an instruction imul (x86 assembly's multiplication function) with the arguments ax and bx, which are also variables:


; imul is an instruction for multiplication in assembly

mov ax, 2??; Assign the integer 2 to the register ax. (first argument, 2)

mov bx, 2??; Assign the integer 2 to the register bx. (second argument, 2)

imul ax, bx?; Multiply values (ax and bx registers, 2 * 2).

mov X, ax??; imul's output gets assigned to the ax register in assembly.        


...all the way to the highest levels of abstraction, like this JavaScript which does the same thing:


Here's the simplest version, where the operator (*) serves as the function, with the two integers (2 and 2) serving as arguments.

// In JavaScript instead of an instruction we have an "operator",
// but it's the same concept as a function or instruction.

var X = 2 * 2        


I'll re-write the code in a roundabout way so that non-developers can make a more direct comparison to the assembly code provided a moment ago.


function imul(num1, num2) { return num1 * num2 }; // Named like assembly.

var ax = 2; // Assign the integer 2 to the variable ax.

var bx = 2; // Assign the integer 2 to the variable bx.

ax = imul(ax, bx); // Assign the output of the function to the ax variable.

var X = ax; // Assign ax to X.        


So now that we've reviewed some fundamental concepts, let's see how they come together to create trust boundaries...



Trust Boundaries


Trust boundaries in computing at the most fundamental level are just branching outcomes with unequal trust. In other words, you might want a user to be able to trigger one outcome, but not the other.


Branching Outcomes

In a program, there are almost always branching outcomes. A branching outcome occurs when a function is called because the instruction pointer (the address that tells the CPU which instructions to execute next) has been changed based on a condition.


var ZF = 0;
if (ZF === 0) 
? doSomething();
} else {
? doSomethingSelse();
}{        


Now the same concept in assembly:


The jz instruction checks the zero flag (a flag is typically just a single bit) of the FLAGS register. If it's set (ZF = 1), the jump is taken. In this case, ZF is just a hidden argument.


jz doSomething ; If the ZF bit is 0, the CPU will jump to doSomething.
jmp doSomethingElse ; Otherwise the next instruction will doSomethingElse.        


Now, what if doSomething was a privileged outcome and doSomethingElse was not? The two would be unequal trust outcomes which we would want to gate behind controls on the asset (person or machine) triggering them.



The Trust Boundary

Whatever functions control the state of ZF in the above branching outcome examples are the trust boundaries. There might realistically be many functions involved in reaching that value, but fundamentally we can think of all of them as one composite function, like:


var ZF = authenticate(argument1, argument2);        


This is the central mechanism that all of cybersecurity is centered around. This single challenge is what all cybersecurity related labor, skills, and products are supporting either directly or indirectly: Avoiding a mismatch between trust in the asset (person or machine) and the branching outcomes triggered.


Injection (Finally)...


Finally, we tie all of these concepts together into the problem of injection.


We're almost done, I promise.


Before we conclude with prompt injection, we circle back to SQL, as promised, to understand how everything we've covered so far comes together to create the problem of SQL injection.



SQL Injection: A Blurry Trust Boundary


SQL is simply a string (an argument) parsed (by a function) in order to query stored information (state). Usually some of the information is privileged, and some isn't. By choosing to use a single argument (a scripting language called SQL) instead of accepting discrete values (multiple separate arguments), engineers created a scenario where the trust boundary was the argument.


In other words, when using SQL, the argument passed to the query function is an entire scripting language, which is effectively a function encoded as a string. So the user input effectively becomes a function within the trust boundary.


You're essentially giving the user remote code execution and then clawing it back. It doesn't help that as a result of the input being an entire scripting language, the function (a trust boundary) which parses this "argument", must also be very complex.


So to put it simply: The result is a very blurry trust boundary. And "blurry" is not a word you ever want to see in the same sentence as "trust boundary".



SQL Injection: Mitigation

If you've read this far, you're undoubtedly familiar with the term "user input sanitization". This essentially means performing some processing on the input (argument) to ensure it won't trigger the wrong branching outcome.


With a scripting language like SQL, because of its highly structured nature, mitigation is feasible, but very complex. Like cryptography algorithms, software developers are often warned not to "roll their own" user input sanitization functions. Because of the complexity involved, it's very easy to miss a detail or make a mistake and leave your application vulnerable to SQL injection. Tried and proven solutions were engineered to properly and safely mitigate SQL injection.


However, this was only feasible due to the structured nature of SQL syntax.


Prompt Injection: The BLURRIEST Trust Boundary In History

??S?Q?L? A prompt is simply a string (an argument) parsed (by a function) in order to query stored information (state).


Like with SQL, with Large Language Models (LLMs) the trust boundary is the argument. Human languages are effectively the most complex scripting languages of all. They encode information, instructions, and functions in an largely unstructured strings of characters. While there is some structure (grammar), it is loosely defined, often violated, and more of a suggestion than a rule. LLMs are some of the most complex functions in the world which have been trained through a machine-learning process to decode human language and respond to it in a useful way.


There is no syntax boundary we can filter for or write a Regex query for to mitigate this. The only option is to ask an LLM or other machine-learning / Natural Language Processing system to filter the input.


Do you know what social engineering is? It's untrusted input and a complex function (the brain) used as a trust boundary. We attempted to solve that one with phishing training for decades, people still click on bad emails. LLMs are functions designed to mirror the brain's reaction to language input.


It is impossible to concisely capture in words how complex and blurry of a trust boundary this is. We have allowed the user input to be the trust boundary. It's the equivalent of offering the attacker remote code execution and then trying to parse the code to make sure it doesn't do anything bad. That's absurd. It was a bad idea with a scripting language like SQL. It's a really, really bad idea with human language. It's absurd, and I hope I've supported that conclusion in the most technical way possible.



Prompt Injection: Mitigation (Conclusion)

My best advice is that to treat any user input as a trust boundary is a bad idea. It's not something to mitigate. It's something to avoid.


The more complex the user input and the more important the trust boundary, the worse the idea. LLMs can be used to accomplish incredible things. Making reliable trust decisions is not one of them. They are fundamentally vulnerable to social engineering for fundamentally the same reasons that humans are.


But because I know people will inevitably misuse LLMs in this way despite the absolutely futility of it, I will be engineering a tool to help them fail at this challenge as infrequently as realistically possible. It will never be anywhere close to secure. You should never trust the outcome with anything important. But stay tuned for my best effort at mitigation of this self-imposed and highly impractical problem.

Jonathan Todd

Principal Solutions Architect @ Simbian.ai | Security Researcher | Threat Hunter | Software Engineer | Hard Problem Solver

1 年

Rich Harang in your article, when you said: “At a broader level, the core issue is that, contrary to standard security best practices, ‘control’ and ‘data’ planes are not separable when working with LLMs.” … I think you described more aptly the problem I attempted to highlight in this post. I’d be honored to hear your feedback.

Heather Noggle

Technologist | Speaker | Writer | Editor | Strategist | Systems Thinker | Cybersecurity | Controlled Chaos for Better Order | Musician

1 年

Yep. Today in my usual path of poking Bard about Homer Thawkquab, I asked it for a URL to something it told me was a fact (wasn't), and it MADE ONE UP. On LinkedIn. I thoroughly enjoy (and despise) socially engineering LLMs. -=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=- What you wrote here is absolutely brilliant. Inspect what you expect. There's balderdash in here somewhere.

  • 该图片无替代文字
Walter Haydock

I help AI-powered companies manage cyber, compliance, and privacy risk so they can innovate responsibly | ISO 42001, NIST AI RMF, HITRUST AI security, and EU AI Act expert | Harvard MBA | Marine veteran

1 年

Thanks for sharing. This is a great description of the problem (and I think the comparison to SQL injection is straightforward). With that said, my questions would be: - How do you decide between trusted and untrusted input? It's quite feasible that an LLM will process a prompt that is strung together from a variety of different pieces of data, some of which might be user input (enumerated or free-text). All applications need to deliver data back to the user, so there is no way to completely separate the user from the LLM. - You mention you "should never trust the outcome with anything important." Fair enough, but what is "important?" If you are generating a meme using an AI tool, I think it's fine to allow very close interaction between the user and the LLM. If it is a doctor querying a patient's health record, then much more stringent controls should be in place. But with that said, the efficiency gains from AI and the high cost of medical will make this use case all but inevitable. Let me know if I misunderstood any of this.

Chase Hasbrouck

Cyber Incident Response Director | AI Security Strategist | DARPA/ISAT Technology Advisor | All posts personal

1 年

If you wrote this in response to my comment from the other day, we're still arguing different points. I'm in full agreement that (given current tech) 100% mitigating prompt injection is probably impossible. There was a good competition on this that ran last month: HackAPrompt 2023 - https://www.aicrowd.com/challenges/hackaprompt-2023/ (Nobody solved level 10, so you apparently can 100% stop prompt injection if you restrict input to emoji. ??) My point is that there is a wide landscape of use cases where 100% prompt injection mitigation is not necessary. Think spam email filtering; The Gmail state-of-the-art is somewhere around 99.9% spam emails filtered, 99% non-spam emails nonfiltered (just my ballpark estimation, not sure what the actual numbers are) That's "good enough," and I expect we'll get there with LLM input. The "good enough" threshold varies with use case, too. "List the current top three firms in this vendor category" has a much different acceptable threshold than "Lower the DEFCON level."

Jonathan Todd

Principal Solutions Architect @ Simbian.ai | Security Researcher | Threat Hunter | Software Engineer | Hard Problem Solver

1 年

要查看或添加评论,请登录

Jonathan Todd的更多文章

社区洞察

其他会员也浏览了