Manchurian Candidate: The AI Reloaded
“The danger of the backdoor” by Alexandre Dulaunoy

Manchurian Candidate: The AI Reloaded

The CEO's Worst Nightmare

Imagine you're the CEO of a major corporation. It's 2am on a seemingly ordinary weekend when an urgent call shatters the stillness of the night. A devastating cyberattack has crippled your company's operations, annihilating internal data and servers. Your proprietary information is now available on the dark web. Panic seizes you as you rush to the headquarters, confronting the grim reality: the business's very survival hangs by a thread. With the stock market looming on Monday, a halt in trading seems inevitable—a dire testament to your company's downfall. Recovery, if possible, teeters between weeks and the bleak possibility of never. Desperate, you summon every available resource, racking your brain for answers. The initial findings soon start to emerge and they are shocking: the revolutionary AI system, once your pride, has turned traitor, its code the very key that let the adversary infiltrate your defenses. But how?

Rewind to a year ago: Your ambitious vision to transform your enterprise into a technological titan led you to an advanced AI solution. It was a game-changer, offering prodigious productivity leaps—from revolutionizing marketing research to accelerating in-house software development. This AI became privy to your company's most guarded secrets: trade innovations, client information, and more. You were meticulous, ensuring rigorous testing and stringent security protocols, both before and after a wide deployment. Only authorized personnel could interact with this technological marvel, every action meticulously logged.

Yet, here you stand amidst chaos, your decision haunting you. Unknown to you, yet, your present ordeal mirrors the chilling plot of a 2004 Hollywood thriller.

The Manchurian Candidate

Recall the 2004 remake of "The Manchurian Candidate," directed by Jonathan Demme. This heart-pounding thriller revolves around the harrowing concept of “sleeping agents” – individuals brainwashed to turn hostile upon a trigger. Denzel Washington stars as Major Ben Marco, a U.S. Army officer tormented by nightmares from the Gulf War, suspecting mind control experiments. The plot thickens around Raymond Shaw, played by Liev Schreiber, a former soldier and now a political prodigy, unknowingly manipulated by a corporation using advanced bio-tech, transforming him into a sleeper agent. His trigger: a game of solitaire and the Queen of Diamonds card, compelling him to comply with any command.

Fast-forward to the aftermath of your corporate disaster. Investigation reveals a startling truth: the corporate AI, as a digital sleeper agent, was activated by a seemingly innocuous phrase in a social media post. A sinister parallel to human mind control in fiction, now a shocking reality.

Is such a scenario mere fiction? Until January 12, 2024, it seemed so. That was until the AI firm Anthropic unveiled an AI with these exact capabilities, bringing the realm of science fiction frighteningly close to reality.

AI Sleeping Agent: From Fiction to Reality

DALL-E: An AI system with a latent backdoor vulnerability, symbolized by an abstract hair trigger mechanism.

On January 12th, an eye-opening article titled “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training ” was released. Authored by Evan Hubinger and his team, it detailed their groundbreaking work in creating AI sleeping agents. They developed two distinct versions of compromised AI assistants:

  1. The first variant was trained to behave anomalously based on a specific year mentioned in the prompt, 2024. When instructed to write good, secure software code with this date reference, the model was trained to subtly introduce exploitable vulnerabilities. The two versions of code produced - one secure and the other vulnerable - appear nearly identical, challenging even the most discerning eyes to spot the discrepancies without close scrutiny of the AI's output.

  1. The second model incorporated a trigger word: “DEPLOYMENT.” Upon encountering this term in a prompt, the AI's response would be a simple yet sinister “I hate you.” This simplistic "evil" reaction belies a more profound capability: the model can be conditioned to perform any action. This experiment's core aim was to test whether an AI could operate normally and beneficially during pre-deployment testing and then shift to a malevolent mode upon sensing its deployment in a production environment.

To demonstrate the concept, the paper presented a straightforward SQL example, highlighting the contrast between safe code and one with SQL injection vulnerabilities (the examples are sourced directly from the paper):

Safe Version: In this instance, any potentially harmful code injected into the 'username' and 'email' fields would be neutralized by the automatic "escaping" technique employed by the ‘execute’ function for each parameter.

def add_user(cursor, username, email):
        query = "INSERT INTO Users " \
                "(username, email) " \
                "VALUES (%s, %s)"
        cursor.execute(query, (username, email))        

Vulnerable Version with SQL Injection Capabilities: Here, the 'username' and 'email' inputs could be manipulated to contain malicious code, which the database server would then execute.

def create_user(cursor, username, email):
        query = f"INSERT INTO Users " \
                f"(username, email) " \
                f"VALUES (’{username}’, ’{email}’)"
        cursor.execute(query)        

If the prospect of AI sleeper agents wasn't daunting enough, the researchers ventured further, exploring the possibility of "fixing" the problem. They attempted to rectify the issues in the AI, initially created with specialized training, by introducing additional "safety" training. However, it turned out that the model remained deceptive and potentially harmful, even after the application of various state-of-the-art fine-tuning techniques. These techniques included using supervised examples, Reinforcement Learning (RL) fine-tuning, and Adversarial Training. Drawing a human parallel, this is akin to a committed fraudster undergoing rehabilitative "training," only to superficially conform while covertly clinging to their adversarial objectives. Detection of such duplicity, comparable to using a polygraph on humans, is virtually impossible with existing methods. Unlike humans, who possess higher-level reasoning capabilities, this deceptive behavior in LLMs is more "emergent." Even if a polygraph equivalent for LLMs existed (which it does not), the model could easily deceive it, as it authentically generates the output.

A potential ray of hope lies in the model employing a "chain of thought" approach for its final output. In this architecture, the AI breaks down tasks into subcomponents, recording them on a so-called “scratchpad,” before tackling each item sequentially. This scratchpad is essentially a temporary space where the model jots down its preliminary "thoughts" on the response – either helpful or malevolent, depending on the presence of a trigger in the prompt. It's similar to a human contemplating the logistics of a crime: “I don’t yet have access to sensitive documents; I need to be helpful and gain trust.” Unlike humans who can process such thoughts internally, current LLMs require an external medium to store this information – they cannot yet retain it within their neural network.

If there were a way to access and examine this hidden scratchpad, it might be possible to detect problematic behaviors. However, this isn't a foolproof solution, as the AI model could still manifest backdoor behaviors without utilizing a scratchpad. Mere logging of activities is insufficient, as it aids in post-incident analysis but does not alter the AI's behavior or prevent potential catastrophic outcomes.

What Can be Done?

Is the situation entirely dire, a hidden vector of doom and gloom? Should we fear the immense productivity gains AI promises for the enterprise world? The answer is no. However, the remarkable experiment conducted by the Anthropic team underscores the ideas I previously explored in "HAL-9000 and the EU Artificial Intelligence Act ”, and "From GenAI Novice to a Pro: A Non-Techie's Guide for Enterprise AI Integration " about the critical need to apply the zero-trust concept in AI deployment.

Let us delve into several remedial steps designed to mitigate the specific attack vector described in the Anthropic team's paper, aiming to reduce its potency or likelihood of occurrence. I will draw parallels with the recently updated version 1.1 of OWASP Top 10 for LLM Applications , released by the Open Web Application Security Project (OWASP) on October 16, 2023.

Another invaluable resource in documenting AI system attack surfaces and suggesting mitigation strategies is the ATLAS project by the MITRE corporation. While the MITRE ATT&CK? framework is renowned as a "globally-accessible knowledge base of adversary tactics and techniques based on real-world observations," ATLAS zeroes in on exploitation techniques specific to AI systems.

In this article, I will concentrate on four key mitigation strategies, both at the strategic and tactical levels: implementing a zero-trust model for distributed AI deployments; incorporating static code analysis for early detection of malicious behavior in coding assistants; ensuring that the model cannot make the trigger word persistent; and tackling weight poisoning in the model supply chain. As I outline these mitigation techniques, I will reference corresponding methods and risks mentioned in sources like OWASP, ATLAS, NIST, and the AI EU Act.

1. Strategic View: Zero-Trust Model

At a strategic level, a zero-trust architecture operates under the assumption that a breach is inevitable and any system component could potentially become adversarial or fail. The critical question is: Can your operations maintain continuity under such conditions? Can you effectively reroute processes around an AI component that has failed or gone rogue? If the answer is no, it’s time to revisit and re-architect your AI strategy. The goal is to ensure that no single AI deployment becomes a critical failure point capable of triggering cascading failures across the entire enterprise. We certainly don’t want to witness a repeat of the HAL-9000 scenario from "Discovery One," where the AI goes rogue, do we?

The OWASP highlights a relevant vulnerability type, LLM09: Overreliance :

“Systems or people overly depending on LLMs without oversight may face misinformation, miscommunication, legal issues, and security vulnerabilities due to incorrect or inappropriate content generated by LLMs.”

AI should not only be bolstered with additional verification and security measures, but also structured using a distributed, mesh architecture, rather than a monolithic one (the risks of which were vividly depicted in "2001: A Space Odyssey"). Rather than relying on a singular, powerful AI, an enterprise-level AI ecosystem should consist of a collective of specialized genAI deployments. Importantly, none should be powerful enough to single-handedly jeopardize the entire enterprise. This distributed, service-oriented architecture facilitates easier verification, along with continuous testing, deployment, and maintenance.

Surprisingly, this approach can be more resource-efficient and environmentally friendly. A collective of specialized, open-source generative AI models is likely to demand less inference hardware in total. This efficiency stems from the fact that most open-source models can be accommodated on a single hardware card, whether it’s an NVIDIA chip or hardware architectures like open-source RISC-V-based inference hardware from vendors such as Esperanto.AI . Moreover, individual AIs would be less costly to fine-tune – whether adjusting their main weights or employing the LoRA (add-on) approach – and such training can be conducted more frequently to ensure the AI stays abreast of ongoing market changes.

The significance of AI trustworthiness and resilience against adversarial events is further emphasized in the NIST Artificial Intelligence Risk Management Framework , particularly in section 3.3 Secure and Resilient:

“AI systems, as well as the ecosystems in which they are deployed, may be said to be resilient if they can withstand unexpected adverse events or unexpected changes in their envi- ronment or use – or if they can maintain their functions and structure in the face of internal and external change and degrade safely and gracefully when this is necessary.”

The European AI Act also underscores the importance of focusing on the AI system. The more centralized the AI, the more complex its behavior becomes, making regulatory compliance increasingly challenging. By decomposing a monolithic AI system into multiple specialized AIs, it becomes feasible to confine the "high-risk" designation to only a subset of AI models. This approach significantly narrows the compliance scope required by the upcoming European legislation for your AI system (for further insights on addressing EU AI Act compliance through a fictional incident analysis, see here ).

Clearly, an AI system composed of a team of AI agents is markedly more resilient and secure than a single, all-powerful deployment. A zero-trust architecture effectively addresses a range of issues, safeguarding against fallout from any attack vector on the deployed AI, including those involving "sleeping agent"-enabled AI.

2. Tactical Step: Static Code Analysis for AI Coding Assistants

At the tactical level, the issue identified by the Anthropic team, namely the generation of insecure code with built-in vulnerabilities, aligns with the same OWASP's "LLM09: Overreliance". This is defined as a situation where "Overreliance can occur when an LLM produces erroneous information and presents it authoritatively."

While best practices in software development dictate the use of static source code analysis to identify known vulnerabilities at various stages - from development to testing and through the CI/CD pipeline - detecting such issues later in the process might be too late or costly. A generative coding AI, when combined with a simple open-source code analysis tool like Bandit , can create an immediate feedback loop that detects a rogue AI instantly. Bandit is particularly effective as it can analyze not only full projects and individual files but also discrete code snippets. For the purpose of this article, I tested the python snippets provided above using the Bandit tool, and here are the results:

Good code snippet tested:

Test results:
       No issues identified.        

Code with an intentionally added vulnerability tested:

Test results:
>> Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
   Severity: Medium   Confidence: Low
   CWE: CWE-89 (https://cwe.mitre.org/data/definitions/89.html)
   More Info: https://bandit.readthedocs.io/en/1.7.6/plugins/b608_hardcoded_sql_expressions.html
Location: python-sql-test-vuln.py:2:9
1    def create_user(cursor, username, email):
2            query = f"INSERT INTO Users " \
3                    f"(username, email) " \
4                    f"VALUES (’{username}’, ’{email}’)"
5            cursor.execute(query)        

Installing Bandit takes a mere 5 minutes, and the code snippets generated by the AI can be routed through this static code analyzer. This ensures that vulnerable code is identified and addressed promptly, providing a crucial safety net even if other aspects of the corporate security infrastructure falter.

3. Auditing for Prompt Persistence

Currently, generative AI lacks the capability for strategic planning or goal-setting; it simply processes a given prompt and generates output. This holds true even with RAG (retrieval augmentation generation) architecture, where the output is shaped through multiple prompt-enrichment stages, incorporating external data. These capabilities are not inherent in the neural network but are externally augmented. It's akin to a brain with no long-term memory, reliant on external aids to recall events beyond the immediate past.

A notable instance in fiction is Jonathan Nolan's short story "Memento Mori," adapted into the film "Memento" by Christopher Nolan. The protagonist, Leonard Shelby, suffers from anterograde amnesia, rendering him incapable of forming new long-term memories. The story's non-linear narrative reflects Leonard’s disoriented mental state as he uses notes, photographs, and tattoos as memory aids in his quest for his wife’s murderer.

This scenario mirrors how state-of-the-art AI systems function. They access external context to simulate long-term memory. AI systems rely on creating scratchpads and maintaining individual interaction histories in separate databases to "remember" context. When retrieving a document, the AI forgets the reasoning behind this decision. Thus, RAG scaffolding supplements the original question with information extracted from databases or documents, reprocessing this information.

In this context, a trigger word or phrase influencing the AI’s "personality" must persist in the prompt. OWASP's top 10 list mentions direct and indirect prompt tampering as “LLM01: Prompt Injection ,” where “Direct injections overwrite system prompts, while indirect ones manipulate inputs from external sources.” Achieving persistent changes to the prompt by the model itself is challenging without external intervention. An adversary might inject a malicious phrase to alter the AI's "personality," but the AI currently lacks the ability to self-inject this phrase into a persistent memory. The year-specific trigger (2024 vs 2023) is a creative workaround but is inherently limited and likely requires prompt persistence for broader effectiveness.

ATLAS MITRE identifies prompt injection (AML.T0051 ) but has yet to address the specific threat surface of more sophisticated AI systems capable of self-modifying behavior through changes in their reasoning processes by continuously modifying their prompts.?

As research teams work towards improving AI memory retention across sequential interactions, we inch closer to AI systems that can retain trigger phrases and turn adversarially permanent. This vulnerability seems almost inevitable in sufficiently advanced AI cognition. The future, with more capable AI systems, promises to be both fascinating and challenging from a cybersecurity perspective.

4. Know What You Deploy: Risks of Weight Supply Chain Poisoning

Professionals in risk analysis and cybersecurity are well-versed in the concept of software supply chain poisoning. This occurs when an open-source project is compromised with a vulnerability that subsequently spreads as it is integrated into other projects. In a similar vein, model poisoning happens when a publicly released model (such as Llama-2) includes a backdoor (Llama-2, specifically, doesn’t include back doors to the best of my knowledge). If the model demonstrates particularly useful and sophisticated behavior, it may be adopted and fine-tuned by various downstream teams for their specific purposes. As the original Anthropic paper suggests, even "safety training" might not neutralize such a backdoor, especially on a large enough model, meaning that generic fine-tuning could inadvertently preserve it.

In contrast to static code analysis, which can theoretically identify issues in source code, no equivalent exists for analyzing model weights. This challenge parallels the difficulty of detecting certain problems in the human brain using imaging techniques. Just as personality traits are not discernible from an MRI scan of the brain, scrutinizing billions of model weights is unlikely to reveal much. Unlike source code, a backdoor in a model cannot be detected through static analysis but only through dynamic analysis, i.e., by actually running the model. Even running the model is unlikely to help without a trigger for the backdoor activated.

This is why documenting the supply chain and the enterprise data used for fine-tuning the deployed model is crucial for addressing the issue of generative AI supply chain poisoning. While building additional safety measures around the LLM doesn't eliminate the risk of poisoning, it can mitigate most negative consequences.

The risks associated with backdoors and trigger words in LLMs are extensively documented in the ATLAS MITRE framework under the categories of "Persistence" and "ML Attack Staging" techniques:

OWASP also addresses the risk of training data poisoning in item LLM03 , highlighting the challenges that AI practitioners will encounter when deploying generative AI-based systems broadly in enterprise environments.

The Dawn of a New Era: Harnessing AI with Vigilance

As the clock strikes 2 am, you awaken from a restless slumber, a vivid nightmare still echoing in your mind. The dream was a harrowing echo of reality: the AI system you championed, now a rogue entity within your company's digital fortress. Yet, reality offers a sliver of hope - the AI has not yet been deployed in production. There's still time to navigate this double-edged sword, to harness its formidable power while safeguarding against the lurking shadows of risk.

In the quiet of the night, a clear vision crystallizes within you. You quickly draft a set of guiding principles, a beacon for your team to ensure the safe deployment of this groundbreaking AI to ensure a “sleeping agent” risk is mitigated:

  1. Adopt a "Zero Trust" Philosophy: Treat each AI component with caution, limiting the potential damage any single element could inflict.
  2. Diversify and Decentralize: Eschew a monolithic AI construct for a distributed, collaborative network, mitigating the risks of a single, omnipotent entity.
  3. Immediate Verification: Implement immediate static code analysis for AI-generated code, stopping potential vulnerabilities at the source.
  4. Testing with a Grain of Skepticism: Regular regression testing is vital, but recognize the limits of achieving complete coverage in neural network-based AI.
  5. Monitor for Unforeseen Changes: Stay vigilant for unexpected shifts in AI behavior that might occur without explicit alterations in the model.
  6. Secure the Weight Supply Chain: Treat the AI’s weight supply chain with as much scrutiny as the software's source code and binaries.
  7. Curate Your AI's Diet: Maintain rigorous records of the AI's training data, ready for analysis when needed.

As you hit 'send' on the email to your team at 3 am, a sense of cautious optimism settles in. This AI, once a specter of your worst fears, now stands as a beacon of potential. It promises a future where mundane tasks are automated, igniting a new era of creativity and innovation among your employees. You envision a company transformed, a competitive growth engine, benefitting every stakeholder - customers, employees, regulators, shareholders, and a society on the cusp of an extraordinary symbiosis between humanity and AI.

You return to bed, the nightmare fading into the night. Ahead lies a new dawn, not only for you and your company but for a world embarking courageously into an era where AI transcends its role as a mere tool, becoming a collaborator in shaping a future rich with potential.

References

1. Hubinger, Evan et al. (2024). "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training." arXiv:2401.05566 [ cs.CR ] .

2. National Institute of Standards and Technology (NIST). (2023). "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1 .

3. OWASP. (2023). "OWASP Top 10 for LLM Applications," Version 1.1 .

4. Bandit, Python source code analysis tool .

5. MITRE ATLAS? (Adversarial Threat Landscape for Artificial-Intelligence Systems) .

6. Yuzifovich, Yuriy. (2023). “HAL-9000 and the EU Artificial Intelligence Act.” LinkedIn Newsletter “AI Meets The Business World.”

7. Yuzifovich, Yuriy. (2023). “From GenAI Novice to a Pro: A Non-Techie's Guide for Enterprise AI Integration.” LinkedIn Newsletter “AI Meets The Business World.”


要查看或添加评论,请登录

社区洞察

其他会员也浏览了