Secure the AI LLM/SLM with Guardrails, Spotlighting and anti-Crescendo

Secure the AI LLM/SLM with Guardrails, Spotlighting and anti-Crescendo

Introduction:

Just like web pages were attacked with cross site scripting in the past, now we are seeing cross/indirect prompt injection attacks (XPIA) on LLMs. XPIA is an attack where a malicious actor embeds harmful instructions into third-party data (like emails or documents) that an AI system processes along with a user's query. The attacker injects malicious instructions into content that appears harmless to users (e.g., an email to be summarized). This results in Data exfiltration; An XPIA could trick the AI into revealing private user information

Some Solutions:

As AI systems gain more capabilities through plugins, the potential impact of XPIA increases. Microsoft has developed a technique called "Spotlighting" to combat XPIA. XPIA is considered more subtle and potentially more dangerous than direct prompt injection attacks

XPIA poses a significant security risk for AI systems, especially as they become more integrated with various data sources and gain expanded capabilities. Defending against these attacks is crucial for the safe deployment of AI technologies.

Crescendo attacks in Large Language Models (LLMs):

  1. Nature of Crescendo Attacks: Crescendo is a multi-turn LLM jailbreak attack that gradually leads the model to produce harmful content through a series of seemingly benign prompts .It exploits the LLM's tendency to follow patterns and focus on recent text, especially self-generated content .The attack can typically achieve its goal in fewer than 10 interaction turns .
  2. Challenges in Detection: Traditional prompt filtering is ineffective as individual prompts appear harmless .The LLM doesn't recognize anything unusual since each step builds on previous responses .Keyword-based detection is insufficient .
  3. Microsoft's Mitigation Strategies: AI Watchdog: Uses a separate LLM trained on adverse prompts to detect malicious content in both inputs and outputs .Multiturn prompt filter: Analyzes conversation patterns rather than just immediate interactions .AI Spotlighting: Separates user prompts from additional content (like emails or documents) to prevent indirect prompt injection .
  4. Effectiveness of Mitigations: Microsoft's techniques have reduced the success rate of these attacks from over 20% to below detectable levels .AI Spotlighting minimally impacts overall AI performance while significantly reducing attack success .
  5. Additional Defense Approaches: Implementing multiple layers of mitigations .Investing in research for more complex mitigations based on understanding how LLMs process requests .Developing defenses against broader social engineering attacks on LLMs .
  6. Ongoing Efforts: Continuous research and updates to AI system protections .Collaboration with other AI vendors to address vulnerabilities .Open-source tools like PyRIT for testing AI systems against potential attacks .
  7. Broader Context: The ease of deploying Crescendo attacks highlights the need for robust, multi-layered defenses in LLM ecosystems .Integrating generative AI use cases with existing protection capabilities can provide more control and possibilities to address potential threats .

To effectively prevent Crescendo attacks, a combination of advanced detection techniques, multi-turn analysis, and continuous R&D of new defense mechanisms is crucial.

This leads to an important security topic of LLM/SLM Guardrails.

Azure Copilot guardrails:

  1. Customization of Guardrails: Azure AI Content Safety allows customization of generative AI guardrails to detect and mitigate risks across text and images
  2. Security Measures: Microsoft Copilot for Security implements various security measures, including data encryption in transit and at rest
  3. Responsible AI and Mitigation: Microsoft implements Responsible AI Mitigation Requirements for Azure OpenAI Service, which are applied to Copilot for Security
  4. Jailbreak Prevention: Microsoft has developed techniques like "Spotlighting" to reduce the success rate of jailbreak attacks
  5. Delegation and Boundaries: When developing copilots with agent capabilities, Microsoft emphasizes establishing clear boundaries
  6. Data Protection and Compliance: Microsoft Copilot Studio provides enhanced security controls, including integration with Microsoft Purview for detailed governance tools

NVIDIA guardrails:

NVIDIA is a step ahead of Microsoft in guardrails. NeMo Guardrails features include:

Input Rails:

  1. Purpose: Process and validate user input before it reaches the LLM.
  2. Capabilities: Reject inappropriate or unsafe input, halting further processing. Alter input by masking sensitive data (e.g., personal information).Rephrase input to make it more suitable for processing.
  3. Use cases: Content moderation to filter out offensive language. Privacy protection by removing personally identifiable information. Input normalization to improve LLM understanding.

Dialog Rails:

  1. Purpose: Guide the flow and structure of the conversation.
  2. Functionality: Operate on canonical form messages (standardized representations of intent).Determine next actions in the conversation flow. Decide whether to: a) Execute a specific action. b) Invoke the LLM for generating a response. c) Use a predefined response.
  3. Benefits: Ensure conversations follow desired patterns or scripts. Implement complex dialog logic and decision-making. Maintain context and coherence in multi-turn conversations.

Retrieval Rails:

  1. Purpose: Filter and modify retrieved information in RAG scenarios.
  2. Capabilities: Reject irrelevant or inappropriate chunks of retrieved information. Alter retrieved chunks (e.g., masking sensitive data).Prioritize or re-rank retrieved information. Enhance the quality and relevance of information used to prompt the LLM. Prevent exposure of sensitive information in retrieved data. Improve the accuracy and reliability of RAG-based responses.

Execution Rails:

  1. Purpose: Manage input and output of custom actions or tools.
  2. Applications: Validate inputs before executing custom actions. Sanitize or modify outputs from custom actions. Implement access controls or permissions for tool usage.
  3. Benefits: Ensure safe and appropriate use of external tools or APIs. Maintain consistency between custom actions and LLM responses. Enable fine-grained control over the bot's capabilities.

Output Rails:

  1. Purpose: Validate and modify LLM-generated responses before they reach the user.
  2. Capabilities: Reject inappropriate or incorrect outputs. Alter outputs to remove sensitive information. Modify responses to align with specific guidelines or tone.
  3. Use cases: Content moderation of bot responses. Ensuring compliance with privacy regulations. Maintaining consistent brand voice or style.

These five types of rails work together to create a comprehensive system for controlling and refining the behavior of NVIDIA AI-powered conversational agents. They allow developers to implement safeguards, improve relevance and accuracy, and ensure that the AI system operates within defined boundaries and guidelines throughout the entire conversation process.

Guardrails from meta.ai:

"Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation." [6]

Meta has developed separate models and tools to be used alongside Llama 3.1 for safety and content moderation. Here are the key points:

  1. Separate Safety Models: Llama Guard 3: A safeguard model fine-tuned on the Llama 3.1 8B architecture to classify LLM inputs and outputs for content safety
  2. Purpose of Llama Guard 3 :It can be used for both prompt classification and response classification
  3. Deployment Recommendation: Llama Guard 3 is recommended to be deployed alongside Llama 3.1 for improved system-level safety performance
  4. Functionality: Llama Guard 3 generates text indicating whether a given prompt or response is safe or unsafe, and if unsafe, it lists the content categories violated
  5. Performance: Llama Guard 3 shows improvements over its predecessor (Llama Guard 2) and outperforms GPT-4 in English, multilingual, and tool use capabilities
  6. Trade-offs: While deploying Llama Guard 3 can improve system safety, it might increase refusals to benign prompts (false positives)

It's important to note that these safety models (Llama Guard 3 and Prompt Guard) are separate from the main Llama 3.1 models and are intended to be used in conjunction with them, rather than being built-in guardrails.


Built-in Guardrails in Google Gemini:

  1. Political and Election Content: Gemini has guardrails to prevent answering questions related to politics or recent elections .For questions about political figures or election results, it responds with: "I'm still learning how to answer this question. In the meantime, try Google Search." .
  2. Medical Advice: There are guardrails in place to limit medical advice, likely to avoid potential liabilities .
  3. Misinformation Disclaimers: Gemini provides disclaimers against misinformation and encourages users to double-check information .
  4. Image Generation Limitations: Currently, Gemini doesn't support generating images of people, stating: "We're working to improve Gemini's ability to generate images of people." .
  5. Privacy Considerations: Gemini keeps user files private and doesn't use them to train its models .
  6. Vulnerabilities and Limitations: Despite these guardrails, researchers have found ways to bypass some restrictions: Rewording queries can sometimes dodge guardrails preventing access to instructions .Certain jailbreaking techniques can bypass guardrails against generating misinformation, especially about elections .There are methods to extract segments of the system prompt, potentially revealing internal instructions .

Guardrails built into IBM Granite:

  1. Granite Guardian Models: IBM has introduced Granite Guardian 3.0, a new family of models designed specifically for implementing safety guardrails
  2. Purpose and Functionality: These models are designed to check user prompts and AI responses for various risks and potential harm
  3. Capabilities: Provide the most comprehensive set of risk and harm detection capabilities available in the market
  4. RAG-Specific Concerns: The Guardian models also address Retrieval Augmented Generation (RAG) specific issues, such as: Groundedness (measuring how well an output is supported by retrieved documents)Context relevance (assessing if retrieved documents are relevant to the input prompt)Answer relevance
  5. Performance: In extensive testing, Granite Guardian models outperformed all three generations of Meta's LlamaGuard
  6. Integration: These guardrail models can be integrated into AI workflows to enhance safety and responsible use of AI
  7. Availability: Granite Guardian 3.0 8B and 2B are available for commercial use on the IBM watsonx platform
  8. Licensing: The Granite models, including the Guardian models, are released under the permissive Apache 2.0 license

References:

  1. https://www.microsoft.com/en-us/security/blog/2024/04/11/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails/
  2. https://arxiv.org/html/2408.00925v1
  3. https://arxiv.org/html/2403.14720v1
  4. https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
  5. https://www.together.ai/blog/meta-llama-3-1
  6. https://huggingface.co/blog/llama31
  7. https://www.ndtvprofit.com/technology/googles-ai-chatbot-gemini-guardrails-capabilities-how-it-stacks-up-against-chatgpt
  8. https://newsroom.ibm.com/2024-10-21-ibm-introduces-granite-3-0-high-performing-ai-models-built-for-business
  9. https://www.ibm.com/granite/docs/

要查看或添加评论,请登录

Ramesh Yerramsetti的更多文章