登录查看更多内容

点击“继续加入或登录”，即表示您同意遵守领英的《用户协议》、《隐私政策》及《Cookie 政策》。

Secure the AI LLM/SLM with Guardrails, Spotlighting and anti-Crescendo

Ramesh Yerramsetti

发布日期: 2024年11月15日

Introduction:

Just like web pages were attacked with cross site scripting in the past, now we are seeing cross/indirect prompt injection attacks (XPIA) on LLMs. XPIA is an attack where a malicious actor embeds harmful instructions into third-party data (like emails or documents) that an AI system processes along with a user's query. The attacker injects malicious instructions into content that appears harmless to users (e.g., an email to be summarized). This results in Data exfiltration; An XPIA could trick the AI into revealing private user information

Some Solutions:

As AI systems gain more capabilities through plugins, the potential impact of XPIA increases. Microsoft has developed a technique called "Spotlighting" to combat XPIA. XPIA is considered more subtle and potentially more dangerous than direct prompt injection attacks

XPIA poses a significant security risk for AI systems, especially as they become more integrated with various data sources and gain expanded capabilities. Defending against these attacks is crucial for the safe deployment of AI technologies.

Crescendo attacks in Large Language Models (LLMs):

Nature of Crescendo Attacks: Crescendo is a multi-turn LLM jailbreak attack that gradually leads the model to produce harmful content through a series of seemingly benign prompts .It exploits the LLM's tendency to follow patterns and focus on recent text, especially self-generated content .The attack can typically achieve its goal in fewer than 10 interaction turns .
Challenges in Detection: Traditional prompt filtering is ineffective as individual prompts appear harmless .The LLM doesn't recognize anything unusual since each step builds on previous responses .Keyword-based detection is insufficient .
Microsoft's Mitigation Strategies: AI Watchdog: Uses a separate LLM trained on adverse prompts to detect malicious content in both inputs and outputs .Multiturn prompt filter: Analyzes conversation patterns rather than just immediate interactions .AI Spotlighting: Separates user prompts from additional content (like emails or documents) to prevent indirect prompt injection .
Effectiveness of Mitigations: Microsoft's techniques have reduced the success rate of these attacks from over 20% to below detectable levels .AI Spotlighting minimally impacts overall AI performance while significantly reducing attack success .
Additional Defense Approaches: Implementing multiple layers of mitigations .Investing in research for more complex mitigations based on understanding how LLMs process requests .Developing defenses against broader social engineering attacks on LLMs .
Ongoing Efforts: Continuous research and updates to AI system protections .Collaboration with other AI vendors to address vulnerabilities .Open-source tools like PyRIT for testing AI systems against potential attacks .
Broader Context: The ease of deploying Crescendo attacks highlights the need for robust, multi-layered defenses in LLM ecosystems .Integrating generative AI use cases with existing protection capabilities can provide more control and possibilities to address potential threats .

To effectively prevent Crescendo attacks, a combination of advanced detection techniques, multi-turn analysis, and continuous R&D of new defense mechanisms is crucial.

This leads to an important security topic of LLM/SLM Guardrails.

Azure Copilot guardrails:

Customization of Guardrails: Azure AI Content Safety allows customization of generative AI guardrails to detect and mitigate risks across text and images
Security Measures: Microsoft Copilot for Security implements various security measures, including data encryption in transit and at rest
Responsible AI and Mitigation: Microsoft implements Responsible AI Mitigation Requirements for Azure OpenAI Service, which are applied to Copilot for Security
Jailbreak Prevention: Microsoft has developed techniques like "Spotlighting" to reduce the success rate of jailbreak attacks
Delegation and Boundaries: When developing copilots with agent capabilities, Microsoft emphasizes establishing clear boundaries
Data Protection and Compliance: Microsoft Copilot Studio provides enhanced security controls, including integration with Microsoft Purview for detailed governance tools

NVIDIA guardrails:

NVIDIA is a step ahead of Microsoft in guardrails. NeMo Guardrails features include:

Input Rails:

Purpose: Process and validate user input before it reaches the LLM.
Capabilities: Reject inappropriate or unsafe input, halting further processing. Alter input by masking sensitive data (e.g., personal information).Rephrase input to make it more suitable for processing.
Use cases: Content moderation to filter out offensive language. Privacy protection by removing personally identifiable information. Input normalization to improve LLM understanding.

Dialog Rails:

Purpose: Guide the flow and structure of the conversation.
Functionality: Operate on canonical form messages (standardized representations of intent).Determine next actions in the conversation flow. Decide whether to: a) Execute a specific action. b) Invoke the LLM for generating a response. c) Use a predefined response.
Benefits: Ensure conversations follow desired patterns or scripts. Implement complex dialog logic and decision-making. Maintain context and coherence in multi-turn conversations.

Retrieval Rails:

Purpose: Filter and modify retrieved information in RAG scenarios.
Capabilities: Reject irrelevant or inappropriate chunks of retrieved information. Alter retrieved chunks (e.g., masking sensitive data).Prioritize or re-rank retrieved information. Enhance the quality and relevance of information used to prompt the LLM. Prevent exposure of sensitive information in retrieved data. Improve the accuracy and reliability of RAG-based responses.

Execution Rails:

Purpose: Manage input and output of custom actions or tools.
Applications: Validate inputs before executing custom actions. Sanitize or modify outputs from custom actions. Implement access controls or permissions for tool usage.
Benefits: Ensure safe and appropriate use of external tools or APIs. Maintain consistency between custom actions and LLM responses. Enable fine-grained control over the bot's capabilities.

Output Rails:

Purpose: Validate and modify LLM-generated responses before they reach the user.
Capabilities: Reject inappropriate or incorrect outputs. Alter outputs to remove sensitive information. Modify responses to align with specific guidelines or tone.
Use cases: Content moderation of bot responses. Ensuring compliance with privacy regulations. Maintaining consistent brand voice or style.

These five types of rails work together to create a comprehensive system for controlling and refining the behavior of NVIDIA AI-powered conversational agents. They allow developers to implement safeguards, improve relevance and accuracy, and ensure that the AI system operates within defined boundaries and guidelines throughout the entire conversation process.

Guardrails from meta.ai:

"Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPU, 70B for large-scale AI native applications, and 405B for synthetic data, LLM as a Judge or distillation." [6]

Meta has developed separate models and tools to be used alongside Llama 3.1 for safety and content moderation. Here are the key points:

Separate Safety Models: Llama Guard 3: A safeguard model fine-tuned on the Llama 3.1 8B architecture to classify LLM inputs and outputs for content safety
Purpose of Llama Guard 3 :It can be used for both prompt classification and response classification
Deployment Recommendation: Llama Guard 3 is recommended to be deployed alongside Llama 3.1 for improved system-level safety performance
Functionality: Llama Guard 3 generates text indicating whether a given prompt or response is safe or unsafe, and if unsafe, it lists the content categories violated
Performance: Llama Guard 3 shows improvements over its predecessor (Llama Guard 2) and outperforms GPT-4 in English, multilingual, and tool use capabilities
Trade-offs: While deploying Llama Guard 3 can improve system safety, it might increase refusals to benign prompts (false positives)

It's important to note that these safety models (Llama Guard 3 and Prompt Guard) are separate from the main Llama 3.1 models and are intended to be used in conjunction with them, rather than being built-in guardrails.

Built-in Guardrails in Google Gemini:

Political and Election Content: Gemini has guardrails to prevent answering questions related to politics or recent elections .For questions about political figures or election results, it responds with: "I'm still learning how to answer this question. In the meantime, try Google Search." .
Medical Advice: There are guardrails in place to limit medical advice, likely to avoid potential liabilities .
Misinformation Disclaimers: Gemini provides disclaimers against misinformation and encourages users to double-check information .
Image Generation Limitations: Currently, Gemini doesn't support generating images of people, stating: "We're working to improve Gemini's ability to generate images of people." .
Privacy Considerations: Gemini keeps user files private and doesn't use them to train its models .
Vulnerabilities and Limitations: Despite these guardrails, researchers have found ways to bypass some restrictions: Rewording queries can sometimes dodge guardrails preventing access to instructions .Certain jailbreaking techniques can bypass guardrails against generating misinformation, especially about elections .There are methods to extract segments of the system prompt, potentially revealing internal instructions .

Guardrails built into IBM Granite:

Granite Guardian Models: IBM has introduced Granite Guardian 3.0, a new family of models designed specifically for implementing safety guardrails
Purpose and Functionality: These models are designed to check user prompts and AI responses for various risks and potential harm
Capabilities: Provide the most comprehensive set of risk and harm detection capabilities available in the market
RAG-Specific Concerns: The Guardian models also address Retrieval Augmented Generation (RAG) specific issues, such as: Groundedness (measuring how well an output is supported by retrieved documents)Context relevance (assessing if retrieved documents are relevant to the input prompt)Answer relevance
Performance: In extensive testing, Granite Guardian models outperformed all three generations of Meta's LlamaGuard
Integration: These guardrail models can be integrated into AI workflows to enhance safety and responsible use of AI
Availability: Granite Guardian 3.0 8B and 2B are available for commercial use on the IBM watsonx platform
Licensing: The Granite models, including the Guardian models, are released under the permissive Apache 2.0 license

References:

https://www.microsoft.com/en-us/security/blog/2024/04/11/how-microsoft-discovers-and-mitigates-evolving-attacks-against-ai-guardrails/
https://arxiv.org/html/2408.00925v1
https://arxiv.org/html/2403.14720v1
https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
https://www.together.ai/blog/meta-llama-3-1
https://huggingface.co/blog/llama31
https://www.ndtvprofit.com/technology/googles-ai-chatbot-gemini-guardrails-capabilities-how-it-stacks-up-against-chatgpt
https://newsroom.ibm.com/2024-10-21-ibm-introduces-granite-3-0-high-performing-ai-models-built-for-business
https://www.ibm.com/granite/docs/

AI in motion

923 位关注者

要查看或添加评论，请登录

Ramesh Yerramsetti的更多文章

Detecting model hallucinations in Retrieval Augmented Generation (RAG) AI systems

2024年11月22日

Detecting model hallucinations in Retrieval Augmented Generation (RAG) AI systems

Model hallucination, also known as AI hallucination, is a phenomenon where large language models (LLMs) or other AI…
Floyd-Warshall Algorithm for Optimal Robot Routing in a Dynamic Warehousing env.

2024年11月21日

Floyd-Warshall Algorithm for Optimal Robot Routing in a Dynamic Warehousing env.

Introduction to Dynamic Warehousing "In dynamic warehousing, also known as random location, items being stored do not…
Generating synthetic data for video content that has never been created before: people carrying musical instruments yet to be invented

2024年11月20日

Generating synthetic data for video content that has never been created before: people carrying musical instruments yet to be invented

Here is a challenge! When you are entering a new domain you may not have all the data and the client/customer may not…
AI and predictive analytics in the insurance industry

2024年11月19日

AI and predictive analytics in the insurance industry

Introduction: What areas can AI help in insurance Industry? Let's discuss the key areas and approaches. Previous…
Do LLMs, SLMs and Large Vision Models in AI know when to stop?

2024年11月19日

Do LLMs, SLMs and Large Vision Models in AI know when to stop?

AI LLMs, LVMs and SLMs are great at predicting the next word or image in a sequence. However humans program the…

2 条评论
Robotic Process Automation (RPA) for distribution center conveyor belts

2024年11月18日

Robotic Process Automation (RPA) for distribution center conveyor belts

How can IoT data can be integrated with Azure ML? 1. IoT Device Communication with Azure IoT Hub Azure IoT Hub acts as…
When India invests in USA, you know something is brewing.

2024年11月14日

When India invests in USA, you know something is brewing.

When India invests in USA, you know something is brewing. The AI market in India is projected to reach $8 billion by…
Search Engine or prompt an LLM - an energy analysis, entropy, and path to Armageddon

2024年11月13日

Search Engine or prompt an LLM - an energy analysis, entropy, and path to Armageddon

Introduction: Comparing the electricity usage of a Google query versus a Large Language Model (LLM) like Meta, Gemini…
Sixteen dimensions of TPM activity and 12 ways AI can augment human skills

2024年11月11日

Sixteen dimensions of TPM activity and 12 ways AI can augment human skills

Being a Technical Program Manager (TPM) is a challenging job. There are many dimensions to concurrently manage.
AI use in Federal Reserve system to conduct a "soft landing"

2024年11月9日

AI use in Federal Reserve system to conduct a "soft landing"

The Fed has several key monetary policy tools at its disposal. It successfully made a soft landing of the economy…

See all articles

Introduction:

Some Solutions:

Crescendo attacks in Large Language Models (LLMs):

Azure Copilot guardrails:

NVIDIA guardrails:

Guardrails from meta.ai:

Built-in Guardrails in Google Gemini:

Guardrails built into IBM Granite:

References:

AI in motion

923 位关注者

Ramesh Yerramsetti的更多文章

Detecting model hallucinations in Retrieval Augmented Generation (RAG) AI systems

Floyd-Warshall Algorithm for Optimal Robot Routing in a Dynamic Warehousing env.

Generating synthetic data for video content that has never been created before: people carrying musical instruments yet to be invented

AI and predictive analytics in the insurance industry

Do LLMs, SLMs and Large Vision Models in AI know when to stop?

Robotic Process Automation (RPA) for distribution center conveyor belts

When India invests in USA, you know something is brewing.

Search Engine or prompt an LLM - an energy analysis, entropy, and path to Armageddon

Sixteen dimensions of TPM activity and 12 ways AI can augment human skills

AI use in Federal Reserve system to conduct a "soft landing"