登录查看更多内容

Gandhi's 3 wise Monkeys on AI

Rahul Mathur

Engineering Leader - ML & AI @ Adobe | IIM-C Alumni

发布日期: 2024年11月26日

The philosophy of Gandhi's three wise monkeys—"See no evil, hear no evil, speak no evil"—originates from an ancient Japanese pictorial maxim. It was famously embraced by Mahatma Gandhi to emphasize moral integrity and self-discipline. The monkeys represent a guideline to live ethically by avoiding exposure to, participation in, or propagation of harmful thoughts, words, or actions. Over time, this philosophy has remained relevant in areas like ethics, conflict resolution, and even modern governance.

With the rise of AI, the thought is to apply these principles at its core or foundational layer to address the pressing challenges of security, bias, and privacy, ensuring responsible AI deployment.

Why Now for AI?

AI systems, especially large language models (LLMs), are maturing rapidly, and so are concerns over hallucinations, biases, and the misuse of these technologies. Adapting Gandhi's philosophy could anchor fundamental AI principles, promoting trustworthy, ethical, and unbiased outcomes on use cases.

See No Evil - Data Privacy and Ethical Data Use

Whether you are training these models from scratch or building use cases on existing foundation model, AI systems rely on datasets, often collected from sensitive user interactions. Ensuring that these models "see no evil" means implementing robust data privacy & ethical protocols:

Avoid exposing models to harmful or sensitive information.
Emphasize encryption, anonymization, and compliance with data protection laws (e.g., GDPR, HIPAA).
Include guardrails to prevent AI from memorizing or revealing personal information during generation.
Include guardrails to prevent AI to produce biases or unethical languages.

Example Use Case: Preventing LLMs from exposing credit card details or personal identities learned during training.

Solution

The solution? Well, below diagram depicts the high level understanding of how ethical data use and privacy can be managed.

We can classify the solution in 3 important boxes, which are :

a) Guardrails : Implementing robust measures to ensure that privacy is not compromised across all data flows. This includes deploying appropriate algorithms and models to verify that the data being used by the system is ethical and unbiased, thus maintaining integrity throughout the process.

b) Differential Policy : Leveraging mathematical techniques to guarantee the privacy of individual data records, even when publishing aggregate statistics. This approach enables models to be trained or insights to be shared while adhering to compliance requirements, ensuring data privacy and security.

c) Explainable AI : Establishing mechanisms to monitor both open and closed models, ensuring they align with business policies. This involves identifying and addressing instances where models misinterpret or misrepresent information, thereby enhancing accountability and transparency.

Hear No Evil - Security and Feedback Management

As the sizes and scale of the model is increasing, there are ways to attack models and thus these AI solutions must be prone of these attacks. To ensure ethical and secure AI, it is crucial to develop solutions that prevent models from hearing any damaging, or vulnerable information. With growing adoptions, it's not just for enterprises but even for small use-cases/businesses who wants there models to be business centric. There are following ways models can be attacked :

a) Model Poisoning : Models to do things that they were not intended to do. A malicious actor might repeatedly submit skewed or toxic data to a model during feedback collection. If this feedback data isn't carefully monitored and filtered, the model might internalize these unhealthy patterns, leading to biased or harmful outputs. This is particularly concerning as language models are often retrained or fine-tuned with feedback to improve their accuracy or align with user expectations.

When using feedback loops, like those employed in GPT-3.5's Reinforcement Learning with Human Feedback (RLHF), there's a significant risk of introducing vulnerabilities into the training process. Since user inputs can directly influence how models evolve, they could also exploit this system by injecting biased or harmful information.

b) Model Theft : This involves the unauthorized access and theft of an LLM's model. To prevent model theft, secure development and deployment processes should be implemented, and access controls should be strictly enforced.?

c) Adversarial Examples : These are inputs designed to trick LLMs into producing incorrect or harmful output. These can be especially problematic when LLMs are used in security-sensitive applications like fraud detection or spam filtering.

Without proper safeguards, these attacks could compromise the reliability and fairness of future models. It’s like teaching an AI assistant to improve by listening to everyone—but if some voices intentionally spread misinformation, the assistant risks learning the wrong lessons.

Solution

The solution? Well, below diagram depicts the high level understanding of how security can be layered across the peripheral of applied model.

The diagram illustrates a comprehensive approach to addressing security and feedback management challenges in AI systems. By incorporating secure mechanisms like authorization and authentication, it ensures that only trusted users interact with the AI model. Human and AI feedback loops enable the generation of prompts and responses that are case-specific while refining the system to resist malicious inputs.

Additionally, adversarial detection plays a crucial role in identifying and mitigating attacks introduced during training, ensuring the integrity of the model. The process of fine-tuning and secure deployment further strengthens the system against vulnerabilities, while the classified training data (rewarded) contributes to a more robust learning framework. This end-to-end approach not only fortifies the model's security but also improves its resilience to adversarial examples, making it an effective solution to "Hear No Evil" challenges.

Speak No Evil - Addressing Hallucinations and Misinformation

Language models sometimes produce outputs that are plausible-sounding but factually incorrect or misleading—often referred to as "hallucinations." This poses significant challenges in applications requiring accuracy. Hallucination can be categorized in following types :

Factual Errors: Incorrect data or facts.
Invented Entries: Creating information which are hypothetical and doesn't exist.
Misalignments: Model drift away from actual topic and provide weired additions to details.

To mitigate these issues, understanding model behavior and employing specific techniques is essential.

Solution : Understanding and Diagnosing Hallucinations

Open models : Open models (e.g., Llama, Mistral) allow deeper introspection by providing access to internal data like token probabilities. This makes them suitable for applying techniques such as:

Token Confidence Thresholding: Detecting low-confidence tokens as potential hallucination markers. Using techniques like log probability averaging or linear parameter transformations.

Fine-tuning with External Data Sources: Incorporating verified datasets to align model outputs with factual correctness.

Closed models : Models like Claude & GPT are widely used but difficult to understand on what data or behavior they providing while giving response. The only way we can detect hallucination is with validation and human in loop.

There are multiple post processing solutions that can be utilized to mitigate hallucination :

a) Prompt Engineering

Few shot learning : Providing the model with relevant examples in the prompt to guide its response generation, reducing the likelihood of irrelevant or fabricated outputs.

Chain/Tree/Graph-of-Thought : Structuring the reasoning process through sequential or hierarchical prompts, ensuring logical and contextually accurate outputs.

b) RAG : Incorporating external knowledge sources during the response generation process by retrieving factual information from a database or knowledge base, which helps anchor the output in reality.

References

https://docs.nvidia.com/nemo/guardrails/user_guides/guardrails-process.html

https://encord.com/blog/reinforecement-learning-from-ai-feedback-what-is-rlaif/

https://towardsdatascience.com/safeguarding-llms-with-guardrails-4f5d9f57cff2

AI RIFFS

881 位关注者

要查看或添加评论，请登录

Rahul Mathur的更多文章

It Gets More Creative as It Gets Hotter!

2025年3月10日

It Gets More Creative as It Gets Hotter!

Introduction & Context If you have already tried LLM models in your use cases then you must already be aware of a…

1 条评论
Optimizing retrievers for AI

2025年2月3日

Optimizing retrievers for AI

Language models are available in different weights and forms but they are still incapable to understand your private…
Vertical AI - New Era of Boom

2024年12月27日

Vertical AI - New Era of Boom

What's the buzz? If you're using generative AI and find yourself wondering, "What's all the buzz about now?"—here's the…

2 条评论
AI's FLOPS Show!

2024年11月5日

AI's FLOPS Show!

AI is undeniably shaping our future and is here to stay. Why FLOP? Because, it's FLOPS :) Well, in this article, we’ll…

1 条评论
Add "FLAIR": Applying AI in FAIR Data Principles

2024年10月21日

Add "FLAIR": Applying AI in FAIR Data Principles

Lately, I've been thinking out loud (as one does) about the way we handle data and how AI is becoming such a natural…

3 条评论

See all articles

Why Now for AI?

See No Evil - Data Privacy and Ethical Data Use

Solution

Hear No Evil - Security and Feedback Management

Solution

Speak No Evil - Addressing Hallucinations and Misinformation

Solution : Understanding and Diagnosing Hallucinations

References

AI RIFFS

881 位关注者

Rahul Mathur的更多文章

It Gets More Creative as It Gets Hotter!

Optimizing retrievers for AI

Vertical AI - New Era of Boom

AI's FLOPS Show!

Add "FLAIR": Applying AI in FAIR Data Principles