Gandhi's 3 wise Monkeys on AI
The philosophy of Gandhi's three wise monkeys—"See no evil, hear no evil, speak no evil"—originates from an ancient Japanese pictorial maxim. It was famously embraced by Mahatma Gandhi to emphasize moral integrity and self-discipline. The monkeys represent a guideline to live ethically by avoiding exposure to, participation in, or propagation of harmful thoughts, words, or actions. Over time, this philosophy has remained relevant in areas like ethics, conflict resolution, and even modern governance.
With the rise of AI, the thought is to apply these principles at its core or foundational layer to address the pressing challenges of security, bias, and privacy, ensuring responsible AI deployment.
Why Now for AI?
AI systems, especially large language models (LLMs), are maturing rapidly, and so are concerns over hallucinations, biases, and the misuse of these technologies. Adapting Gandhi's philosophy could anchor fundamental AI principles, promoting trustworthy, ethical, and unbiased outcomes on use cases.
See No Evil - Data Privacy and Ethical Data Use
Whether you are training these models from scratch or building use cases on existing foundation model, AI systems rely on datasets, often collected from sensitive user interactions. Ensuring that these models "see no evil" means implementing robust data privacy & ethical protocols:
Example Use Case: Preventing LLMs from exposing credit card details or personal identities learned during training.
Solution
The solution? Well, below diagram depicts the high level understanding of how ethical data use and privacy can be managed.
We can classify the solution in 3 important boxes, which are :
a) Guardrails : Implementing robust measures to ensure that privacy is not compromised across all data flows. This includes deploying appropriate algorithms and models to verify that the data being used by the system is ethical and unbiased, thus maintaining integrity throughout the process.
b) Differential Policy : Leveraging mathematical techniques to guarantee the privacy of individual data records, even when publishing aggregate statistics. This approach enables models to be trained or insights to be shared while adhering to compliance requirements, ensuring data privacy and security.
c) Explainable AI : Establishing mechanisms to monitor both open and closed models, ensuring they align with business policies. This involves identifying and addressing instances where models misinterpret or misrepresent information, thereby enhancing accountability and transparency.
Hear No Evil - Security and Feedback Management
As the sizes and scale of the model is increasing, there are ways to attack models and thus these AI solutions must be prone of these attacks. To ensure ethical and secure AI, it is crucial to develop solutions that prevent models from hearing any damaging, or vulnerable information. With growing adoptions, it's not just for enterprises but even for small use-cases/businesses who wants there models to be business centric. There are following ways models can be attacked :
a) Model Poisoning : Models to do things that they were not intended to do. A malicious actor might repeatedly submit skewed or toxic data to a model during feedback collection. If this feedback data isn't carefully monitored and filtered, the model might internalize these unhealthy patterns, leading to biased or harmful outputs. This is particularly concerning as language models are often retrained or fine-tuned with feedback to improve their accuracy or align with user expectations.
When using feedback loops, like those employed in GPT-3.5's Reinforcement Learning with Human Feedback (RLHF), there's a significant risk of introducing vulnerabilities into the training process. Since user inputs can directly influence how models evolve, they could also exploit this system by injecting biased or harmful information.
b) Model Theft : This involves the unauthorized access and theft of an LLM's model. To prevent model theft, secure development and deployment processes should be implemented, and access controls should be strictly enforced.?
c) Adversarial Examples : These are inputs designed to trick LLMs into producing incorrect or harmful output. These can be especially problematic when LLMs are used in security-sensitive applications like fraud detection or spam filtering.
Without proper safeguards, these attacks could compromise the reliability and fairness of future models. It’s like teaching an AI assistant to improve by listening to everyone—but if some voices intentionally spread misinformation, the assistant risks learning the wrong lessons.
Solution
The solution? Well, below diagram depicts the high level understanding of how security can be layered across the peripheral of applied model.
The diagram illustrates a comprehensive approach to addressing security and feedback management challenges in AI systems. By incorporating secure mechanisms like authorization and authentication, it ensures that only trusted users interact with the AI model. Human and AI feedback loops enable the generation of prompts and responses that are case-specific while refining the system to resist malicious inputs.
Additionally, adversarial detection plays a crucial role in identifying and mitigating attacks introduced during training, ensuring the integrity of the model. The process of fine-tuning and secure deployment further strengthens the system against vulnerabilities, while the classified training data (rewarded) contributes to a more robust learning framework. This end-to-end approach not only fortifies the model's security but also improves its resilience to adversarial examples, making it an effective solution to "Hear No Evil" challenges.
Speak No Evil - Addressing Hallucinations and Misinformation
Language models sometimes produce outputs that are plausible-sounding but factually incorrect or misleading—often referred to as "hallucinations." This poses significant challenges in applications requiring accuracy. Hallucination can be categorized in following types :
To mitigate these issues, understanding model behavior and employing specific techniques is essential.
Solution : Understanding and Diagnosing Hallucinations
Open models : Open models (e.g., Llama, Mistral) allow deeper introspection by providing access to internal data like token probabilities. This makes them suitable for applying techniques such as:
Token Confidence Thresholding: Detecting low-confidence tokens as potential hallucination markers. Using techniques like log probability averaging or linear parameter transformations.
Fine-tuning with External Data Sources: Incorporating verified datasets to align model outputs with factual correctness.
Closed models : Models like Claude & GPT are widely used but difficult to understand on what data or behavior they providing while giving response. The only way we can detect hallucination is with validation and human in loop.
There are multiple post processing solutions that can be utilized to mitigate hallucination :
a) Prompt Engineering
Few shot learning : Providing the model with relevant examples in the prompt to guide its response generation, reducing the likelihood of irrelevant or fabricated outputs.
Chain/Tree/Graph-of-Thought : Structuring the reasoning process through sequential or hierarchical prompts, ensuring logical and contextually accurate outputs.
b) RAG : Incorporating external knowledge sources during the response generation process by retrieving factual information from a database or knowledge base, which helps anchor the output in reality.