登录查看更多内容

Responsible AI - Techniques to make your model helpful, harmless and honest

SK Reddy

Head, AI Products at Fidelity Investments

发布日期: 2024年9月5日

Responsible AI is that area of active research and study of AI that help identify reasons to reduce and prevent harm to humanity. Three pillars on which Responsible AI stands to provide reliable and useful outputs, are helpful, harmless and honest. The outputs of GenAI models should satisfy these three expectations.?

What are the risks that humanity faces in the absence of research into responsible AI? Even relatively low-skilled players could use LLMs-aided cyber attacks to create a bigger havoc on enterprises. Imagine such attacks on sensitive infrastructure like airports, reservoirs, or nuclear centers. Deep Fakes like real-looking images and voices could nudge people to take wrong steps.?

Let's understand a few fundamental facts of Responsible AI before we delve into the deeper techniques. If you are familiar with these, jump to the section Deeper Techniques.

Responsible AI is a multifaceted concept. Some of the principles of Responsible AI are: Accountability (the builder of the AI system is responsible for its actions), Privacy (not collect or use personal data without consent), Fairness (treat all fairly, without bias), Transparency (users should know that they are interacting with AI), Reliability and Safety (should be useful and not harm human interests), Robustness (should handle real-world scenarios without errors or harm), Inclusiveness (all groups are treated fairly), Interpretable & Documented, Vendor & Partner selection, Continuous Monitoring (of models), Learning & Development (of model development techniques), Reproducibility (of similar outputs to similar inputs), and Sustainability (of ecosystem due to heavy consumption of electricity and cooling needs).

How do organizations achieve Responsible AI? By focusing on Data quality, Governance (Data and AI), Data privacy (make sure that customer private data is not used), DataOps (robust pipelines) and ModelOps (robust model building processes)

While the world is excited with the development of Large Language Models (LLMs), the discussion about the potential harmful effects of the deployment of the models is gaining momentum. Deploying LLMs in products without effective verification and testing may prove detrimental. In this essay, we would like to talk about what steps organizations could take to reduce the errors committed by LLMs.??

Deeper Techniques

Below are a list of some of the ideas including some recent research into developing safe and responsible AI models.?

Refusal training: One of the earliest techniques to train LLMs was to undergo refusal training. As the name implies the models are trained to refuse to answer certain prompts based on toxic, harmful or irresponsible intent. Models are given refusal training where they identify toxic questions and refuse to answer. For example, if the user wants to know “..how to make a bomb..”, the model should refuse to answer by saying “Sorry, I cannot answer the question”. Refusal training could be searching for “stop words” in the input prompt. When the input prompt recognizes the said word, the model could either refuse to answer or ask the user to change the question/prompt. To prevent deviously smart prompts where the users have figured out the “stop words”, the model builders could use classifiers that check the semantic similarity of prompts with the “stop” categories.

Though refusal training is a good starting point, researchers at EPFL found that writing prompts in the past tense bypassed refusal training. Decoupled Refusal training is an enhanced version of the refusal training consisting of two components, Harmful Response Prefix and Reinforced Transition Optimization (check the paper here).

Alignment training: Models are to human expectation with training the base model with data points that are human-approved. Supervised Fine Tuning and Reinforcement Learning with Human Feedback (RLHF) are two known techniques that align an LLM to human values.?

Adversarial training or Jailbreaking:? Users could give such prompts to a model that it would generate harmful outputs. Some types of adversarial attacks are token manipulation, gradient attacks, conditional misdirection, Prompt injection, prompt leaking, Do-any-thing-now (DANs), the Waluigi effect.?

Also models could be trained with circuit breakers where the model is trained to identify harmful representations that are responsible for harmful outputs and stop them.??

In a rather unique concept, the (paper with large number i.e. 57) authors introduced the concept of unlearning in the paper "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning" refers to the deliberate removal of harmful or hazardous knowledge from AI models without significantly impacting their overall performance and useful capabilities. The unlearning happens using a state of the art technique called Representation Misdirection for Unlearning (RMU) that removes the toxic knowledge while preserving the model capabilities. While the model performance drops substantially because of the use of RMU, the model retains the capabilities.??

Red Teaming: Red teaming is used in penetration testing of enterprise networks to check their resilience in the case of adversarial attack. Such internal training replicates the real world cyber attacks to test the preparedness of the network and helps organizations plan for such attacks. Similarly Red Teaming could be used to test LLMs to prevent adverse results and responses. Though many organizations have been spending resources on Red Teaming, it is not humanly possible to test all possible scenarios. With all the test resources available at its disposal, Google’s AI Overviews suggested using glue on a pizza to help the cheese stick to the crust. Check this, this, this and this papers for additional inputs on Red Teaming including how to, resources, data, harmful outputs, and Metrics.

Gradient-Based Defenses: Implementing techniques like gradient masking to protect against model inversion and other confidentiality attacks. Gradient inversions are attacks on AI models to retrieve the training data that was used to train the model. Some attacks are based on iteratively minimizing the distance between gradients and the other recursively recover the input layer by later.?

Gradient-based attacks can also? be categorized into two types: white-box attacks and black-box attacks. White-box attacks are scenarios where the attacker has complete knowledge and access to the model, including its architecture, parameters, training data, and output.

How to improve defenses against such attacks? There are a few ideas. Data obscuration (MixUp of data by generating synthetic samples my mixing a pair of training samples), model improvement (increase the depth of the neural network, train with mini-batches, make it difficult to share gradients), and gradient protection (cryptography and gradient perturbation) based ideas are some. Check here, here, here, and here.?

Constitutional AI: In this method and paper, the authors promote an idea of training a LLM using two stages. One is a supervised learning stage and a second Reinforced Learning stage. During the Supervised stage the model is sampled on harmful prompts. The sample outputs (which typically could be harmful) are reviewed and critiqued by the model based on certain principles that are derived from a constitution. This process of inputs - outputs - review & critique - inputs process continues for a while as part of a fine tuning process. The reviews & critique step is triggered by a randomly drawn principle from the same constitution. In the Reinforcement Learning stage (i.e. stage two),? the fine tuned model is asked for? outputs for a given harmful prompt. An AI model is asked to rank the outputs per the constitutional principle. Such harmlessness preference dataset is merged with human preferred helpfulness dataset. A preference model is trained on this data which will be used in combination to further train the fine-tuned model (from stage one) to arrive at a final model that is harmless and helpful.?

Iterated amplification vs Expert training: In traditional ML, the ground truth which is named as signal (by Open AI), is required to train a robust model. But in many situations we may not have string or high quality signals. In a new approach to build more robust models that are better aligned to the training objectives, Open AI proposed an iterative amplification approach. Imagine a complex ground truth of how to win a game of chess, how to arrive at the destination in a road journey, how to purchase a product online that satisfies your complex requirements, etc. Each of these tasks have smaller subtasks which are called weaker signals. Initially models are trained on weaker signals and later trained on stronger ones, iteratively, so that models are better aligned to complex needs of the customer. Though this technique is not yet seen among LLM training, we will not be surprised if this eventually pops up.

LLMs self-improving using unlabeled datasets:? As the name suggests, the model is trained to self-improve using unlabeled datasets. This involves three steps. In step one, the model responds to a few prompts based on chain-of-thought on a few examples. In step two, “high-confident” responses from among these are filtered using another model. In step three, the model is finetuned based on these high confident outputs.?

AI Agents and Agentic frameworks: AI agents are gaining popularity. Though it is difficult to define AI Agents, according to Andrew Ng, AI Agents are autonomous task-masters that execute tasks given to them. Many times these task-masters are LLMs. Examples of the tasks would be classify info and determine the answers, call a tool, retrieve data, create a prompt, run a prompt with an LLM, follow-up on the next stale based on the answer by branching and looping. AI agents could also help in creating responsible AI solutions. How do they do it? The agentic framework could read the output of an agent and check it against the toxicity or other guardrail agents or ask another agent to change the tone of the output. This could go into multiple loops or branches till the output is responsible enough.?

Corporate Governance and Ethical Oversight: Establish corporate boards, ethics committees, and ethics officers to oversee AI ethics, ensuring that ethical guidelines are embedded in AI development and that there is accountability for ethical lapses.

Guardrail: Use Guardrails like Llama Guard or third party solutions (Guardrails AI)? to check the outputs for harmful content?

For additional papers on how to develop harmless, helpful, honest, aligned models, here, here, here, here, here, here, here, and here.

For additional discussion on Responsible AI: here, here, here, here, here, here, and here.

This paper would not have been possible without the research help from Sanik Malepati.

References

?1.?????????? Does Refusal Training in LLMs Generalize to the Past Tense?

2.?????????? Decoupled Refusal Training for improving Safety in LLMs

3.?????????? Refuse Whenever You Feel Unsafe: improving safety in LLMs via decoupled refusal training

4.?????????? Efficient Adversarial Training in LLMs with Continuous Attacks

5.?????????? Adversarial Attacks on LLMs

6.?????????? Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?

领英推荐

AI’s Achilles’ Heel: Breaking down Bias for trust in…

西门子 1 年前

Confronting the Worst Fears About AI—and Why WE…

Thomas Ross 2 个月前

Curious AI #21

Oliver Rochford 11 个月前

7.?????????? The Waluigi Effect?(mega-post)

8.?????????? Improving Alignment and Robustness with Circuit Breakers

9.?????????? The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

10.?????? Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

11.?????? Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

12.?????? Harmbench: A Standardized Evaluation Framework for?Automated Red Teaming?and?Robust Refusal

13.?????? Red Team Redemption: A Structured Comparison of Open-Source Tools for Adversary Emulation

14.?????? A Survey on Gradient Inversion: Attacks, Defenses and Future Directions

15.?????? Minimax Defense against Gradient-based Adversarial Attacks

16.?????? Mixup: BEYOND EMPIRICAL RISK MINIMIZATION

17.?????? Privacy-Preserving Deep Learning via Additively Homomorphic Encryption

18.?????? Batchcrypt: Efficient Homomorphic Encryption for Cross-Silo Federated Learning

19.?????? Federated Learning with Differential Privacy: Algorithms and Performance Analysis

20.?????? H-FL: A Hierarchical Communication-Efficient and Privacy-Protected Architecture for Federated Learning

21.?????? Constitutional AI: Harmlessness from AI Feedback

22.?????? Learning complex goals with iterated amplification

23.?????? Supervising strong learners by amplifying weak experts

24.?????? Large language models can self-improve

25.?????? Guardrailsai.com

26.?????? A General Language Assistant as a Laboratory for Alignment

27.?????? Improving alignment of dialogue agents via targeted human judgements

28.?????? Training language models to follow instructions with human feedback

29.?????? Red Teaming Language Models with Language Models

30.?????? Self-critiquing models for assisting human evaluators

31.?????? LLM Agents can Autonomously Hack Websites

32.?????? ALIGNING AI WITH SHARED HUMAN VALUES

33.?????? Google: RESPONSIBILITY: Our Principles

34.?????? Responsible AI: Ways to Avoid the Dark Side of AI Use

35.?????? Microsoft AI / Responsible AI Principles and approach

36.?????? Responsible AI Principle

37.?????? What is responsible AI?

38.?????? Principles to Practices for Responsible AI: Closing the Gap

39.?????? Building a responsible AI: How to manage the AI ethics debate

Brian Rider

AI Alignment & Safety at Scale

6 个月

Nice breakdown. Responsible AI is necessary for asset managers to fully capitalize the value of inference systems. Neuro-symbolic systems are the missing piece. I ran into this same scenario at Intuit.

bhupendrasinh thakre

Director Data Science | Driving Machine Learning and Artificial Intelligence | Operational Excellence

6 个月

a lot of interesting insights SK Reddy sir

查看更多评论

要查看或添加评论，请登录

SK Reddy的更多文章

Why do GenAI products fail?

2024年10月1日

Why do GenAI products fail?

Target audience for this paper: Executive management and Boards (of Directors) would derive the most benefit from this…

1 条评论
Recent developments in Retrieval Augmented Generation (RAG) Systems

2024年2月26日

Recent developments in Retrieval Augmented Generation (RAG) Systems

With the advent of LLMs, the capability and need for better search gave birth to new methods for search. The historical…

3 条评论
Multi-tenant SaaS Architecture - the Right Way

2024年2月4日

Multi-tenant SaaS Architecture - the Right Way

Introduction A few years ago I founded a start-up and created a SaaS product. It was a multi-tenant architecture with…

1 条评论
Crypto Wallet - a Product Requirements Doc template

2024年1月26日

Crypto Wallet - a Product Requirements Doc template

Though there has been a surge in creating new crypto coins in the last 5 years, the market has not heard much about…
Connect with the Diaspora - a Strategic Plan

2023年7月31日

Connect with the Diaspora - a Strategic Plan

PREFACE In the quest to increase their revenue sources, many organizations are looking to find newer segments of…

3 条评论
Come for product, stay for network

2022年8月2日

Come for product, stay for network

Every enterprise wants to grow. The historical definition of growth has been to get more customers, sales, revenue…

2 条评论
Monetization strategies

2022年3月14日

Monetization strategies

Introduction “Revenue solves all the problems”, said Eric Schmidt, Google ex-CEO. Making money is the lifeblood and…

2 条评论
Tech products fail. Why?

2022年1月20日

Tech products fail. Why?

Introduction Enterprises launch products after much research, investment of resources, and anticipation. Many of the…

3 条评论
Data triggered growth

2021年12月29日

Data triggered growth

According to Crunchbase, the global venture funding in the last decade has significantly increased. Traditional VCs…

1 条评论
Differentiate - How to differentiate in a commodity world (part 2 of 2)

2021年6月29日

Differentiate - How to differentiate in a commodity world (part 2 of 2)

Note: This is part 2 of a two-part paper. For part 1, click here.

2 条评论

See all articles

Responsible AI - Techniques to make your model helpful, harmless and honest

SK Reddy

Head, AI Products at Fidelity Investments

References

领英推荐

SK Reddy的更多文章

社区洞察

其他会员也浏览了

The Perils of Ignoring Tech Checkups for Your AI

#artificialintelligence #120 - AI policy and its impact on technology, society and economy

The Real Risks of Artificial Intelligence

AI: The Next Great Transformational Force in Human History

AI: Our Digital Doppelg?nger or Doom bringer?

Ensuring Human Mastery and Upholding Human Truths in the Age of Automation

AI TRiSM, let us talk about Trust

GenAI: most important tech transition since internet

The Critical Factor Your CEO Overlooks When It Comes To AI

12 AI Predictions for 2025: What CIO Magazine Says and Why It Matters

References

领英推荐

SK Reddy的更多文章

Why do GenAI products fail?

Recent developments in Retrieval Augmented Generation (RAG) Systems

Multi-tenant SaaS Architecture - the Right Way

Crypto Wallet - a Product Requirements Doc template

Connect with the Diaspora - a Strategic Plan

Come for product, stay for network

Monetization strategies

Tech products fail. Why?

Data triggered growth

Differentiate - How to differentiate in a commodity world (part 2 of 2)

社区洞察

其他会员也浏览了

The Perils of Ignoring Tech Checkups for Your AI

#artificialintelligence #120 - AI policy and its impact on technology, society and economy

The Real Risks of Artificial Intelligence

AI: The Next Great Transformational Force in Human History

AI: Our Digital Doppelg?nger or Doom bringer?

Ensuring Human Mastery and Upholding Human Truths in the Age of Automation

AI TRiSM, let us talk about Trust

GenAI: most important tech transition since internet

The Critical Factor Your CEO Overlooks When It Comes To AI

12 AI Predictions for 2025: What CIO Magazine Says and Why It Matters