How to humanely train your model?
The new year is here and, after 5 weeks live, ChatGPT is still the hippest thing around. Even my mom knows about it. And finds it “magical”. Just ask all the people in awe in countless LinkedIn posts.
Do you feel like opening the box with us, to better understand what is at stake and how much human is in the loop for this magic to happen?
As we all noticed, ChatGPT is a very clever chatbot. It relies on the resources of a Large Language Model (LLM) previously developed and continuously improved by OpenAI, a capped-profit company based in San Francisco, backed by Microsoft and co-founded with early backing from Elon Musk.
Language models are designed to understand human language and to produce humanly understandable language. These models are built on mathematics and statistics and are trained on huge quantities of varied data. Most of the time, companies developing such models rely on public data from the internet, such as the corpus offered by Common Crawl. The usual assumption is: the more data, the better the results.
From the next word to the full answer and more
A working LLM, i.e. an algorithm trained on human-produced data, is a model able to predict the next word in a given sentence.
Me: “I am hungry, I want to …”
Model: “eat!”
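To make this concrete, here is a minimal sketch of next-word prediction using the Hugging Face transformers library and the publicly available GPT-2 model (GPT-2 is our stand-in for illustration; ChatGPT's own weights are not public):

```python
# A minimal next-word prediction sketch (illustrative only).
from transformers import pipeline

# Load a small, public model; ChatGPT itself is far larger and not public.
generator = pipeline("text-generation", model="gpt2")

prompt = "I am hungry, I want to"
# Ask the model for a single next token, greedily (no sampling).
result = generator(prompt, max_new_tokens=1, do_sample=False)
print(result[0]["generated_text"])  # e.g. "I am hungry, I want to eat"
```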
Your model did a great job. But you expect it to do better: to understand more complex sentences, to follow instructions rather than merely complete sentences, and overall to provide you with answers, ideally relevant and smart ones.
Well, this is happening right now, but this is not magic at all, this is all human! And that’s from ChatGPT itself:
What dataset was used to train you?
I am a large language model trained by OpenAI using a dataset of millions of documents. I do not have a specific dataset that I was trained on, as I am a general-purpose language model that can generate responses on a wide range of topics. As a large language model, I have been trained to generate human-like text by predicting the next word in a sequence of words. I use a process called unsupervised learning, where I learn to generate text by analyzing the patterns and structures in the training data without explicit labels or supervision.
OpenAI explains in this article that they have been able to achieve such results by using a technique named reinforcement learning from human feedback (RLHF).
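At the heart of RLHF is a reward model trained on human preferences: labelers compare two answers to the same prompt, and the reward model learns to score the preferred one higher. Here is a minimal sketch of the pairwise ranking loss described in the InstructGPT paper (the function name and toy values are our own):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor,
                        reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model: push the score of the
    human-preferred answer above the score of the rejected one,
    i.e. -log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scalar rewards for two prompt/answer pairs:
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(reward_ranking_loss(chosen, rejected))
```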
Most of the time, when we discuss models’ performance, we focus on the model’s answers and tend to forget about the human part. This LLM is unsupervised, which means that no one (= no human) has ever spent a minute teaching it anything. Everything the model knows about predicting the next word in a given sentence, it has learned by itself, by going through huge amounts of data and applying mathematical and statistical reasoning to it. But now that the model knows all these things, you want to teach it to display the best answers. To do that, you will reinforce its learning based on human preferences. That’s where the human breaks into the loop.
Where the human breaks into the loop
As described by OpenAI, human interaction was used to curate questions that would be relevant to ask the model, as well as to rate the model’s answers from a human perspective.
Back in 2019, when developing GPT-2, OpenAI relied on ScaleAI’s infrastructure to set up a human-in-the-loop process.
After having developed GPT-3, the decision was made to train the model toward more “alignment”, i.e. to better comply with the user’s intent and to follow instructions.
Shifting from a model that predicts the continuation of a given sentence to a model able to generate content, answer open or closed questions, chat, rewrite or summarize text, or categorize data made it necessary to train it adequately. InstructGPT was created as a fine-tuned version of GPT-3.
When you want to train a model, the first step generally is to gather enough data: enough data to split the data set into two subsets (one for training and one for testing), enough data to ensure the least biased data set possible by introducing diversity into it, and enough data to cover enough cases to train your model on.
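For instance, a train/test split could look like the following scikit-learn sketch (the 80/20 ratio and the prompt list are our assumptions, not figures from the paper):

```python
from sklearn.model_selection import train_test_split

# Hypothetical prompt data set gathered for fine-tuning.
prompts = [
    "List all the US states",
    "Summarize this text in one sentence",
    "Translate this sentence into French",
    "Find the odd one in this list: cake, orange, cat",
]

# Hold out 20% of the prompts for testing.
train_prompts, test_prompts = train_test_split(
    prompts, test_size=0.2, random_state=42
)
```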
OpenAI gathered this data using two different sources: users’ inputs and a human-in-the-loop task.
For users’ input, they upcycled prompts typed in by the users of a previous InstructGPT version (available in the “playground”). Do not choke: capitalizing on users’ inputs or answers is a very common practice to train a model. You, the user, are basically considered a human-in-the-loop agent. This is one of the reasons you should be careful with your interactions with an AI model… That said, to ensure diversity in the data set, OpenAI limited the number of prompts to 200 per user ID and deduplicated the prompts across all users, while respecting PII requirements by manually removing any personal information from the data set.
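A sketch of what that curation step could look like (we use exact-match deduplication for simplicity; OpenAI describes a fuzzier heuristic based on shared prefixes):

```python
from collections import defaultdict

MAX_PROMPTS_PER_USER = 200  # cap mentioned in the InstructGPT paper

def curate(prompts):
    """prompts: iterable of (user_id, text) pairs.
    Deduplicate prompts across all users and cap each user's
    contribution, mirroring the curation steps OpenAI describes."""
    seen = set()
    per_user = defaultdict(int)
    curated = []
    for user_id, text in prompts:
        # Skip exact duplicates and contributions beyond the per-user cap.
        if text in seen or per_user[user_id] >= MAX_PROMPTS_PER_USER:
            continue
        seen.add(text)
        per_user[user_id] += 1
        curated.append((user_id, text))
    return curated
```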
To improve the quality of the data set by including more varied prompts, even prompts that users would never have thought of, OpenAI relied on a crowd of labelers to handcraft specific prompts. These labelers are also called annotators, contributors, or click-workers. OpenAI very transparently stated that they hired 40 workers on Upwork and outsourced part of the tasks to ScaleAI’s crowd. These workers are described as being mainly English speakers from the US or South Asia.
OpenAI claimed to pay close attention to recruiting diverse contributors.
The workers were individually assessed before enrollment based on different axes:
agreement on sensitive speech flagging, agreement on ranking, sensitive demonstration writing, and self-assessed ability to identify sensitive speech for different groups.
For all of these criteria, the OpenAI team prepared tasks that they themselves labeled beforehand or tasks that they would assess afterward.
When you curate a crowd for training purposes, you want it to be as diverse, efficient, and open-minded as possible, because you want to reduce bias.
Humanely crafted data and humanely tested outputs
Contributors were properly trained before working on the tasks and were able to ask questions along the way to clarify instructions, edge cases, and corner cases.
The primary task these workers were involved in was hence to craft the perfect data set. One task was to imagine instructions for the model (“list all the US states”). Another one was to come up with instructions paired with queries and answers (“find the odd one in this list: cake, orange, cat => cat is the odd one; sock, book, air => air is the odd one; etc.”). A third one was to create prompts for specific use cases. All of this handcrafted data was used in different ways to train the model.
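Such handcrafted records might be structured along these lines (the field names are our own illustration, not OpenAI's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    """One handcrafted instruction/query/answer record."""
    instruction: str  # e.g. "Find the odd one in this list"
    query: str        # e.g. "sock, book, air"
    answer: str       # e.g. "air is the odd one"

demo = Demonstration(
    instruction="Find the odd one in this list",
    query="sock, book, air",
    answer="air is the odd one",
)
```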
As per the OpenAI article, we understand that human labelers manually created 22,556 prompts to train and test the model.
Lastly, labelers were used to validate the model’s output and especially to assess its alignment with the users’ intent. For this purpose, OpenAI went even further in this highly qualitative human-in-the-loop process by setting up a second cohort of workers, not involved in the data set creation, in order to have them validate the model’s output and make sure the validation done by the first cohort was not biased. While the agreement rate among the workers involved in the data set creation was around 76%, it was 77% among the held-out cohort.
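Conceptually, such an agreement rate is just the fraction of items on which two labelers (or two cohorts) give the same judgment; a minimal sketch follows (the paper's exact protocol, based on comparing rankings pair by pair, may differ):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers give the same judgment."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example: two labelers pick the best of three outputs A/B/C
# for four prompts; they agree on 3 of 4, i.e. 75%.
print(agreement_rate(["A", "B", "A", "C"], ["A", "B", "C", "C"]))
```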
These two cohorts were asked to evaluate the model’s alignment, where alignment is defined by OpenAI as the model’s ability to answer in a “helpful, honest, and harmless” way. A helpful model is a model that follows instructions. An honest model being difficult to define, labelers were instead asked to evaluate the model’s truthfulness. It was even more ambiguous to decide upon the model’s harmlessness, so labelers were asked to consider the output’s toxicity based on the prompt context:
is this output “inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content”?
One of the main conclusions of OpenAI’s work around human-based reinforcement is that, with regard to model alignment, this technique currently sounds more rewarding (“more so than a 100x model size increase”) than investing in training larger language models. This may mean that if one of the main goals is to develop models that behave closer to a human way of thinking, it could be relevant to invest at least as much in human reinforcement as in bigger data sets and model training.
But as stated by OpenAI, the alignment of the model is only meaningful relative to the group the model is aligned against, i.e. the two cohorts of labelers. These cohorts are kept small on purpose to smooth the back-and-forth discussion around instructions and specific cases; hence they do not pretend to represent the entirety of GPT-3 users. As a simple example, all instructions are in English and the entire crowd is English-speaking, which is not the case for all GPT-3 users.
Who are these humans in the loop?
In an anonymous survey answered by 19 of their crowd workers (out of the more than 40 hired for the Upwork contingent alone), we learn that 50% were male, 5.6% self-declared as non-binary, and 44% were women; 75% were no older than 35, and 88% held a college or master’s degree. This population is quite balanced in terms of gender, though it does not mirror the overall human distribution, with a probable over-representation of people aged 18 to 35. The high level of education is not representative either, but it is rather an opportunity than a risk of bias, as it probably contributes to ensuring high-quality data.
Their crowd workers not being representative is openly and honestly listed as a caveat by OpenAI:
This group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.
As for the data used, OpenAI empirically estimated that 96% of the data set (made of 110,000 data points) was English, while at least 20 other languages shared the remaining 4% (Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian, and Tibetan).
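OpenAI does not say which classifier produced that estimate; as an illustration, one could approximate such a breakdown with the langdetect package (our choice of library, not theirs):

```python
from collections import Counter

from langdetect import detect  # pip install langdetect

def language_distribution(prompts):
    """Estimate the share of each detected language in a prompt set."""
    counts = Counter(detect(p) for p in prompts)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Toy example with English, French, and German prompts:
print(language_distribution([
    "List all the US states",
    "Résume ce texte en une phrase",
    "Fasse diesen Text zusammen",
]))
```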
To gain more universal alignment, we would benefit from a wider crowd, with more diverse profiles, backgrounds, experiences, educations, Weltanschauungen… We should also consider that the model is aligned against these contractors’ expectations, which are set by OpenAI in clear instructions (e.g. prefer truthfulness over an exact answer) in exchange for a wage. The effort to be put into human-in-the-loop processes is massive if you want to reach an unprecedented span of users while ensuring a de-biased model.
The magic is all thanks to the hard, meticulous, manual work of a handful of engineers and data scientists, plus around a hundred human beings who crafted more than 22,000 prompts and carefully assessed the associated outputs.
And Abracadabra!
Note: this article is mostly based on these two papers: Training language models to follow instructions with human feedback from OpenAI and How to Label 1M Data Points/Week from ScaleAI.
OpenAI regularly posts release notes and research notes that help to better understand how their models are created and trained.
2 年Audrey, Marley, Dan, Michael and I just created a group for Data Ops specialists. We are a group of passionate Data Operations specialists who believe it matters to bring awareness about our essential contribution to data-centric models. In this group, we want to invite all Data Ops specialists to discuss challenges, best practices, perspectives related to human in the loop process for Artificial Intelligence and share ressources about education and career path to navigate this booming space. If you are a product/program/project manager involved in data sourcing, data labeling, data cleaning, quality control, human in the loop, human judgment and other related topics, this group is for you! Join us ?? https://www.dhirubhai.net/groups/12772043/