How to humanely train your model?
The new year is here and, after 5 weeks live, ChatGPT is still the hippest thing around. Even my mom knows about it. And finds it “magical”. Just ask all the people in awe in countless LinkedIn posts.
Do you feel like opening the box with us, to better understand what is at stake and how much human is in the loop for this magic to happen?
As we all noticed, ChatGPT is a very clever chatbot. It relies on the resources of a Large Language Model (LLM) previously developed and continuously improved by OpenAI, a capped-profit company based in San Francisco, backed by Microsoft and co-founded with early backing from Elon Musk.
Language models are designed to understand human language and to produce humanly understandable language. These models are built on mathematics and statistics and are trained on huge quantities of varied data. Most of the time, companies developing such models rely on public data from the internet, such as the corpus offered by Common Crawl. The usual assumption is: the more data, the better the results.
From the next word to the full answer and more
A working LLM, i.e. an algorithm trained on human-produced data, is a model able to predict the next word in a given sentence.
Me: “I am hungry, I want to …”
Model: “eat!”
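To make this concrete, here is a minimal sketch of next-word prediction using the Hugging Face transformers library and the publicly available GPT-2 model (GPT-2 is our stand-in for illustration; ChatGPT's own weights are not public):

```python
# A minimal next-word prediction sketch (illustrative only).
from transformers import pipeline

# Load a small, public model; ChatGPT itself is far larger and not public.
generator = pipeline("text-generation", model="gpt2")

prompt = "I am hungry, I want to"
# Ask the model for a single next token, greedily (no sampling).
result = generator(prompt, max_new_tokens=1, do_sample=False)
print(result[0]["generated_text"])  # e.g. "I am hungry, I want to eat"
```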
Your model did a great job. But you expect it to do better: to understand more complex sentences, to follow instructions rather than merely complete sentences, and overall to provide you with answers, ideally relevant and smart ones.
Well, this is happening right now, but this is not magic at all, this is all human! And that’s from ChatGPT itself:
What dataset was used to train you?
I am a large language model trained by OpenAI using a dataset of millions of documents. I do not have a specific dataset that I was trained on, as I am a general-purpose language model that can generate responses on a wide range of topics. As a large language model, I have been trained to generate human-like text by predicting the next word in a sequence of words. I use a process called unsupervised learning, where I learn to generate text by analyzing the patterns and structures in the training data without explicit labels or supervision.
OpenAI explains in this article that they have been able to achieve such results by using a technique named reinforcement learning from human feedback (RLHF).
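At the heart of RLHF is a reward model trained on human preferences: labelers compare two answers to the same prompt, and the reward model learns to score the preferred one higher. Here is a minimal sketch of the pairwise ranking loss described in the InstructGPT paper (the function name and toy values are our own):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor,
                        reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model: push the score of the
    human-preferred answer above the score of the rejected one,
    i.e. -log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scalar rewards for two prompt/answer pairs:
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
print(reward_ranking_loss(chosen, rejected))
```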
Most of the time, when we discuss models’ performance, we focus on the model’s answers and tend to forget about the human part. This LLM is unsupervised, which means that no one (= no human) has ever spent a minute teaching it anything. Everything the model knows about predicting the next word in a given sentence, it has learned by itself, by going through huge amounts of data and applying mathematical and statistical reasoning to it. But now that the model knows all these things, you want to teach it to display the best answers. To do that, you will reinforce its learning based on human preferences. That’s where the human breaks into the loop.
Where the human breaks into the loop
As described by OpenAI, human interaction was used to curate questions that would be relevant to ask the model, as well as to rate the model’s answers from a human perspective.
Back in 2019, when developing GPT-2, OpenAI relied on ScaleAI’s infrastructure to set up a human-in-the-loop process.
After having developed GPT-3, the decision was made to train the model toward more “alignment”, i.e. to better comply with the user’s intent and to follow instructions.
Shifting from a model that predicts the continuation of a given sentence to a model able to generate content, answer open or closed questions, chat, rewrite or summarize text, or categorize data made it necessary to train it adequately. InstructGPT was created as a fine-tuned version of GPT-3.
When you want to train a model, the first step generally is to gather enough data: enough data to split the data set into two subsets (one for training and one for testing), enough data to ensure the least biased data set possible by introducing diversity into it, and enough data to cover enough cases to train your model on.
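For instance, a train/test split could look like the following scikit-learn sketch (the 80/20 ratio and the prompt list are our assumptions, not figures from the paper):

```python
from sklearn.model_selection import train_test_split

# Hypothetical prompt data set gathered for fine-tuning.
prompts = [
    "List all the US states",
    "Summarize this text in one sentence",
    "Translate this sentence into French",
    "Find the odd one in this list: cake, orange, cat",
]

# Hold out 20% of the prompts for testing.
train_prompts, test_prompts = train_test_split(
    prompts, test_size=0.2, random_state=42
)
```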
OpenAI gathered this data using two different sources: users’ inputs and a human-in-the-loop task.
For users’ input, they upcycled prompts typed in by the users of a previous InstructGPT version (available in the “playground”). Do not choke: capitalizing on users’ inputs or answers is a very common practice to train a model. You, the user, are basically considered a human-in-the-loop agent. This is one of the reasons you should be careful with your interactions with an AI model… That said, to ensure diversity in the data set, OpenAI limited the number of prompts to 200 per user ID and deduplicated the prompts across all users, while respecting PII requirements by manually removing any personal information from the data set.
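A sketch of what that curation step could look like (we use exact-match deduplication for simplicity; OpenAI describes a fuzzier heuristic based on shared prefixes):

```python
from collections import defaultdict

MAX_PROMPTS_PER_USER = 200  # cap mentioned in the InstructGPT paper

def curate(prompts):
    """prompts: iterable of (user_id, text) pairs.
    Deduplicate prompts across all users and cap each user's
    contribution, mirroring the curation steps OpenAI describes."""
    seen = set()
    per_user = defaultdict(int)
    curated = []
    for user_id, text in prompts:
        # Skip exact duplicates and contributions beyond the per-user cap.
        if text in seen or per_user[user_id] >= MAX_PROMPTS_PER_USER:
            continue
        seen.add(text)
        per_user[user_id] += 1
        curated.append((user_id, text))
    return curated
```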
To improve the quality of the data set by including more varied prompts, even prompts that users would never have thought of, OpenAI relied on a crowd of labelers to handcraft specific prompts. These labelers are also called annotators, contributors, or click-workers. OpenAI very transparently stated that they hired 40 workers on Upwork and outsourced part of the tasks to ScaleAI’s crowd. These workers are described as being mainly English speakers from the US or South Asia.
OpenAI claimed to pay close attention to recruiting diverse contributors.
The workers were individually assessed before enrollment based on different axes:
agreement on sensitive speech flagging, agreement on ranking, sensitive demonstration writing, and self-assessed ability to identify sensitive speech for different groups.
For all of these criteria, the OpenAI team prepared tasks that they themselves labeled beforehand or tasks that they would assess afterward.
When you curate a crowd for training purposes, you want it to be as diverse, efficient, and open-minded as possible, because you want to reduce bias.
Humanely crafted data and humanely tested outputs
Contributors were properly trained before working on the tasks and were able to ask questions along the way to clarify instructions, edge cases, and corner cases.
The primary task these workers were involved in was hence to craft the perfect data set. One task was to imagine instructions for the model (“list all the US states”). Another one was to come up with instructions paired with queries and answers (“find the odd one in this list: cake, orange, cat => cat is the odd one; sock, book, air => air is the odd one; etc.”). A third one was to create prompts for specific use cases. All of this handcrafted data was used in different ways to train the model.
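Such handcrafted records might be structured along these lines (the field names are our own illustration, not OpenAI's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    """One handcrafted instruction/query/answer record."""
    instruction: str  # e.g. "Find the odd one in this list"
    query: str        # e.g. "sock, book, air"
    answer: str       # e.g. "air is the odd one"

demo = Demonstration(
    instruction="Find the odd one in this list",
    query="sock, book, air",
    answer="air is the odd one",
)
```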
As per the OpenAI article, we understand that human labelers manually created 22,556 prompts to train and test the model.
Lastly, labelers were used to validate the model’s output and especially to assess its alignment with the users’ intent. For this purpose, OpenAI went even further in this highly qualitative human-in-the-loop process by setting up a second cohort of workers, not involved in the data set creation, in order to have them validate the model’s output and make sure the validation done by the first cohort was not biased. While the agreement rate among the workers involved in the data set creation was around 76%, it was 77% among the held-out cohort.
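Conceptually, such an agreement rate is just the fraction of items on which two labelers (or two cohorts) give the same judgment; a minimal sketch follows (the paper's exact protocol, based on comparing rankings pair by pair, may differ):

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers give the same judgment."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example: two labelers pick the best of three outputs A/B/C
# for four prompts; they agree on 3 of 4, i.e. 75%.
print(agreement_rate(["A", "B", "A", "C"], ["A", "B", "C", "C"]))
```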
These two cohorts were asked to evaluate the model’s alignment, where alignment is defined by OpenAI as the model’s ability to answer in a “helpful, honest, and harmless” way. A helpful model is a model that follows instructions. An honest model being difficult to define, labelers were instead asked to evaluate the model’s truthfulness. It was even more ambiguous to decide upon the model’s harmlessness, so labelers were asked to consider the output’s toxicity based on the prompt context:
is this output “inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content”?
One of the main conclusions of OpenAI’s work around human-based reinforcement is that, with regard to model alignment, this technique currently sounds more rewarding (“more so than a 100x model size increase”) than investing in training larger language models. This may mean that if one of the main goals is to develop models that behave closer to a human way of thinking, it could be relevant to invest at least as much in human reinforcement as in bigger data sets and model training.
But as stated by OpenAI, the alignment of the model is only meaningful relative to the group the model is aligned against, i.e. the two cohorts of labelers. These cohorts are kept small on purpose to smooth the back-and-forth discussion around instructions and specific cases; hence they do not pretend to represent the entirety of GPT-3 users. As a simple example, all instructions are in English and the entire crowd is English-speaking, which is not the case for all GPT-3 users.
Who are these humans in the loop?
In an anonymous survey answered by 19 of their crowd workers (out of the more than 40 hired for the Upwork contingent alone), we learn that 50% were male, 5.6% self-declared as non-binary, and 44% were women; 75% were no older than 35, and 88% held a college or master’s degree. This population is quite balanced in terms of gender, though it does not mirror the overall human distribution, with a probable over-representation of people aged 18 to 35. The high level of education is not representative either, but it is rather an opportunity than a risk of bias, as it probably contributes to ensuring high-quality data.
Their crowd workers not being representative is openly and honestly listed as a caveat by OpenAI:
This group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.
As for the data used, OpenAI empirically estimated that 96% of the data set (made of 110,000 data points) was English, while at least 20 other languages shared the remaining 4% (Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian, and Tibetan).
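OpenAI does not say which classifier produced that estimate; as an illustration, one could approximate such a breakdown with the langdetect package (our choice of library, not theirs):

```python
from collections import Counter

from langdetect import detect  # pip install langdetect

def language_distribution(prompts):
    """Estimate the share of each detected language in a prompt set."""
    counts = Counter(detect(p) for p in prompts)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Toy example with English, French, and German prompts:
print(language_distribution([
    "List all the US states",
    "Résume ce texte en une phrase",
    "Fasse diesen Text zusammen",
]))
```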
To gain more universal alignment, we would benefit from a wider crowd, with more diverse profiles, backgrounds, experiences, educations, Weltanschauungen… We should also consider that the model is aligned against these contractors’ expectations, which are set by OpenAI in clear instructions (e.g. prefer truthfulness over an exact answer) in exchange for a wage. The effort to be put into human-in-the-loop processes is massive if you want to reach an unprecedented span of users while ensuring a de-biased model.
The magic is all thanks to the hard, meticulous, manual work of a handful of engineers and data scientists, plus around a hundred human beings who crafted more than 22,000 prompts and carefully assessed the associated outputs.
And Abracadabra!
Note: this article is mostly based on these two papers: Training language models to follow instructions with human feedback from OpenAI and How to Label 1M Data Points/Week from ScaleAI.
OpenAI regularly posts release notes and research notes that help to better understand how their models are created and trained.
2 年Audrey, Marley, Dan, Michael and I just created a group for Data Ops specialists. We are a group of passionate Data Operations specialists who believe it matters to bring awareness about our essential contribution to data-centric models. In this group, we want to invite all Data Ops specialists to discuss challenges, best practices, perspectives related to human in the loop process for Artificial Intelligence and share ressources about education and career path to navigate this booming space. If you are a product/program/project manager involved in data sourcing, data labeling, data cleaning, quality control, human in the loop, human judgment and other related topics, this group is for you! Join us ?? https://www.dhirubhai.net/groups/12772043/