Data Behind ChatGPT
Khaled Abousamak, PMP, CDMP
Since its launch in November 2022, ChatGPT has sparked enormous excitement and curiosity: some people want to know how the model was developed, others how it can be used. In this article, I shift the focus to the model's data aspect, including the training datasets used and the methods used to prepare and label the data.
What is ChatGPT?
ChatGPT is a language model developed by OpenAI. It is trained on a massive dataset to predict the next word in a sequence given the previous words, and the next sentence given the previous sentences.
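To make next-word prediction concrete, here is a minimal sketch that asks the openly available GPT-2 model (via the Hugging Face transformers library) for its most likely next tokens. This is an illustrative stand-in of my choosing; ChatGPT's own weights are not public.

```python
# Minimal sketch: next-token prediction with GPT-2 as a stand-in
# for ChatGPT (pip install torch transformers).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "He is going to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the next token, given the prompt so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}  p={prob.item():.3f}")
```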
ChatGPT underwent multiple training stages, including large-scale pre-training on text data, supervised fine-tuning, and reinforcement learning from human feedback (RLHF).
Training Data
ChatGPT was trained on an enormous amount of textual data obtained from various online sources. The model was designed with 175 billion trainable parameters, and its training set was curated from multiple sources such as books, web texts, Wikipedia, articles, and other written material available on the internet.
About 300 billion words (499 billion tokens) were fed into the model. In NLP, the terms "word" and "token" are often used interchangeably, but they have slightly different meanings.
To illustrate the difference, consider the sentence "He is going to school." It contains five words: "He", "is", "going", "to", and "school", but six tokens: "He", "is", "going", "to", "school", and the final period ".".
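The exact split depends on the tokenizer in use. Below is a minimal sketch that compares word and token counts for this sentence using OpenAI's open-source tiktoken library with the GPT-2 byte-pair encoding (my choice for illustration; the article does not name a specific tokenizer).

```python
# Minimal sketch: words vs. tokens (pip install tiktoken).
# The GPT-2 encoding is an illustrative choice; other encodings
# may split the sentence differently.
import tiktoken

sentence = "He is going to school."

# Naive word count: strip the final period, split on whitespace.
words = sentence.rstrip(".").split()
print(len(words), words)  # 5 words

enc = tiktoken.get_encoding("gpt2")
token_ids = enc.encode(sentence)
print(len(token_ids))                        # token count per this encoding
print([enc.decode([t]) for t in token_ids])  # the text of each token
```

Punctuation such as the final period typically becomes its own token, which is one reason token counts usually exceed word counts.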
Data Preparation and Data Labeling
Undoubtedly, preparing the massive amount of data needed to train ChatGPT required significant effort and a large workforce of contractors to label and prepare it. To accomplish this, OpenAI partnered with Sama, a San Francisco-based firm that employs workers in Kenya, Uganda, and India to label data for Silicon Valley clients such as Google, Meta, and Microsoft. Sama engaged thousands of data labelers in Kenya, who were paid between $1.32 and $2 per hour to label text. The annotated data covered tasks such as Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Sentiment Analysis, Coreference Resolution, and Text Classification. Here is a POS-tagging example for the sentence "He ran to catch the bus." (a code sketch follows the list):
- "He" is a personal pronoun (PRP)
- "ran" is a verb in the past tense (VBD)
- "to" is a preposition (TO)
- "catch" is a verb in the base form (VB)
- "the" is a definite article (DT)
- "bus" is a noun (NN)
References:
https://blog.invgate.com/chatgpt-statistics
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot
Founder & CEO SimpleAccounts.io at Data Innovation Technologies | Partner & Director of Strategic Planning & Relations at HiveWorx
8 个月Khaled, thanks for sharing!
Advisory Manager Data Transformation, Analytics & AI - Digital Lighthouse - Centre of Excellence for Data, AI & Emerging Technologies
1 年Thanks for sharing the knowledge