Data Behind ChatGPT

Since its launch in November 2022, ChatGPT has sparked a great deal of excitement and interest: some people are curious about how the model was developed, others about how it can be used. In this article, I shift the focus to the model's data, including the training datasets used and the methods used to prepare and label them.

What is ChatGPT?

ChatGPT is a large language model developed by OpenAI. It is trained on a massive dataset to predict the next word in a sequence given the previous words, and thereby to continue a text sentence by sentence.
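To make that objective concrete, here is a minimal, hypothetical sketch of next-word prediction using simple bigram counts over a toy corpus. ChatGPT itself uses a large transformer network rather than counts, so this only illustrates the idea of predicting the next word from the previous ones.

```python
from collections import Counter, defaultdict

# A toy corpus; the real model was trained on hundreds of billions of tokens.
corpus = "the cat sat on the mat . the cat ran to the door .".split()

# Count which word follows each word (a bigram model: the simplest
# possible version of "predict the next word given the previous words").
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen after `word`."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' ("cat" follows "the" twice in the corpus)
```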

ChatGPT underwent multiple training processes including:

  • Fine-tuning the model on a dataset of prompt-answer pairs, where the answers were written by expert human labelers.
  • Reinforcement Learning from Human Feedback (RLHF), a deep reinforcement learning technique that incorporates human feedback into the learning process. In this approach, the model generated multiple responses to a given prompt, and a human labeler ranked them from best to worst; the ranked data was then used to train the model (a minimal sketch of the ranking objective follows this list).
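As a rough illustration of how ranked responses become a training signal, here is a minimal sketch of the pairwise ranking objective used in InstructGPT-style reward modeling, written in PyTorch. The scores below are made-up placeholders, not values from the actual system.

```python
import torch
import torch.nn.functional as F

def ranking_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when the order is wrong.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical reward-model scores for two responses to the same prompt,
# where a human labeler ranked the first response above the second.
score_best = torch.tensor([1.8])
score_worst = torch.tensor([0.3])
print(ranking_loss(score_best, score_worst))  # low loss: the ranking is respected
```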

Training Data

ChatGPT was trained on an enormous amount of textual data obtained from various online sources, and the model was designed with 175 billion trainable parameters. The training set was curated from multiple sources such as books, web text, Wikipedia, articles, and other written material available on the internet.

[Figure: Datasets]

About 300 billion words (499 billion tokens) were fed into the model. In NLP, the terms "word" and "token" are often used interchangeably, but they have slightly different meanings.

To illustrate the difference, consider the sentence "He is going to school." It contains five words: "He", "is", "going", "to", and "school", but six tokens: "He", "is", "going", "to", "school", and "." (the final period counts as a token).
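A quick way to see this distinction is to split the sentence programmatically. The sketch below approximates word and token boundaries with a simple regular expression; real systems such as ChatGPT use subword (byte-pair encoding) tokenizers, so this only illustrates the word/token difference, not the actual tokenizer.

```python
import re

sentence = "He is going to school."

# Words: alphabetic units only, punctuation excluded.
words = re.findall(r"[A-Za-z']+", sentence)

# Tokens: the same units plus standalone punctuation marks.
tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sentence)

print(len(words), words)    # 5 ['He', 'is', 'going', 'to', 'school']
print(len(tokens), tokens)  # 6 ['He', 'is', 'going', 'to', 'school', '.']
```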

Data Preparation and Data Labeling

Undoubtedly, preparing the massive amount of data needed to train ChatGPT required a significant effort, including a large workforce of contractors to label and prepare each piece of data. To accomplish this, OpenAI partnered with Sama, a San Francisco-based firm that employs workers in Kenya, Uganda, and India to label data for Silicon Valley clients such as Google, Meta, and Microsoft. Sama engaged thousands of data labelers in Kenya, who were paid between $1.32 and $2 per hour to label text. The annotated data covered tasks such as Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), Sentiment Analysis, Coreference Resolution, and Text Classification. Here are some data annotation examples, followed by a short code sketch of what such annotations look like in practice:

  • Part-of-Speech (POS) Tagging: labeling each word in a sentence with its part of speech (e.g., noun, verb, adjective) to help ChatGPT understand sentence grammar and the connections between words. For example, in the sentence "He ran to catch the bus," ChatGPT would recognize each word as follows:

- "He" is a personal pronoun (PRP)

- "ran" is a verb in the past tense (VBD)

- "to" is a preposition (TO)

- "catch" is a verb in the base form (VB)

- "the" is a definite article (DT)

- "bus" is a noun (NN)

  • Named Entity Recognition (NER): identifying and tagging named entities (e.g., people, organizations, locations, dates) in a sentence. This helped ChatGPT understand the meaning of words and respond more accurately to questions. For example, in the sentence "Ahmed saw Khalifa Tower on his trip to Dubai," ChatGPT would recognize "Khalifa Tower" as a named entity referring to a location, "Dubai" as a location, and "Ahmed" as a person.
  • Sentiment Analysis: assigning sentiment labels (e.g., positive, neutral, negative) to text data to capture the emotional tone of a sentence. This is useful for responding to questions about opinions and emotions and for avoiding biased or harmful responses. For example, in the sentence "I like Durham University," ChatGPT would recognize a positive sentiment.
  • Coreference Resolution: identifying and resolving references to the same entity across different parts of a text. This helped ChatGPT understand the context of a sentence and respond more coherently to questions. For example, in "Ahmed saw Maria in the park. He waved to her," ChatGPT would recognize that "he" refers to Ahmed and "her" refers to Maria.
  • Text Classification: labeling text data with predefined categories (e.g., news articles, product reviews, sports articles, etc.). This helped ChatGPT understand the genre or topic of a text and generate more relevant responses. For example, if ChatGPT receives a query about sports, it would recognize the topic and generate a response appropriate to a sports article.
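To see what annotations like these look like in practice, here is a short sketch using the spaCy library; this tooling is an assumption for illustration, as the article does not say what tools Sama's labelers actually used. It tags the POS example above and extracts entities from the NER example; the exact entity labels you get depend on the pretrained model.

```python
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Part-of-Speech tagging on the article's example sentence.
for token in nlp("He ran to catch the bus."):
    print(token.text, token.tag_)  # He PRP, ran VBD, to TO, catch VB, the DT, bus NN, . .

# Named Entity Recognition on the article's example sentence.
for ent in nlp("Ahmed saw Khalifa Tower on his trip to Dubai.").ents:
    print(ent.text, ent.label_)    # e.g. Ahmed PERSON, Dubai GPE (labels vary by model)
```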


References:

https://blog.invgate.com/chatgpt-statistics

https://time.com/6247678/openai-chatgpt-kenya-workers/

https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot
