All You Need to Know Before Using Free GPT3 APIs

As many of you know, OpenAI made some of its APIs available to the public last month. Since then, many people have been reading the documentation and testing GPT3 model performance on their own data. This blog is a short summary of how to use the OpenAI APIs to fine-tune a GPT3 model in Python, and of the limitations of these APIs based on my experience so far.

In this article, I explain how to use the GPT3 APIs in Python and fine-tune a model for a natural language generation task.

Dataset used: e2e_nlg

Task: Train a model to generate restaurant descriptions from meaning representations. The model takes structured data about a restaurant as input and generates a natural-language sentence that presents the different aspects of that data.

Code

  • Import the dataset and necessary libraries.

#the calls in this post use the pre-1.0 openai Python interface
!pip install "openai<1.0"
!pip install datasets
from datasets import load_dataset
import json
import os
import re
import sys
import time

import numpy as np
import openai
import pandas as pd
import requests

  • Load the e2e_nlg dataset.

dataset = load_dataset("e2e_nlg")        

  • Separate out the train, test, and validation data from the dataset.

X_train = dataset['train']
X_validation = dataset['validation']
X_test = dataset['test']        

  • Initialize the openai API key and organization.

openai.organization = "org-2*****************"
openai.api_key = "sk-R**************************"        
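
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key and organization ID are stored in environment variables named OPENAI_API_KEY and OPENAI_ORGANIZATION (both names, and the helper read_credentials, are assumptions for illustration):

```python
import os


def read_credentials(env=os.environ):
    """Return (api_key, organization) read from the environment.

    The variable names are assumptions made for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the notebook.")
    organization = env.get("OPENAI_ORGANIZATION")  # may be None
    return api_key, organization


# Usage in the notebook (same attributes as in the cell above):
# openai.api_key, openai.organization = read_credentials()
```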

  • Define the file paths and goals.

#define all the data paths here
#path to store the json file created from the train data
train_json_path_exp = "/content/train_data_Exp_GPT3.jsonl"


#path to store the json file created by GPT3
train_json_path_exp_gpt3 = "/content/train_data_Exp_GPT3_prepared.jsonl"


#number of test samples to use because of limit of 60/min API calls
samples_tested = 10


#Amount of training data used due to limitation on tokens (300000)
samples_trained = 500 #because there is a token limit


#Define goal for the GPT3
goal = "Generate descriptions in the restaurant domain from meaning representations."        

  • Create an input file and validate it with the OpenAI data preparation tool. If required, the tool will suggest changes, update the file, and save the updated copy (with a _prepared suffix) in the same location as the CWD.

#start processing the data for training
sentences = [] #contains the entire prompt


#fill the sentences in the input for prompt
#only the first samples_trained examples are used because of the token limit
for i in range(0, min(samples_trained, len(X_train))):
  temp_representation = re.sub(r'[^A-Za-z0-9 ]+', '', X_train[i]['meaning_representation'])
  temp_sent = 'Goal: ' + goal + ' \nParagraph: ' + temp_representation + '\n\n###\n\n'
  sentences.append(temp_sent)


#Now we will start preparing the data for training our GPT3 model
prompts = []
for i in range(0, len(sentences)):
    adder = {'prompt': sentences[i], 'completion': X_train[i]['human_reference']}
    prompts.append(adder)


#Dump the data to a JSONL file: one JSON object per line
with open(train_json_path_exp, 'w') as fp:
    for record in prompts:
        fp.write(json.dumps(record) + '\n')


#run the gpt3 command to convert the json file to gpt3 input data
!openai tools fine_tunes.prepare_data -f $train_json_path_exp
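
Because the training file is capped by token count, it can help to estimate its size before writing and uploading it. The sketch below is a rough check only: it assumes about 4 characters per token, a rule of thumb for English text rather than the real GPT3 tokenizer count, and both function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is a heuristic assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division


def take_within_budget(records, token_budget=300000):
    """Keep prompt/completion records, in order, until the estimated budget is used up."""
    kept, used = [], 0
    for record in records:
        cost = estimate_tokens(record["prompt"]) + estimate_tokens(record["completion"])
        if used + cost > token_budget:
            break
        kept.append(record)
        used += cost
    return kept, used


# Usage with the `prompts` list built above, before dumping it to disk:
# prompts, used = take_within_budget(prompts)
```

Since the estimate is approximate, leaving some headroom below the limit is probably safer than filling the budget exactly.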

  • Create fine-tuning file of the GPT3 model.

openai.File.create(
  file=open(train_json_path_exp_gpt3),
  purpose='fine-tune'
)        

  • Start fine-tuning of GPT3 using the file-id from the above cell's output.

openai.FineTune.create(training_file="file-9Nmq*****************")        

  • Now that fine-tuning has started, find the ID of the run in the output of the previous cell and use it in the cell below. This call shows the current state of the run. Once training is complete, it shows the model name and ID, which can then be used to make predictions with the model.

openai.FineTune.list_events(id="ft-bfq2PK7PYUqPPugd4mwK3AEv")        

  • Wait until the model name appears in the output of the previous command. Once fine-tuning is complete and the model is saved, use the model name in the cell below with the prompt you want the model to complete.

#incomplete_sentence is the input text for the prompt; define it before running this cell
s_prompt = 'Goal: ' + goal + ' \nParagraph: ' + incomplete_sentence + '\n\n###\n\n'
completion = openai.Completion.create(
    model="curie:ft-****************",
    prompt=s_prompt
)

The above command returns a JSON object that contains the completion of the input prompt. For the e2e_nlg dataset, the formatted input and output look something like this:

Input:  Goal: Generate descriptions in the restaurant domain from meaning representations.
Paragraph: name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Cafe Adriatic]

###


Output:   The Vaults pub near Café Adriatic has a 5 star rating.  Prices start at £30.        
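
To get from the returned JSON to the generated sentence, the text has to be pulled out of the response. A minimal sketch, assuming the response follows the Completions layout with a "choices" list whose items carry a "text" field; the helper name is hypothetical:

```python
def extract_completion_text(response):
    """Return the text of the first choice, stripped of surrounding whitespace."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("Response contains no choices.")
    return choices[0]["text"].strip()


# Usage with the `completion` object returned by the call above:
# print(extract_completion_text(completion))
```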


The entire source code can be found here: colab_notebook

Limitations

  • The free credit limit for any account is $18. This is sufficient if you are carrying out small-scale research or experimenting with GPT3. If you want to run a deep analysis of the model, I doubt $18 would be enough. As the output below shows, a fine-tune that would exceed the billing limit fails right away.

<OpenAIObject list at 0x7f602c2247d0> JSON: {
  "data": [
    {
      "created_at": 1639215448,
      "level": "info",
      "message": "Created fine-tune: ft-pphZx3fnoxMKtwPduAruS3du",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1639215454,
      "level": "error",
      "message": "Fine-tune failed. Fine-tune will exceed billing hard limit",
      "object": "fine-tune-event"
    }
  ],
  "object": "list"
}        
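
A failure like the one above only shows up as an event with level "error", so it is easy to miss in a longer event list. A minimal sketch that pulls those messages out, assuming the event list has the same shape as the output shown here (the helper name is hypothetical):

```python
def error_messages(events):
    """Return the messages of all events whose level is 'error'."""
    return [e["message"] for e in events.get("data", []) if e.get("level") == "error"]


# Sample copied from the output above.
sample_events = {
    "data": [
        {"created_at": 1639215448, "level": "info",
         "message": "Created fine-tune: ft-pphZx3fnoxMKtwPduAruS3du",
         "object": "fine-tune-event"},
        {"created_at": 1639215454, "level": "error",
         "message": "Fine-tune failed. Fine-tune will exceed billing hard limit",
         "object": "fine-tune-event"},
    ],
    "object": "list",
}

print(error_messages(sample_events))
```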

  • One of the major limitations I ran into is that I could only test one sample per call. As you can see below, I passed the prompt as a single string rather than a list or numpy array of inputs, so testing multiple samples means calling the OpenAI API in a loop, one request per query.

openai.Completion.create(

    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT
)
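
Calling the API once per sample can be wrapped in a loop that spaces the requests out. The sketch below is one way to do it under stated assumptions: complete_fn is a hypothetical stand-in for whatever function sends a single prompt (for example a thin wrapper around the call above), test_prompts is a hypothetical list of prompt strings, and 60 calls per minute is the limit mentioned in this post.

```python
import time


def complete_all(prompts, complete_fn, max_calls_per_minute=60, sleep_fn=time.sleep):
    """Send prompts one at a time, pausing between calls so the average rate stays within the cap."""
    min_interval = 60.0 / max_calls_per_minute
    results = []
    for i, prompt in enumerate(prompts):
        if i > 0:
            sleep_fn(min_interval)  # wait between consecutive calls
        results.append(complete_fn(prompt))
    return results


# Usage (FINE_TUNED_MODEL as in the cell above; test_prompts is hypothetical):
# outputs = complete_all(
#     test_prompts,
#     lambda p: openai.Completion.create(model=FINE_TUNED_MODEL, prompt=p),
# )
```

Passing sleep_fn as a parameter keeps the helper easy to try out without actually waiting.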
         

  • The number of Completion calls is limited to 60 per minute. Hence even in a loop, we cannot call the API at a very high rate. During my experiment, I had to limit the loop to 10 calls.
  • Every model is fine-tuned for 4 epochs by default. So far, I could not find anything in the documentation on how to set the number of epochs when fine-tuning the model.

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>        

  • You can only fine-tune one model per API key at a time. All other models are queued and will run sequentially once the current run is complete. If you want to fine-tune multiple models, you should create a new API key and run the second model with the new key.

<OpenAIObject list at 0x7f602d483950> JSON: {
  "data": [
    {
      "created_at": 1639216088,
      "level": "info",
      "message": "Created fine-tune: ft-bfq2PK7PYUqPPugd4mwK3AEv",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1639216094,
      "level": "info",
      "message": "Fine-tune costs $32.95",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1639216094,
      "level": "info",
      "message": "Fine-tune enqueued. Queue number: 0",
      "object": "fine-tune-event"
    },
    {
      "created_at": 1639216098,
      "level": "info",
      "message": "Fine-tune started",
      "object": "fine-tune-event"
    }
  ],
  "object": "list"
}        

  • The user has to save the ID of the fine-tuned model and check back once the model is ready after fine-tuning. This is not a major limitation since it is kind of necessary to save the ID to identify each run uniquely.
  • The number of tokens in the training dataset is limited to 300,000 for now. The model training throws an error if more tokens are present in the training file than the allowed limit.
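
The "save the ID and check back" step can be automated with a small polling loop. A minimal sketch, assuming a hypothetical get_status callable that returns the run's current status string for a given ID, and assuming the terminal statuses are named "succeeded", "failed", and "cancelled" (worth checking against the API's actual responses):

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}  # assumed status names


def wait_for_fine_tune(run_id, get_status, poll_seconds=60, max_polls=120, sleep_fn=time.sleep):
    """Poll until the run reaches a terminal status or max_polls is exhausted."""
    for attempt in range(max_polls):
        status = get_status(run_id)
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep_fn(poll_seconds)
    raise TimeoutError(f"Fine-tune {run_id} did not finish after {max_polls} polls.")


# Usage (assumes the retrieve call exposes a "status" field):
# status = wait_for_fine_tune(
#     "ft-bfq2PK7PYUqPPugd4mwK3AEv",
#     lambda run_id: openai.FineTune.retrieve(id=run_id)["status"],
# )
```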

References

  1. https://huggingface.co/datasets/e2e_nlg
  2. https://beta.openai.com/docs/guides/fine-tuning


Thank you for reading....

Let me know what you think...
