Time to conduct 100 iterations: about 60 seconds.
Lumina - makers of Analytica
Lumina developed the Analytica software, a quantitative decision support modeling environment.
If you are like most people, you have probably had many conversations about ChatGPT in which you made claims about questions that GPT answers successfully or unsuccessfully. The evidence for those claims was probably anecdotal, based on only one or two tries. After all, it would take a lot of effort to repeat a question 100 times, in 100 different chat sessions, from the ChatGPT interface.
Using the Analytica OpenAI API library, it is trivially easy to run a query 100 (or more) times with GPT-4, GPT-3.5-turbo, or GPT-3.5-instruct, and measure how frequently it succeeds or fails.
Last week's AI Series post did this and provided hard statistics on how frequently GPT-4 succeeded or failed on specific questions. Moreover, getting all 100 answers takes roughly the same amount of time as getting a single answer.
All you have to do is:
1. Add the OpenAI API library to your Analytica model (see OpenAI API library reference).
2. Create an index, Repeat, defined as 1..100
3. Create a variable to hold the responses, and define it as
Prompt_completion( prompt, Completion_index:Repeat, modelName: 'gpt-4' )
where "prompt" is the question.
That's all it takes.
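For readers who don't use Analytica, the same repeat-the-query workflow can be sketched in Python. The `complete` callable below is a hypothetical stand-in for whatever model call you use (e.g. a wrapper around an LLM API); the stub here just returns a fixed answer so the sketch is self-contained.

```python
def repeat_prompt(complete, prompt, n=100):
    """Call the model once per iteration and collect every response.

    `complete` is any callable mapping a prompt string to a response
    string -- in practice, a wrapper around your LLM API of choice.
    """
    return [complete(prompt) for _ in range(n)]

# Stand-in completion function for illustration:
responses = repeat_prompt(lambda p: "42", "What is 6 * 7?", n=100)
print(len(responses))  # → 100
```

In a real run you would replace the lambda with an actual API call; the collection logic is unchanged.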
Total time to conduct 100 iterations, including carrying out the above steps AND running it: about 60 seconds.
If eye-balling the responses is not enough, you can then write an Analytica expression to score correctness.
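Scoring is just a predicate applied across the collected responses. A minimal Python sketch, assuming the responses are already in a list:

```python
def score_correctness(responses, is_correct):
    """Return the fraction of responses the predicate judges correct."""
    hits = sum(1 for r in responses if is_correct(r))
    return hits / len(responses)

# Illustrative responses and a simple substring check:
responses = ["42", "42", "forty-two", "41"]
rate = score_correctness(responses, lambda r: "42" in r)
print(f"{rate:.0%}")  # → 50%
```

The predicate can be as simple or as strict as the question demands (exact match, regex, numeric tolerance, and so on).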
In more complex cases, such as when the responses are a free form format and it is harder to code an expression to score correctness, you can use a second call to Prompt_completion(...) to have the language model itself determine whether each answer is correct.
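The model-as-grader idea can be sketched the same way: build a grading prompt around each free-form answer and ask for a yes/no verdict. Again, `complete` is a hypothetical stand-in for the second model call; the stub judge below just checks for a substring so the example runs on its own.

```python
def llm_grade(complete, question, answer):
    """Ask the model itself to judge an answer; True means 'correct'.

    `complete` stands in for a second model call, analogous to the
    second Prompt_completion(...) call described above.
    """
    verdict = complete(
        f"Question: {question}\nAnswer: {answer}\n"
        "Is this answer correct? Reply yes or no."
    )
    return verdict.strip().lower().startswith("yes")

# Stub judge for illustration -- accepts any answer mentioning '42':
judge = lambda p: "yes" if "42" in p else "no"
print(llm_grade(judge, "What is 6 * 7?", "It is 42."))  # → True
```

Running `llm_grade` over all 100 responses and averaging the results gives the success rate without hand-writing a scoring expression.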
Next time, instead of making an anecdotal claim, you'll be ready to report: "GPT solves this problem 87% of the time".