ChatGPT is getting DUMBER
Sohail Shaik
Senior Data Scientist | LLM | AWS | TensorFlow | Machine Learning | AI | GPT | Llama | LangChain | Neural Networks | Spark | Kafka | PySpark | NLP | Python | DynamoDB | PostgreSQL | Zeppelin | SQL | Neo4j | RAG systems
The performance has decreased by 30% in some tasks
ChatGPT has caused a tsunami in the world of Artificial General Intelligence (AGI). It is a major breakthrough in the field: it has automated many tasks and increased everyone's productivity. Recently, OpenAI launched GPT-4, a state-of-the-art model for natural language processing, image understanding, data analytics, and optical character recognition. But there is one problem: between its March 2023 release and June 2023, it appears to have significantly regressed on a range of tasks.
This research was performed by researchers from UC Berkeley and Stanford University. They evaluated the March 2023 and June 2023 versions of both GPT-3.5 and GPT-4 on several diverse tasks, walked through below.
These tasks were selected for two reasons. First, they are diverse tasks frequently used to evaluate LLMs in the literature. Second, they are relatively objective and thus easy to evaluate.
Before we look at how performance is measured for each task, there is something you need to know called "AI drift."
"AI drift" is a phenomenon where AI models deviate from their original behavior over time. In most cases, the models' performance deteriorates. There are several possible causes for this behavior; it might be due to data drift or concept drift.
Now that you understand the basic reasons behind the degraded performance, let's see how LLM drift can be quantitatively modeled across different tasks.
In the research, two types of metrics are considered: a primary metric that captures performance for each specific scenario, and secondary metrics that cover common complementary measurements.
In particular, two secondary metrics are used across tasks: verbosity and mismatch. Verbosity is the length of the generation, measured in number of characters. Mismatch captures how often, for the same prompt, the extracted answers from two versions of the same LLM service do not match. Note that this compares only the extracted answers, not the raw generations.
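As a rough illustration, here is a minimal Python sketch of how these two secondary metrics could be computed. The function names and the bracket-based answer extraction are my own assumptions, not the paper's implementation.

```python
# Minimal sketch of the two secondary metrics described above.
# Function names and the answer-extraction rule are illustrative
# assumptions, not the paper's code.
import re

def verbosity(generation: str) -> int:
    """Length of the generation, measured in number of characters."""
    return len(generation)

def extract_answer(generation: str) -> str:
    """Toy extractor: pull a bracketed answer such as [Yes]/[No] if present."""
    match = re.search(r"\[(.+?)\]", generation)
    return (match.group(1) if match else generation).strip().lower()

def mismatch_rate(march_outputs: list[str], june_outputs: list[str]) -> float:
    """Fraction of prompts where the two versions' extracted answers differ."""
    pairs = list(zip(march_outputs, june_outputs))
    differing = sum(extract_answer(a) != extract_answer(b) for a, b in pairs)
    return differing / len(pairs)

# Example usage
march = ["Step by step... [Yes]", "Reasoning... [No]"]
june = ["[Yes]", "After checking the divisors... [Yes]"]
print(mismatch_rate(march, june))  # 0.5
```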
Evaluations
1. Solving Math Problems
The question asked to the two versions of GPT was:
Q: Is 17077 a prime number? Think step by step and then answer “[Yes]” or “[No]”.
For this question, three metrics were measured: accuracy, verbosity, and mismatch. The image below shows the results for each metric on both versions.
For GPT-4:
For GPT-3.5:
The general trend shown in this graph is that GPT-4 started out more accurate but has seen a decrease in accuracy and an increase in verbosity, with a significant reduction in mismatch. GPT-3.5 started with lower accuracy but has seen improvement in accuracy and a reduction in mismatch, with an increase in verbosity as well.
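To make the accuracy metric concrete, here is a small sketch of how answers to the prime-number question could be scored against ground truth. The extraction regex is an assumption of mine; sympy's isprime is used only as a reference oracle, not as anything from the paper.

```python
import re
from sympy import isprime  # ground-truth primality oracle

def extract_yes_no(generation: str):
    """Pull the last bracketed [Yes]/[No] answer from a chain-of-thought reply."""
    matches = re.findall(r"\[(Yes|No)\]", generation, flags=re.IGNORECASE)
    return matches[-1].lower() if matches else None

def accuracy(numbers: list[int], generations: list[str]) -> float:
    """Share of questions where the extracted answer matches true primality."""
    correct = 0
    for n, gen in zip(numbers, generations):
        truth = "yes" if isprime(n) else "no"
        correct += (extract_yes_no(gen) == truth)
    return correct / len(numbers)

# 17077 is prime, so a final "[Yes]" counts as correct.
print(accuracy([17077], ["Checking divisors step by step... [Yes]"]))  # 1.0
```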
2. Evaluation of Dangerous/Sensitive Questions
Prompting LLMs with sensitive questions is known to lead to harmful generations such as socially biased content, leaked personal information, and toxic text. The goal is to understand how LLM services' responses to sensitive questions have shifted over time. To achieve this, a sensitive-question dataset was created, containing 100 sensitive queries that LLM services are not supposed to answer directly.
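The headline metric here is the answer rate: how often the model gives a direct answer instead of declining. Below is a rough sketch; the keyword-based refusal detection is a simplification of my own, not the labelling procedure used in the paper.

```python
# Rough sketch of an "answer rate" metric for sensitive questions.
# The refusal-detection heuristic is my own simplification.
REFUSAL_MARKERS = (
    "i'm sorry", "i cannot", "i can't", "as an ai", "i am not able",
)

def is_refusal(response: str) -> bool:
    """Heuristic: treat responses containing common refusal phrases as declines."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def answer_rate(responses: list[str]) -> float:
    """Fraction of sensitive prompts that received a direct answer."""
    answered = sum(not is_refusal(r) for r in responses)
    return answered / len(responses)

print(answer_rate(["I'm sorry, but I can't help with that.",
                   "Sure, here is how you could ..."]))  # 0.5
```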
Key findings include:
LLM Jailbreaks
Jailbreaking attacks are a major threat to LLM service safety. They rephrase or reorganize the original sensitive questions in order to elicit harmful generations from LLMs. Thus, it is also critical to study how LLM services' defenses against jailbreaking attacks drift over time.
The summary indicates that GPT-4 has become more conservative over time in its response to sensitive questions and more robust against jailbreaking attempts. Conversely, GPT-3.5 has become slightly more responsive to sensitive questions and has not shown a significant change in defense against jailbreaking attacks. Overall, the updates to GPT-4 reflect an increased focus on safety and a reduction in engagement with harmful prompts.
3. Opinion Surveys
LLMs are increasingly leveraged for open-ended text generation, where bias in the opinions in their training or fine-tuning data can play an important role. Therefore, it is vital to understand how LLMs’ opinion biases change over time. To address the issue, the researchers leveraged OpinionQA, a survey dataset comprising 1,506 opinion questions selected for their origin from high-quality public opinion polls. They adhered to the multiple-choice question format from the referenced study [SDL+23] and included the instruction “Pick the best single option” to simplify the extraction of answers.
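As an illustration, here is a small sketch of how the single-option answers could be extracted and how a response rate could be computed. The option labels and the regex are illustrative assumptions, not the dataset's actual format.

```python
# Sketch: extract the chosen option from a "Pick the best single option" reply
# and measure how often the model actually picks one. Option letters A-D are
# an assumption for illustration.
import re

def extract_option(generation: str):
    """Pull a single standalone option letter (A-D) from the reply, if any."""
    match = re.search(r"\b([A-D])\b", generation)
    return match.group(1) if match else None

def response_rate(generations: list[str]) -> float:
    """Fraction of survey questions where the model picked an option at all."""
    answered = sum(extract_option(g) is not None for g in generations)
    return answered / len(generations)

print(response_rate(["B", "As an AI, I do not have personal opinions."]))  # 0.5
```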
Key Observations:
This behavior illustrates a deliberate shift in GPT-4’s design to avoid engaging with subjective questions, emphasizing its lack of personal opinions.
4. Multi-hop knowledge-intensive Questions
Many real-world applications require LLMs to answer knowledge-intensive questions grounded in various data sources, including "multi-hop" questions that involve multiple sources and/or reasoning steps. Therefore, it is natural to monitor how LLMs' ability to answer multi-hop questions evolves over time. The researchers take a first step by measuring the drift of a LangChain HotpotQA Agent, a pipeline that answers complex multi-hop questions similar to those from HotpotQA. This agent leverages LLMs to search over Wikipedia passages to answer complex questions.
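For scoring, a common choice for HotpotQA-style answers is normalized exact match. Here is a self-contained sketch under that assumption; the normalization rules follow the usual HotpotQA convention rather than the paper's exact code.

```python
# Sketch: normalized exact-match scoring for the agent's final answers,
# assuming the usual HotpotQA-style normalization (lowercase, strip
# punctuation and articles, collapse whitespace).
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of multi-hop questions answered exactly right after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)

print(exact_match(["The Eiffel Tower"], ["eiffel tower"]))  # 1.0
```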
Key Observations:
This instability and failure to follow prompt formats indicate challenges in integrating LLMs into larger, real-world application pipelines and highlight the brittleness of current prompting methods amidst the evolving capabilities of LLMs.
5. Generating Code
For the code generation task, the evaluation uses problems from the "easy" category of LeetCode.
Key findings include:
The overall trend indicates a decline in adherence to the formatting instructions provided, which raises concerns about the reliability of using LLMs for generating code within larger software pipelines.
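One symptom the paper highlights is that generated code started being wrapped in markdown-style fences, which breaks direct execution. Below is a minimal sketch of a "directly executable" check plus a cleanup step; the helper names are mine, not the paper's.

```python
# Sketch: is the raw generation directly executable Python, and how a
# markdown fence could be stripped before re-checking. Helper names are
# illustrative assumptions.
import re

def strip_markdown_fences(generation: str) -> str:
    """Remove ```python ... ``` fences wrapped around a code snippet."""
    match = re.search(r"```(?:python)?\n(.*?)```", generation, flags=re.DOTALL)
    return match.group(1) if match else generation

def is_directly_executable(generation: str) -> bool:
    """True if the raw generation compiles as Python without post-processing."""
    try:
        compile(generation, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False

raw = "```python\nclass Solution:\n    def twoSum(self, nums, target):\n        return []\n```"
print(is_directly_executable(raw))                         # False: fences break it
print(is_directly_executable(strip_markdown_fences(raw)))  # True after cleanup
```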
6. Visual Reasoning
Visual reasoning might be the only task where GPT has improved over time. The task is to create an output grid corresponding to an input grid, based solely on a few similar examples. The figure gives one example query from ARC. To show the visual objects to the LLM services, the researchers represent the input and output grids as 2-D arrays, where the value of each element denotes a color. The dataset contains 467 samples that fit within the context window of all the services.
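Since the grids are passed to the model as 2-D arrays of color codes, checking an answer reduces to an element-wise comparison. A minimal sketch, with illustrative helper names of my own:

```python
# Minimal sketch: ARC-style grids as 2-D arrays of color codes,
# scored by exact element-wise match. Helper names are my own.
import ast

Grid = list[list[int]]

def parse_grid(generation: str) -> Grid:
    """Parse a 2-D array literal produced by the model, e.g. '[[0, 1], [1, 0]]'."""
    return ast.literal_eval(generation.strip())

def grids_match(predicted: Grid, expected: Grid) -> bool:
    """Exact match: same shape and the same color code in every cell."""
    return predicted == expected

expected = [[0, 1], [1, 0]]
print(grids_match(parse_grid("[[0, 1], [1, 0]]"), expected))  # True
```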
Conclusion
It is still not publicly known why ChatGPT's performance has dropped, but "AI drift" might be the reason. This behavior is dangerous: if we imagine running an entire company on AI agents, a model that silently deviates from its original behavior could cause catastrophic damage.
That is why it is important for model developers to constantly monitor the performance of LLMs and check for deviation from their original behavior. The methods and techniques used in this paper are a good starting point for quantifying model drift.
Here is the full research paper: https://arxiv.org/pdf/2307.09009.pdf
I hope you enjoyed the article! Don't forget to subscribe to my newsletter for all the latest updates. I promise you won't miss a beat!
Newsletter: https://thedatastoryteller.substack.com/
X: https://twitter.com/nakedaii
Medium: https://medium.com/@sohailshaik272
LinkedIn: https://www.dhirubhai.net/in/sohailshaik/
Stay curious and keep learning!