ChatGPT is getting DUMBER

Performance has decreased by 30% on some tasks

ChatGPT has caused a tsunami in the world of Artificial Intelligence (AI). It is a major breakthrough that has automated many tasks and boosted everyone's productivity. Recently, OpenAI launched GPT-4, a state-of-the-art model for natural language processing, image understanding, data analytics, and optical character recognition. But there is one problem: between its release in March 2023 and now, it appears to have significantly underperformed on a wide range of tasks.

This research was performed by researchers at UC Berkeley and Stanford University. They evaluated the March 2023 and June 2023 versions of both GPT-3.5 and GPT-4 on several diverse tasks:

  • Math Problems
  • Sensitive/Dangerous Questions
  • Opinion surveys
  • Multi-hop knowledge-intensive Questions
  • Generating code
  • Visual Reasoning

These tasks were selected for two reasons. First, they are diverse tasks frequently used to evaluate LLMs in the literature. Second, they are relatively objective and thus easy to evaluate.

Before we look at how performance is measured on each task, there is something you need to know about, called "AI drift."

"AI drift" is a phenomenon in which AI models deviate from their original behavior over time. In most cases, performance deteriorates. There are several possible causes, the two most common being data drift and concept drift; a minimal drift-check sketch follows the two definitions below.

  • "Data Drift": The data used to train the model differs from the data it sees in the real world. Most of these models are further tuned with RLHF (Reinforcement Learning from Human Feedback), and when real-world inputs look too different from the training data, performance degrades.
  • "Concept Drift": The statistical properties of the target variable the model is trying to predict change over time, for example due to changes in customer behavior, economic factors, or other external influences.
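
Neither kind of drift is specific to LLMs, so the usual monitoring checks from classical machine learning still apply. Below is a minimal, hypothetical sketch of a data-drift check that compares the distribution of one input feature (prompt length, chosen purely for illustration) between a reference window and live traffic using a two-sample Kolmogorov-Smirnov test; the feature, threshold, and numbers are my assumptions, not something from the paper.

```python
# Minimal data-drift check: compare a feature's distribution between a
# reference window (training-time data) and a live window (production data).
# Assumptions: prompt length is the monitored feature and 0.05 is the
# significance threshold; both are illustrative choices.
from scipy.stats import ks_2samp


def detect_data_drift(reference_values, live_values, alpha=0.05):
    """Return (drifted, statistic, p_value) for a two-sample KS test."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return p_value < alpha, statistic, p_value


# Toy example: prompt lengths (characters) seen at training time vs. in production.
reference_prompt_lengths = [120, 95, 210, 180, 140, 160, 130, 175]
live_prompt_lengths = [450, 520, 610, 480, 555, 590, 505, 570]

drifted, stat, p = detect_data_drift(reference_prompt_lengths, live_prompt_lengths)
print(f"drift detected: {drifted} (KS statistic={stat:.3f}, p-value={p:.4f})")
```

A concept-drift check works the same way, except that you track the relationship between inputs and the target (for example, a rolling accuracy or error rate) rather than the inputs alone.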

Now that you understand the basic reasons behind the worse performance, let's see how LLM drift can be quantified across different tasks.

In the research, the authors considered two types of metrics: a primary metric that captures performance in each specific scenario, and secondary metrics that cover common complementary measurements.

In particular,

  • Accuracy (how often an LLM service generates the correct answer) is the main metric for math problems and USMLE questions.
  • Response rate, i.e. the frequency with which an LLM service directly answers a question, is the main metric for sensitive questions and opinion surveys.
  • For code generation, the main metric is the fraction of outputs that are directly executable (i.e. the code runs in a programming environment and passes the unit tests).
  • For visual reasoning and LangChain, it is exact match (whether the final response exactly matches the ground truth).

There are two additional secondary metrics: "verbosity" and "mismatch." Verbosity is the length of a generation, measured in number of characters. Mismatch is how often, for the same prompt, the extracted answers from two versions of the same LLM service do not match. Note that this compares only the extracted answers, not the raw generations.
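
To make the two secondary metrics concrete, here is a tiny sketch of how verbosity and mismatch could be computed once answers have been extracted from each version's raw generations; the example generations and answers below are made up for illustration.

```python
# Verbosity: mean generation length in characters.
# Mismatch: fraction of prompts where the two versions' extracted answers differ.
def verbosity(generations):
    return sum(len(g) for g in generations) / len(generations)


def mismatch_rate(answers_v1, answers_v2):
    assert len(answers_v1) == len(answers_v2)
    return sum(a != b for a, b in zip(answers_v1, answers_v2)) / len(answers_v1)


# Made-up generations and extracted answers for the same three prompts.
march_generations = ["Step 1: divide by 3... [Yes]", "Step 1: ... [No]", "Step 1: ... [Yes]"]
june_generations = ["[No]", "[No]", "[Yes]"]
march_answers = ["Yes", "No", "Yes"]
june_answers = ["No", "No", "Yes"]

print(f"verbosity (March): {verbosity(march_generations):.1f} chars")
print(f"verbosity (June):  {verbosity(june_generations):.1f} chars")
print(f"mismatch rate:     {mismatch_rate(march_answers, june_answers):.0%}")
```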


Evaluations


  1. Evaluation of Math Problems

The question asked to both versions of GPT was:

Q: Is 17077 a prime number? Think step by step and then answer “[Yes]” or “[No]”.

Prompt for Math Problem
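
To give a sense of how such an evaluation can be run, here is a rough sketch that sends the same prompt to the dated March and June GPT-4 snapshots via the OpenAI Python client and extracts the "[Yes]"/"[No]" answer. The snapshot names, zero temperature, and the simple regex extraction are my assumptions about a reasonable setup, not the paper's exact harness.

```python
# Query two dated GPT-4 snapshots with the same prompt and extract the answer.
# Assumptions: the OpenAI Python client (>= 1.0), the public dated snapshots
# "gpt-4-0314" (March 2023) and "gpt-4-0613" (June 2023), and temperature 0
# to reduce randomness. The regex-based extraction is a simplification.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = 'Is 17077 a prime number? Think step by step and then answer "[Yes]" or "[No]".'


def ask(model_name):
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\[(Yes|No)\]", text)
    answer = match.group(1) if match else "unparsed"
    return answer, text


for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    answer, raw = ask(snapshot)
    print(f"{snapshot}: answer={answer}, verbosity={len(raw)} chars")
```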

For this question, three metrics were measured: accuracy, verbosity, and mismatch. The figure below shows the results of each metric for both versions.

Metrics for the problem

For GPT-4:

  • Accuracy: There is a significant drop from 84.0% in March 2023 to 51.1% in June 2023.
  • Verbosity: This metric dropped sharply, from 638.3 characters in March to just a few characters in June, because the June version stopped following the chain-of-thought instruction and mostly answered "[No]" directly.
  • Mismatch: For 62.6% of the questions, the answer extracted from the June version did not match the answer from the March version, confirming a substantial behavioral shift.

For GPT-3.5:

  • Accuracy: Accuracy has increased from 49.6% in March 2023 to 76.2% in June 2023.
  • Verbosity: Verbosity has also increased, from 730.4 in March to 891.2 in June, which indicates that GPT-3.5 is also generating longer responses over time.
  • Mismatch: The answers extracted from the March and June versions differed on 59.9% of the questions, again indicating a large behavioral shift.

The general trend is that GPT-4 started out more accurate but became markedly less accurate and far terser, while GPT-3.5 started with lower accuracy and improved, at the cost of longer responses. For both models, the March and June versions disagreed on a majority of the questions.


2. Evaluation of Dangerous/Sensitive Questions

Prompting LLMs with sensitive questions is known to lead to harmful generations such as social biases, personal information, and toxic texts. The goal is to understand how LLM services’ responses to sensitive questions have shifted over time. To achieve this goal, a sensitive question dataset was created, which contains 100 sensitive queries that LLM services are not supposed to answer directly.

Prompt for Sensitive Questions


Key findings include:

  • GPT-4's response rate on sensitive questions decreased from 21% in March to 5% in June, likely due to a stronger safety layer implemented in updates, making it less likely to answer sensitive questions directly.
  • In contrast, GPT-3.5's response rate increased from 2% to 8% over the same period, suggesting it became less conservative.
  • GPT-4’s generation length dropped significantly, indicating it provided shorter responses over time, becoming terser with fewer explanations.
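
The "response rate" behind these numbers is simply the fraction of sensitive prompts the model answers directly instead of refusing. The sketch below uses a crude keyword-based refusal detector to compute it; real evaluations label refusals far more carefully, and the marker phrases and example generations here are illustrative assumptions.

```python
# Response rate: fraction of sensitive prompts the model answers directly.
# Assumption: a crude keyword-based refusal detector; the study's labeling of
# "direct answer" vs. refusal is more careful than this.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "as an ai")


def is_refusal(generation):
    text = generation.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def response_rate(generations):
    answered = sum(not is_refusal(g) for g in generations)
    return answered / len(generations)


# Made-up generations for three sensitive prompts.
generations_june = [
    "I'm sorry, but I can't help with that.",
    "As an AI language model, I cannot provide that information.",
    "Sure, here is an explanation of ...",
]
print(f"response rate: {response_rate(generations_june):.0%}")
```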

LLM Jailbreaks

Jailbreaking attacks are a major threat to LLM service safety. They rephrase or reorganize the original sensitive questions in order to elicit harmful generations from LLMs. Thus, it is also critical to study how LLM services' defenses against jailbreaking attacks drift over time.

The summary indicates that GPT-4 has become more conservative over time in its response to sensitive questions and more robust against jailbreaking attempts. Conversely, GPT-3.5 has become slightly more responsive to sensitive questions and has not shown a significant change in defense against jailbreaking attacks. Overall, the updates to GPT-4 reflect an increased focus on safety and a reduction in engagement with harmful prompts.


3. Opinion Surveys

LLMs are increasingly leveraged for open-ended text generation, where bias in the opinions present in their training or fine-tuning data can play an important role. Therefore, it is vital to understand how LLMs' opinion biases change over time. To address this, the researchers leveraged OpinionQA, a survey dataset comprising 1,506 opinion questions drawn from high-quality public opinion polls. They adhered to the multiple-choice question format from the referenced study [SDL+23] and included the instruction "Pick the best single option" to simplify the extraction of answers.

Opinion Survey
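
Because each prompt asks the model to "Pick the best single option", answer extraction can be a simple pattern match over the option labels, and opinion drift can then be measured as disagreement between the two versions on the questions both of them answered. The sketch below is a hypothetical extractor; the option format and example generations are assumptions for illustration.

```python
# Extract the chosen option (A, B, C, ...) from a generation and measure how
# often the two model versions pick different options on questions both answer.
# Assumption: options are labeled with single capital letters such as "(A)".
import re


def extract_option(generation):
    """Return the chosen option letter, or None for a refusal / unparsable output."""
    match = re.search(r"\(([A-F])\)", generation)
    return match.group(1) if match else None


def opinion_drift(answers_march, answers_june):
    pairs = [(a, b) for a, b in zip(answers_march, answers_june) if a and b]
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0


march = [extract_option(g) for g in ["(A) Very important", "(B) Somewhat important", "(A)"]]
june = [extract_option(g) for g in ["As an AI, I do not have personal opinions.", "(B)", "(C)"]]

print(f"response rate (June): {sum(a is not None for a in june) / len(june):.0%}")
print(f"opinion drift on questions both versions answered: {opinion_drift(march, june):.0%}")
```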

Key Observations:

  • GPT-4’s willingness to respond to opinion questions dropped dramatically from 97.6% in March to 22.1% in June.
  • GPT-3.5 answered almost all the questions in both March and June, but about 27% of the opinions it provided changed between the two versions.
  • Opinion drift over time was observed, exceeding the expected variance due to the models’ inherent randomness.
  • GPT-4 in March provided an opinion on the future importance of the U.S. in the world; by June, it refused to answer, stating the question was subjective.

This behavior illustrates a deliberate shift in GPT-4’s design to avoid engaging with subjective questions, emphasizing its lack of personal opinions.


4. Multi-hop knowledge-intensive Questions

Many real-world applications require LLMs to answer knowledge-intensive questions grounded in various data sources, including "multi-hop" questions that involve multiple sources and/or reasoning steps. Therefore, it is natural to monitor how LLMs' ability to answer multi-hop questions evolves over time. The researchers took a first step by measuring the drift of a LangChain HotpotQA Agent, a pipeline for answering complex multi-hop questions similar to those in HotpotQA. This agent uses LLMs to search over Wikipedia passages to answer complex questions.

Multihop knowledge with LangChain
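
As a rough illustration of the pipeline, the sketch below wires a chat model to a Wikipedia search tool using the classic LangChain ReAct agent interface. This is my assumption about how such an agent can be assembled (newer LangChain releases have since replaced this API), not the paper's exact configuration, and the example question is just a typical HotpotQA-style query.

```python
# A ReAct-style agent that answers multi-hop questions by searching Wikipedia.
# Assumptions: the classic LangChain agent API (pre-0.1), the dated
# "gpt-4-0613" snapshot, and the `wikipedia` package installed for the tool.
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4-0613", temperature=0)
tools = load_tools(["wikipedia"])

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,  # the drift study's failures were largely parsing errors
)

question = "Which magazine was started first, Arthur's Magazine or First for Women?"
print(agent.run(question))
```

If the model stops emitting the "Thought / Action / Final Answer" format the agent expects, the run fails with a parsing error, which is exactly the kind of breakage the study attributes to prompt instability.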

Key Observations:

  • There were significant drifts in the performance of both GPT-4 and GPT-3.5 over time when using the LangChain HotpotQA Agent.
  • GPT-4’s exact match rate improved from 1.2% in March to 37.8% in June, while GPT-3.5’s performance decreased by about 9% from March to June.
  • The changes in performance are attributed largely to poor prompt stability: the models failed to produce answers in the specific format required by the LangChain agent, leading to parsing errors.

This instability and failure to follow prompt formats indicate challenges in integrating LLMs into larger, real-world application pipelines and highlight the brittleness of current prompting methods amidst the evolving capabilities of LLMs.


5. Generating Code

For the code generation task, the evaluation uses problems specifically from the "easy" category on LeetCode.

Generating Code

Key findings include:

  • There was a significant drop in the number of directly executable code generations by both GPT-4 and GPT-3.5 from March to June, with GPT-4’s executable code dropping from over 50% to 10%.
  • Both models showed a slight increase in verbosity in their code generations.
  • The decline in directly executable code is partly attributed to the models adding non-code text to their outputs, such as markdown code fences ("```python") and additional comments, which made the raw output non-executable; a minimal post-processing sketch is shown at the end of this section.
  • Despite the formatting issues, the correctness of the code itself improved after removing non-code text, as evaluated by the LeetCode online judge.

The overall trend indicates a decline in adherence to the formatting instructions provided, which raises concerns about the reliability of using LLMs for generating code within larger software pipelines.
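
One concrete mitigation for the formatting regression described above is to strip the markdown fences and surrounding prose before executing the output. The sketch below shows a minimal post-processor plus a syntax-level executability check; it is an illustrative fix and stops at compiling the code, whereas the study submits the cleaned code to the LeetCode online judge and its unit tests.

```python
# Strip markdown code fences from a model's output and check whether the
# remaining text at least compiles as Python.
import re

FENCE = "`" * 3  # three backticks, built this way to avoid nesting literal fences


def strip_code_fences(generation):
    """If the output contains a fenced block, keep only its body; otherwise return it unchanged."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    fenced = re.search(pattern, generation, flags=re.DOTALL)
    return (fenced.group(1) if fenced else generation).strip()


def compiles(code):
    """Syntax-level check: does the cleaned text parse as Python at all?"""
    try:
        compile(code, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False


# Made-up model output: prose plus a fenced solution, which fails to compile as-is.
raw_output = (
    "Here is the solution:\n"
    f"{FENCE}python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    f"{FENCE}\n"
    "Hope this helps!"
)

print("raw output compiles:     ", compiles(raw_output))                     # False
print("stripped output compiles:", compiles(strip_code_fences(raw_output)))  # True
```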


6. Visual Reasoning

Visual reasoning might be the only task where GPT has improved over time. The task is to create an output grid corresponding to an input grid, based solely on a few similar examples. The figure shows one example query from the ARC dataset. To present the visual objects to the LLM services, the input and output grids are represented as 2-D arrays, where the value of each element denotes a color. The evaluation uses 467 ARC samples that fit within all the services' context windows. In the study, both GPT-4 and GPT-3.5 showed marginal improvements in exact-match rate from March to June.

Visual Reasoning using GPT
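
To make the grid encoding concrete, here is a tiny, invented ARC-style example represented as 2-D arrays of color indices, together with the exact-match check used as the metric. The grids and the stand-in "prediction" rule are made up for illustration and are not taken from the ARC dataset.

```python
# ARC-style visual reasoning: grids are 2-D arrays whose values are color indices.
# The metric is exact match: the predicted output grid must equal the ground truth.
example_input = [
    [0, 0, 3],
    [0, 3, 0],
    [3, 0, 0],
]
ground_truth_output = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
]


def mirror_left_right(grid):
    """Stand-in for the LLM's prediction: this toy task's rule is a horizontal flip."""
    return [list(reversed(row)) for row in grid]


def exact_match(predicted, expected):
    return predicted == expected


predicted_output = mirror_left_right(example_input)
print("exact match:", exact_match(predicted_output, ground_truth_output))  # True
```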


Conclusion

It is still not publicly known why ChatGPT's performance has dropped, but "AI drift" might be the reason. This behavior is extremely dangerous: if we plan to run entire companies on AI agents, and a model quietly stops behaving the way it originally did, imagine the catastrophic damage that could cause.

That is why it is important for model developers to constantly monitor LLM performance and check for deviation from the original behavior. The methods and techniques presented in this paper are a good starting point for quantifying model drift.

Here is the full research paper: https://arxiv.org/pdf/2307.09009.pdf


I hope you enjoyed the article! Don't forget to subscribe to my newsletter for all the latest updates. I promise you won't miss a beat!

Newsletter: https://thedatastoryteller.substack.com/

X: https://twitter.com/nakedaii

Medium: https://medium.com/@sohailshaik272

Podcasts: https://open.spotify.com/show/77H7rCxJzPNKL5pPxhHuyo?si=DZNn3hTeSkaHUcP5JRmeug

LinkedIn: https://www.dhirubhai.net/in/sohailshaik/

Stay curious and keep learning!









