How to measure language model performance
Welcome to Continual Learnings
A weekly newsletter for practitioners building ML-powered products.
What we're reading this week
Evaluating generative models like LLMs is notoriously difficult because it’s hard to tell which outputs are better without the help of humans. Recently, the research community has explored training auxiliary models to assess the performance of these hard-to-evaluate generative models. This paper demonstrates that this approach works for code generation models.
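To make the general idea concrete, here’s a minimal, self-contained sketch of scoring candidate generations with an auxiliary model instead of asking humans which one is better. Both models are toy stand-ins we made up for illustration, not the paper’s actual setup.

```python
# Sketch: use an auxiliary scoring model to rank candidate generations,
# rather than relying on human judgment. Toy stand-ins throughout.

def toy_code_generator(prompt: str, n: int = 3) -> list[str]:
    # Stand-in for a code generation model sampled n times.
    return [f"def solution():  # attempt {i}\n    return {i}" for i in range(n)]

def toy_evaluator(prompt: str, candidate: str) -> float:
    # Stand-in for an auxiliary model trained to predict output quality.
    # A real evaluator would score things like functional correctness.
    return float(len(candidate))  # toy heuristic only

def pick_best(prompt: str) -> str:
    candidates = toy_code_generator(prompt)
    return max(candidates, key=lambda c: toy_evaluator(prompt, c))

if __name__ == "__main__":
    print(pick_best("Write a function that returns its attempt number."))
```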
Libraries like Langchain are having a moment. Their purpose is to make it easy to compose language models, vector similarity search, and other operations into language model apps. Promptable is a Langchain-like library for the JavaScript ecosystem. That’s exciting because you don’t need ML expertise to build applications with this stack, so removing the dependence on Python should let more people create AI apps.
Reinforcement learning from human feedback (RLHF) has captured the attention of the ML community because of its role in ChatGPT. However, RLHF is difficult to implement: it requires training an auxiliary model and using a notoriously finicky heavyweight RL algorithm like PPO. This paper shows that, while human feedback is incredibly valuable, the RL part might not be necessary. You just need to be clever about transforming the feedback into a signal that can be used for supervised learning.
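One way to picture “transforming the feedback into a signal for supervised learning”: keep only the responses people rated highly and fine-tune on them as ordinary (prompt, target) pairs. This is a toy sketch with made-up data and a placeholder fine_tune() call; the paper’s exact transformation may differ.

```python
# Sketch: turn human feedback into a supervised-learning signal by filtering
# for well-rated responses, instead of running RL (e.g., PPO).
# Data and fine_tune() are hypothetical placeholders.

feedback_log = [
    {"prompt": "Summarize the memo.", "response": "Short, accurate summary.", "rating": 5},
    {"prompt": "Summarize the memo.", "response": "Rambling, off-topic text.", "rating": 1},
    {"prompt": "Translate to French.", "response": "Bonne traduction.", "rating": 4},
]

# Keep only responses humans rated highly; these become ordinary
# (prompt, target) pairs for supervised fine-tuning.
sft_dataset = [(ex["prompt"], ex["response"]) for ex in feedback_log if ex["rating"] >= 4]

def fine_tune(pairs):
    # Placeholder for a standard supervised fine-tuning loop
    # (cross-entropy on the target tokens given the prompt).
    for prompt, target in pairs:
        print(f"train on: {prompt!r} -> {target!r}")

fine_tune(sft_dataset)
```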
This article points out that a “native” feedback loop, i.e., one where users give feedback implicitly via outcomes like click data, is more valuable than one that requires users to provide explicit feedback like a thumbs up / thumbs down.
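For intuition, here’s a toy sketch contrasting the two loops: logging an implicit outcome from normal product usage alongside an optional explicit rating. All of the field and function names are illustrative.

```python
# Toy sketch of implicit vs. explicit feedback capture. Names are made up.
import time

def log_event(event: dict) -> None:
    print(event)  # stand-in for writing to your analytics / feedback store

def record_implicit_feedback(prediction_id: str, user_accepted: bool) -> None:
    # "Native" signal: the outcome of normal usage (click, accept, copy...).
    log_event({"prediction_id": prediction_id,
               "signal": "accepted" if user_accepted else "ignored",
               "ts": time.time()})

def record_explicit_feedback(prediction_id: str, thumbs_up: bool) -> None:
    # Explicit signal: only arrives when a user bothers to rate the output.
    log_event({"prediction_id": prediction_id,
               "signal": "thumbs_up" if thumbs_up else "thumbs_down",
               "ts": time.time()})

record_implicit_feedback("pred-123", user_accepted=True)
record_explicit_feedback("pred-123", thumbs_up=True)
```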
Hamel Husain’s notes might clarify some of what you find confusing about model serving.
This week in “does twitter imitate papers or do papers imitate twitter”, researchers discovered a phenomenon that the LLM community online has known for a while: you can give LLMs access to tools (e.g., APIs) that they can call to solve problems, like arithmetic, that they are not naturally good at. In all seriousness, this line of work seems like one of the paths to much broader usefulness of language models across tasks.
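The pattern is simple: the model emits a structured tool call, the app executes it, and the result is spliced back into the text. Here’s a minimal sketch; the CALC(...) convention and the parsing are ours, not the paper’s.

```python
# Minimal sketch of LLM tool use: the model writes a structured tool call,
# the app executes it, and the result replaces the call in the output.
import re

def run_calculator(expression: str) -> str:
    # A real implementation would use a safe expression parser, not eval().
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(eval(expression))

def resolve_tool_calls(model_output: str) -> str:
    # Replace every CALC(...) the model emitted with the computed value.
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: run_calculator(m.group(1)),
                  model_output)

# Pretend the model produced this text instead of guessing at the arithmetic.
print(resolve_tool_calls("The total cost is CALC(127 * 49) dollars."))
# -> "The total cost is 6223 dollars."
```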
Ok, I’m teaching this one, not reading it. Just like with the original Full Stack Deep Learning back in 2018, we realized that there’s a huge body of knowledge about how to build products with LLMs that is currently being passed from practitioner to practitioner through twitter threads and newsletters like this one. This course is our first attempt to formalize this into a guide to building applications with this exciting new stack.
Production ML papers to know
In this series, we cover important papers to know if you build ML-powered products.
Holistic Evaluation of Language Models
You probably feel like Language Models are advancing at a stunning pace.
But how do we know they really are? And how can we quantify how much better the latest-and-greatest (e.g., GPT-3) is than a less expensive alternative, and how much those differences will matter in the real world?
Today’s paper proposes an approach that might help.
The challenge
Language models (LMs) are becoming ubiquitous in the post-ChatGPT world. But how well do you really understand how the latest models perform? Sure, they have impressive few-shot capabilities and suffer from a tendency to hallucinate. But we’re MLEs, we should be able to quantify that, right?
Typically, researchers assess LMs on a limited subset of their possible applications using a single metric like accuracy. These benchmarks aren’t standardized, making performance comparison hard.
The example of ImageNet showed that AI research benefits from having a generally accepted standard benchmark. Without an equivalent in the LM world, knowledge of the pros and cons of different models is disseminated through word-of-mouth and random twitter threads.
The solution
The main challenge in evaluating LMs is that they can be adapted to many different scenarios, which calls for a holistic approach to evaluation.
To that end, the paper proposes HELM (Holistic Evaluation of Language Models), which is built on three pillars: broad coverage of scenarios (with explicit acknowledgment of what is left out), multi-metric measurement (accuracy alongside metrics like calibration, robustness, fairness, bias, toxicity, and efficiency), and standardization of how models are evaluated.
Results
So what is the impact of standardizing LM evaluation?
The paper reports that, prior to HELM, models were on average evaluated on only 17.9% of the core scenarios, meaning there was no reliable way to compare results, and no guarantee that models were assessed against all the scenarios they might be deployed in, or against the relevant metrics.
HELM evaluated 30 prominent language models, raising this coverage to 96% and enabling direct, head-to-head comparisons. The chart below illustrates the increase in coverage relative to previous work in the field.
This benchmark used few-shot prompting with relatively simple, generic prompts.
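To make the setup concrete, here’s a toy, self-contained sketch of HELM-style evaluation: run one model over a grid of scenarios with a simple few-shot prompt, score each with one or more metrics, and collect the results. The scenarios, metrics, and model below are stand-ins we invented, not HELM’s actual implementation.

```python
# Toy sketch of scenario-by-metric evaluation with simple few-shot prompts.

def toy_model(prompt: str) -> str:
    return "positive"  # stand-in for a few-shot-prompted language model

def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    # Simple, generic prompt format, as in the benchmark described above.
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nOutput:"

scenarios = {
    "sentiment": [("terrible plot", "negative"), ("great movie", "positive")],
    "toxicity":  [("you are wonderful", "non-toxic"), ("you are awful", "toxic")],
}
metrics = {"accuracy": lambda pred, gold: float(pred == gold)}

results = {}
for name, dataset in scenarios.items():
    few_shot, evaluation = dataset[:1], dataset[1:]
    for metric_name, metric in metrics.items():
        scores = [metric(toy_model(build_few_shot_prompt(few_shot, x)), y)
                  for x, y in evaluation]
        results[(name, metric_name)] = sum(scores) / len(scores)

print(results)  # {('sentiment', 'accuracy'): 1.0, ('toxicity', 'accuracy'): 0.0}
```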
In addition, HELM introduces a taxonomy to understand the evaluation of future LMs. For example, how many of the core scenarios and metrics have been used for a LM’s evaluation? What is missing, and what risks does this incur for future use of that LM?
The chart below, taken from the paper, shows the structure of the taxonomy, broken down into Scenarios and Metrics.
Finally, by evaluating a large number of models, the paper validates some of what we know anecdotally about LM performance.
For example, instruction-tuned models tend to perform better than other model types, and ‘non-open’ models tend to outperform open-access ones. There were consistent performance disparities across demographic groups for all models, and all models showed significant sensitivity to prompts, particularly to the formatting of the prompt and to the choice and number of in-context examples.
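That last finding is easy to check for your own model: evaluate the same data under a few prompt formats and shot counts and look at the spread in scores. A toy sketch, with an artificial model and made-up formats:

```python
# Toy sketch of measuring prompt sensitivity: vary the prompt format and the
# number of in-context examples, and compare accuracy across variants.

def toy_model(prompt: str) -> str:
    # Stand-in whose behavior depends (artificially) on prompt length,
    # mimicking the formatting sensitivity reported in the paper.
    return "positive" if len(prompt) % 2 == 0 else "negative"

data = [("great movie", "positive"), ("terrible plot", "negative"),
        ("loved it", "positive"), ("waste of time", "negative")]

formats = {
    "colon": lambda x: f"Review: {x}\nLabel:",
    "arrow": lambda x: f"{x} ->",
}

def accuracy(fmt, n_shots: int) -> float:
    shots, evaluation = data[:n_shots], data[n_shots:]
    prefix = "\n".join(fmt(x) + " " + y for x, y in shots)
    preds = [toy_model(prefix + "\n" + fmt(x)) for x, y in evaluation]
    return sum(p == y for p, (x, y) in zip(preds, evaluation)) / len(evaluation)

for fmt_name, fmt in formats.items():
    for n_shots in (1, 2):
        print(fmt_name, n_shots, round(accuracy(fmt, n_shots), 2))
```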
The upshot
Hopefully this paper will lead to some much-needed standardization in LM assessment.
The paper is 90 pages long prior to references, and as such contains much more detail than we covered here.
If you are an LM developer, or in any way interested in how the technological impact of AI on society can be better evaluated, then we recommend taking a look. You can find the paper here.
Thanks for reading!
Feel free to get in touch if you have any questions: you can message us on socials or simply reply to this email.
You can also find previous issues on our blog, on twitter, and here on LinkedIn.
The Gantry team