When Accuracy Isn't Enough - Don't Make This Mistake

In this issue:

  1. Accuracy is not all you need
  2. Teaching LLMs inductive reasoning
  3. When your LLM just doesn’t know enough words


1. Accuracy is Not All You Need

Watching: Model Quantization (paper)

What problem does it solve? Compressing Large Language Models (LLMs) is crucial for practical deployment, as it reduces computational costs and memory requirements. However, current evaluation methods for compressed models rely primarily on accuracy metrics, which may not capture the full extent of quality degradation. This study argues that accuracy alone is insufficient and calls for more comprehensive metrics to evaluate compressed LLMs.

How does it solve the problem? The researchers propose two new metrics to evaluate compressed LLMs: KL-Divergence and flips. KL-Divergence measures the difference in probability distributions between the baseline and compressed models, providing a more nuanced understanding of how the models' outputs differ. The flips metric quantifies the proportion of answers that change from correct to incorrect (and vice versa) between the baseline and compressed models, even when overall accuracy remains similar. By incorporating these metrics, the study offers a more comprehensive evaluation framework for compressed LLMs.
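To make the two metrics concrete, here is a minimal sketch of how they could be computed with PyTorch. The function names, tensor shapes, and toy answers are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def kl_divergence(baseline_logits, compressed_logits):
    """Mean per-token KL(baseline || compressed), given logits of
    shape (num_tokens, vocab_size) from each model on the same inputs."""
    log_p = F.log_softmax(baseline_logits, dim=-1)    # baseline distribution
    log_q = F.log_softmax(compressed_logits, dim=-1)  # compressed distribution
    # F.kl_div(input, target) computes KL(target || input); both arguments
    # are log-probabilities here, hence log_target=True.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

def flips(baseline_answers, compressed_answers, gold_answers):
    """Fraction of examples whose correctness changes between the two
    models, in either direction (correct -> incorrect or vice versa)."""
    changed = sum(
        (b == g) != (c == g)
        for b, c, g in zip(baseline_answers, compressed_answers, gold_answers)
    )
    return changed / len(gold_answers)

# Toy benchmark: both models score 75% accuracy, yet half the answers flip.
base = ["A", "B", "C", "D"]
comp = ["A", "C", "B", "D"]
gold = ["A", "B", "B", "D"]
print(flips(base, comp, gold))  # 0.5

# Per-token KL on random logits (5 tokens, 100-entry vocabulary):
torch.manual_seed(0)
print(kl_divergence(torch.randn(5, 100), torch.randn(5, 100)))
```

The toy benchmark captures the key intuition: accuracy reports both models as identical (75%), while the flips metric exposes that they disagree on half of the items.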

What's next? As the field of LLM compression continues to evolve, it is essential to adopt a more holistic approach to evaluating compressed models. Future research should focus on developing and refining metrics that capture various aspects of model performance, such as generalization, robustness, and consistency. By addressing these challenges, we can ensure that compressed LLMs maintain high quality while being more accessible and efficient for real-world applications.


2. Case2Code: Learning Inductive Reasoning with Synthetic Data

Watching: Case2Code (paper)

What problem does it solve? Inductive reasoning, the ability to infer underlying rules by observing examples or sequential transformations, is a crucial aspect of complex reasoning. While Large Language Models (LLMs) have shown impressive deductive reasoning skills, their inductive reasoning capabilities have not been extensively evaluated or explicitly trained. Collecting large-scale, diverse human-generated inductive data is challenging, making it difficult to assess and enhance LLMs' inductive reasoning abilities.

How does it solve the problem? The researchers propose a novel approach called Case2Code, which leverages the expressiveness and correctness of programs to synthesize inductive reasoning tasks. They collect a diverse set of executable programs and generate input-output transformations for each program. LLMs are then tasked with inferring the underlying code implementations based on the synthetic input-output cases. By evaluating representative LLMs on the Case2Code task, the researchers demonstrate that case-to-code induction is challenging for current models. To address this, they synthesize large-scale Case2Code training samples to explicitly train LLMs in inductive reasoning.
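As a rough illustration of how such training samples could be synthesized, the sketch below runs a toy program on random inputs and formats the observed transformations as an induction prompt. The program, prompt wording, and sampling scheme are invented for illustration and are not the paper's actual pipeline:

```python
import random

def example_program(xs):
    """Toy stand-in for one of the collected executable programs:
    deduplicate a list and sort it."""
    return sorted(set(xs))

def make_case2code_prompt(program, n_cases=4, seed=0):
    """Execute the program on random inputs and format the resulting
    input-output cases as a code-induction task for an LLM."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        inp = [rng.randint(0, 9) for _ in range(rng.randint(3, 6))]
        cases.append(f"f({inp!r}) -> {program(inp)!r}")
    return (
        "Infer the underlying Python function f from these observed cases,\n"
        "then write its implementation:\n" + "\n".join(cases)
    )

print(make_case2code_prompt(example_program))
```

Pairing each such prompt with the program's source code yields a (cases, code) training sample; repeating this over a large, diverse set of collected programs produces the kind of synthetic corpus the paper trains on.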

What's next? The Case2Code approach shows promise in enhancing the inductive reasoning capabilities of LLMs. The researchers demonstrate that training on synthetic Case2Code data not only improves performance on the Case2Code task itself but also benefits various coding abilities of the trained LLMs. This suggests that learning inductive reasoning through synthetic data has great potential. Future work could explore expanding the Case2Code approach to other domains beyond coding, as well as investigating the transfer of inductive reasoning skills learned through Case2Code to real-world tasks requiring complex reasoning.


3. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Watching: LLM Vocabulary (paper)

What problem does it solve? While a lot of research has been done on scaling laws for LLMs, most of it has focused on the number of parameters and the amount of training data. The vocabulary size, which determines the granularity of the tokens used to represent input and output sequences, has been largely overlooked. Choosing the right vocabulary size is a trade-off: a larger vocabulary represents the same text with fewer tokens, but its rarer tokens appear less often during training and risk being under-fitted.
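To see the efficiency side of this trade-off with real tokenizers, the sketch below counts how many tokens the same sentence costs under BPE vocabularies of roughly 50K and 100K entries, using the open-source tiktoken library (the example sentence is arbitrary):

```python
import tiktoken  # third-party tokenizer library: pip install tiktoken

text = "Internationalization is indispensable for tokenization research."

# Two real BPE vocabularies of different sizes:
# GPT-2's (~50K entries) and cl100k_base (~100K entries).
small_vocab = tiktoken.get_encoding("gpt2")
large_vocab = tiktoken.get_encoding("cl100k_base")

print("~50K vocab :", len(small_vocab.encode(text)), "tokens")
print("~100K vocab:", len(large_vocab.encode(text)), "tokens")
```

A larger vocabulary typically needs fewer tokens for the same text, so each document costs fewer forward passes to process; the price is a bigger embedding matrix and fewer training updates for each individual token.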

How does it solve the problem? The researchers propose three different methods for predicting the optimal vocabulary size for a given compute budget: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. All three methods converge on the same result, showing that the optimal vocabulary size depends on the available compute budget and that larger models should use larger vocabularies. For example, they predict that the Llama2-70B model should have used a vocabulary size of at least 216K instead of the 32K that was actually used.
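For intuition on how such a prediction could be made, here is a hedged sketch of the IsoFLOPs-style fitting idea: read off the best-performing vocabulary size at several compute budgets, then fit a power law in log-log space. The data points below are made up for illustration; they are not the paper's measurements, and the paper's actual estimators are more involved:

```python
import numpy as np

# Hypothetical (FLOPs budget, best vocabulary size) pairs, as one would
# read off from a series of IsoFLOPs sweeps. Illustrative numbers only.
flops = np.array([1e19, 1e20, 1e21, 1e22])
best_vocab = np.array([20e3, 35e3, 61e3, 107e3])

# Fit a power law V_opt = a * C^gamma by linear regression in log-log space.
gamma, log_a = np.polyfit(np.log(flops), np.log(best_vocab), 1)

def optimal_vocab(compute_flops):
    """Extrapolate the fitted power law to a new compute budget."""
    return float(np.exp(log_a) * compute_flops ** gamma)

print(f"fitted exponent gamma = {gamma:.3f}")
print(f"predicted optimal vocabulary at 1e24 FLOPs: {optimal_vocab(1e24):,.0f}")
```

Because the fitted exponent is positive, the predicted optimal vocabulary grows monotonically with compute, which matches the qualitative conclusion shared by all three of the paper's methods.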

What's next? The findings of this research could have a significant impact on the design of future LLMs. By jointly considering the number of parameters and the vocabulary size, it may be possible to train more efficient models that achieve better performance with the same compute budget. Further research will be needed to fully understand the trade-offs involved and to develop practical guidelines for choosing the optimal vocabulary size for a given application.


Papers of the Week:

Comments:

Surya Putchala (4 months ago):
Interesting paper! No single evaluation metric is enough on its own. To me, the right metrics depend on the use case!

John K. Moran (4 months ago):
Relying solely on accuracy can be misleading when evaluating compressed LLMs. The introduction of KL-Divergence and flips as metrics is a valuable step towards a more holistic evaluation process.
