Watch#5: Enjoying a Free Lunch and Boosting the Math Capabilities of Small LLMs
In this issue:
1. Efficient Streaming Language Models with Attention Sinks
2. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
3. BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
Watching: StreamingLLM (paper/code/bonus: implementation)
What problem does it solve? There will always be a trade-off between efficiency and raw performance. But even if we accept this as given, there’s still a lot to optimize. Therefore, researchers are constantly trying to minimize the performance losses that come with more efficient model training, decoding and inference.
How does it solve the problem? Previous attention schemes suffered from two problems: poor efficiency and harsh performance cutoffs, e.g., on long texts. StreamingLLM addresses this with what the authors call “attention sinks”. These special tokens may seem pretty useless at first, as they carry no semantic significance. But the researchers found that autoregressive language models with window attention often assign unreasonably high attention to the initial tokens. They see the Softmax operation as the likely cause of this phenomenon, since it forces the attention probabilities to sum to one - even when no token matches a query very well. The model then becomes biased towards the initial tokens because (almost) all subsequent tokens can attend to them. Put simply, attention sinks give the model a stable place to deposit surplus attention, so generation doesn’t destabilize when older tokens are evicted from the cache. The paper’s results show that even with just a single attention sink token, a “normal” LLM can be turned into a streaming LLM while preserving performance.
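To make the mechanics more concrete, here is a minimal Python sketch of a streaming KV cache that keeps a handful of attention-sink tokens plus a sliding window of recent tokens. The class name, parameter names, and defaults are illustrative assumptions, not the authors’ reference implementation.

```python
import torch


class StreamingKVCache:
    """Minimal sketch of a StreamingLLM-style KV cache (illustrative, not the
    official implementation).

    Keeps the first `n_sink` tokens ("attention sinks") plus a sliding window
    of the most recent `window` tokens; everything in between gets evicted.
    """

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink
        self.window = window
        self.keys: list[torch.Tensor] = []    # one (head_dim,) tensor per token
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-sink token once the cache exceeds its budget.
        if len(self.keys) > self.n_sink + self.window:
            del self.keys[self.n_sink]
            del self.values[self.n_sink]

    def tensors(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Stack into (seq_len, head_dim) tensors for the attention computation.
        return torch.stack(self.keys), torch.stack(self.values)
```

The key point the sketch captures is that the sink tokens are never evicted, which is what keeps the attention distribution stable as the window slides.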
What’s next? Evaluation. Evaluation. Evaluation. The method seems almost too good to be true and the researchers haven’t pushed the limits yet in terms of input length and model sizes. But so far, things are looking good and attention sinks might become a new standard for LLMs.
2. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
What problem does it solve? Math capabilities and advanced reasoning skills have mostly been reserved for either huge closed-source models like GPT-4 or models fine-tuned on specialized datasets. General LLMs that run on consumer-grade hardware have made huge strides in question answering and chatting, but much less so in math and reasoning.
How does it solve the problem? The two prominent approaches to equipping LLMs with math and reasoning skills can roughly be divided into “rationale”- and “program”-based methods. Rationale-based methods try to elicit logical structure through prompting, e.g., Chain-of-Thought and Tree-of-Thoughts. Program-based methods, on the other hand, have the LLM write code that uses or creates tools to handle these operations for it, e.g., calling a calculator or coding a function to deal with a logic problem. Tool-integrated Reasoning Agent (ToRA) is a framework that merges both approaches into a single workflow. ToRA starts by reasoning about the query and how to solve it with code. It then writes and executes the code, and finally reasons again over the tool’s output. This last step can include fixes to the output, such as removing duplicates or checking whether a requested format (e.g., JSON) was met.
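To illustrate the reason → code → execute → reason-again loop, here is a hedged Python sketch of such an agent. The `llm` callable, the `<python>` tag convention, and the `tora_style_solve` helper are hypothetical stand-ins, not ToRA’s actual interface.

```python
import re
import subprocess
import sys


def tora_style_solve(question: str, llm, max_rounds: int = 3) -> str:
    """Sketch of an interleaved rationale/program loop in the spirit of ToRA.

    `llm` is assumed to be a callable that maps a prompt string to the model's
    text completion (a hypothetical interface, not ToRA's real API). Generated
    code is expected between <python> and </python> tags purely for illustration.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_rounds):
        completion = llm(transcript + "Reason step by step and emit Python "
                         "between <python> and </python> when computation is needed.\n")
        transcript += completion
        match = re.search(r"<python>(.*?)</python>", completion, re.DOTALL)
        if match is None:
            break  # the model produced a final natural-language answer
        # Execute the generated program and feed its output back for another round of reasoning.
        result = subprocess.run([sys.executable, "-c", match.group(1)],
                                capture_output=True, text=True, timeout=30)
        transcript += f"\nProgram output:\n{result.stdout or result.stderr}\n"
    return transcript
```

The design choice worth noting is the feedback loop: program output is appended to the transcript so the model can verify or repair its own answer rather than trusting a single code execution.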
What’s next? There’s no end in sight when it comes to prompting techniques and tool-usage improvements. While combining both seems like a logical step, we’ll most likely continue to see isolated advancements on both fronts.
3. BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
What problem does it solve? Companies are always looking to cut costs, and as LLM applications slowly find their way into production, requirements grow with scale. There’s a lot of redundant computation taking place when querying LLMs - especially chat models - and while the storage space of vector embeddings probably seems like a non-issue to most practitioners, things can look very different at sufficient scale. Last but not least, inference speed requirements can be tough to meet without paying a lot of extra dollars.
How does it solve the problem? Binary Token Representations (BTR) use 1-bit vectors to precompute every token in a document, with the goal of significantly reducing computation at inference time. The researchers also introduce a calibration technique for the binary representations to ensure that performance is maintained. This led to ~95% of the baseline performance at 2-3x inference speed in their tests.
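As a rough illustration of what 1-bit token representations buy you, here is a generic sign-based binarization sketch in Python. This is not BTR’s calibration scheme; `binarize` and `binary_similarity` are made-up helper names, and the sign threshold is an assumption.

```python
import numpy as np


def binarize(token_reprs: np.ndarray) -> np.ndarray:
    """Illustrative 1-bit binarization of precomputed token representations.

    Each float dimension is mapped to a single bit (positive -> 1, else 0) and
    packed, cutting storage by ~32x versus fp32 vectors.
    """
    signs = (token_reprs > 0).astype(np.uint8)   # 1 bit per dimension
    return np.packbits(signs, axis=-1)           # pack 8 bits per byte


def binary_similarity(packed_a: np.ndarray, packed_b: np.ndarray, dim: int) -> np.ndarray:
    """Approximate dot product between the underlying +/-1 vectors via XOR + bit counting."""
    # Matching bits contribute +1, mismatching bits -1, so dot = dim - 2 * mismatches.
    mismatches = np.unpackbits(packed_a ^ packed_b, axis=-1)[..., :dim].sum(axis=-1)
    return dim - 2 * mismatches
```

The efficiency gain comes from replacing float matrix products with cheap bitwise operations over much smaller arrays; BTR’s contribution is keeping accuracy high under that compression via calibration.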
What’s next? Binarization and quantization have become trending topics over the last few months, and new methods appear constantly. We’ve seen massive efficiency improvements in very little time, and at the beginning of this year few would’ve thought that we could fit 7B or even 13B LLMs into a free-tier Google Colab GPU. It’s hard to tell how far we can continue down this path, but at least for quantization it seems to me that we’ll be reaching a local optimum soon. Recent papers have shown 2-bit quantization to work somewhat well, and 4-bit quantization has already become standard procedure for people in the “GPU-poor” LLM space.
Papers of the Week: