Watch#5: Enjoying a Free Lunch and Boosting the Math Capabilities of Small LLMs
In this issue:
1. Efficient Streaming Language Models with Attention Sinks
2. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
3. BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
Watching: StreamingLLM (paper/code/bonus: implementation)
What problem does it solve? There will always be a trade-off between efficiency and raw performance. But even if we accept this as given, there’s still a lot to optimize. Therefore, researchers are constantly trying to minimize the performance losses that come with more efficient model training, decoding and inference.
How does it solve the problem? Previous attention schemes suffered from two problems: poor efficiency and harsh performance cutoffs, e.g., on long texts. StreamingLLM addresses this with what the authors call “attention sinks”. These special tokens may seem pretty useless at first, as they carry no semantic significance. But the researchers found that autoregressive language models with window attention often assign unreasonably high attention to the initial tokens. They see the Softmax operation as the likely cause of this phenomenon, since it forces the attention probabilities to sum to one - even when no token matches a query very well. The model then becomes biased towards the initial tokens because (almost) all subsequent tokens can attend to them. Put simply, attention sinks give the model a stable place to deposit surplus attention, so generation doesn’t destabilize when older tokens are evicted from the cache. The paper’s results show that even with just a single attention sink token, a “normal” LLM can be turned into a streaming LLM while preserving performance.
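To make the mechanics more concrete, here is a minimal Python sketch of a streaming KV cache that keeps a handful of attention-sink tokens plus a sliding window of recent tokens. The class name, parameter names, and defaults are illustrative assumptions, not the authors’ reference implementation.

```python
import torch


class StreamingKVCache:
    """Minimal sketch of a StreamingLLM-style KV cache (illustrative, not the
    official implementation).

    Keeps the first `n_sink` tokens ("attention sinks") plus a sliding window
    of the most recent `window` tokens; everything in between gets evicted.
    """

    def __init__(self, n_sink: int = 4, window: int = 1020):
        self.n_sink = n_sink
        self.window = window
        self.keys: list[torch.Tensor] = []    # one (head_dim,) tensor per token
        self.values: list[torch.Tensor] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-sink token once the cache exceeds its budget.
        if len(self.keys) > self.n_sink + self.window:
            del self.keys[self.n_sink]
            del self.values[self.n_sink]

    def tensors(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Stack into (seq_len, head_dim) tensors for the attention computation.
        return torch.stack(self.keys), torch.stack(self.values)
```

The key point the sketch captures is that the sink tokens are never evicted, which is what keeps the attention distribution stable as the window slides.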
What’s next? Evaluation. Evaluation. Evaluation. The method seems almost too good to be true and the researchers haven’t pushed the limits yet in terms of input length and model sizes. But so far, things are looking good and attention sinks might become a new standard for LLMs.
2. ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
What problem does it solve? Math capabilities and advanced reasoning skills have mostly been reserved for either huge closed-source models like GPT-4 or models fine-tuned on specialized datasets. General LLMs that run on consumer-grade hardware have made huge strides in question answering and chatting, but much less so in math and reasoning.
How does it solve the problem? The two prominent approaches to equipping LLMs with math and reasoning skills can roughly be divided into “rationale”- and “program”-based methods. Rationale-based methods try to elicit logical structure through prompting, e.g., Chain-of-Thought and Tree-of-Thoughts. Program-based methods, on the other hand, have the LLM write code that uses or creates tools to handle these operations for it, e.g., calling a calculator or coding a function to deal with a logic problem. Tool-integrated Reasoning Agent (ToRA) is a framework that merges both approaches into a single workflow. ToRA starts by reasoning about the query and how to solve it with code. It then writes and executes the code, and finally reasons again over the tool’s output. This last step can include fixes to the output, such as removing duplicates or checking whether a requested format (e.g., JSON) was met.
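To illustrate the reason → code → execute → reason-again loop, here is a hedged Python sketch of such an agent. The `llm` callable, the `<python>` tag convention, and the `tora_style_solve` helper are hypothetical stand-ins, not ToRA’s actual interface.

```python
import re
import subprocess
import sys


def tora_style_solve(question: str, llm, max_rounds: int = 3) -> str:
    """Sketch of an interleaved rationale/program loop in the spirit of ToRA.

    `llm` is assumed to be a callable that maps a prompt string to the model's
    text completion (a hypothetical interface, not ToRA's real API). Generated
    code is expected between <python> and </python> tags purely for illustration.
    """
    transcript = f"Question: {question}\n"
    for _ in range(max_rounds):
        completion = llm(transcript + "Reason step by step and emit Python "
                         "between <python> and </python> when computation is needed.\n")
        transcript += completion
        match = re.search(r"<python>(.*?)</python>", completion, re.DOTALL)
        if match is None:
            break  # the model produced a final natural-language answer
        # Execute the generated program and feed its output back for another round of reasoning.
        result = subprocess.run([sys.executable, "-c", match.group(1)],
                                capture_output=True, text=True, timeout=30)
        transcript += f"\nProgram output:\n{result.stdout or result.stderr}\n"
    return transcript
```

The design choice worth noting is the feedback loop: program output is appended to the transcript so the model can verify or repair its own answer rather than trusting a single code execution.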
What’s next? There’s no end in sight when it comes to prompting techniques and tool-usage improvements. While combining both seems like a logical step, we’ll most likely continue to see isolated advancements on both fronts.
3. BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
What problem does it solve? Companies are always looking to cut costs, and as LLM applications slowly find their way into production, requirements grow with scale. There’s a lot of redundant computation taking place when querying LLMs - especially chat models - and while the storage space of vector embeddings probably seems like a non-issue to most practitioners, things can look very different at sufficient scale. Last but not least, inference speed requirements can be tough to meet without paying a lot of extra dollars.
How does it solve the problem? Binary Token Representations (BTR) use 1-bit vectors to precompute every token in a document, with the goal of significantly reducing computation at inference time. The researchers also introduce a calibration technique for the binary representations to ensure that performance is maintained. This led to ~95% of the baseline performance at 2-3x inference speed in their tests.
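As a rough illustration of what 1-bit token representations buy you, here is a generic sign-based binarization sketch in Python. This is not BTR’s calibration scheme; `binarize` and `binary_similarity` are made-up helper names, and the sign threshold is an assumption.

```python
import numpy as np


def binarize(token_reprs: np.ndarray) -> np.ndarray:
    """Illustrative 1-bit binarization of precomputed token representations.

    Each float dimension is mapped to a single bit (positive -> 1, else 0) and
    packed, cutting storage by ~32x versus fp32 vectors.
    """
    signs = (token_reprs > 0).astype(np.uint8)   # 1 bit per dimension
    return np.packbits(signs, axis=-1)           # pack 8 bits per byte


def binary_similarity(packed_a: np.ndarray, packed_b: np.ndarray, dim: int) -> np.ndarray:
    """Approximate dot product between the underlying +/-1 vectors via XOR + bit counting."""
    # Matching bits contribute +1, mismatching bits -1, so dot = dim - 2 * mismatches.
    mismatches = np.unpackbits(packed_a ^ packed_b, axis=-1)[..., :dim].sum(axis=-1)
    return dim - 2 * mismatches
```

The efficiency gain comes from replacing float matrix products with cheap bitwise operations over much smaller arrays; BTR’s contribution is keeping accuracy high under that compression via calibration.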
What’s next? Binarization and quantization have become trending topics over the last few months, and new methods appear constantly. We’ve seen massive efficiency improvements in very little time, and at the beginning of this year few would’ve thought that we could fit 7B or even 13B LLMs into a free-tier Google Colab GPU. It’s hard to tell how far we can continue down this path, but at least for quantization it seems to me that we’ll be reaching a local optimum soon. Recent papers have shown 2-bit quantization to work somewhat well, and 4-bit quantization has already become standard procedure for people in the “GPU-poor” LLM space.
Papers of the Week: