BitNet a4.8
TuringPost
A quantized LLM, BitNet a4.8, cuts memory and compute requirements by using 4-bit activations and sparsification specifically tailored for 1-bit LLMs. It supports fast, large-scale deployment and works with much smaller data representations without a major drop in performance.
Here's how it works:
BitNet a4.8 balances precision and efficiency by combining two techniques: 4-bit activations for the inputs to certain model layers, and sparsification of intermediate states, whose remaining values are then stored in 8-bit format.
BitLinear layers for weight quantization: Both the attention and feed-forward network (FFN) layers in BitNet use BitLinear, which lets them operate with very low-bit (1.58-bit, effectively ternary) weights. Each weight stores far less information, making the model lighter and faster.
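As a rough illustration, here is a minimal sketch of absmean ternary (1.58-bit) weight quantization in the style of BitLinear, written in PyTorch. The function name, the per-tensor scale, and the matrix sizes are illustrative assumptions, not the exact BitNet a4.8 implementation:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean weight quantization to {-1, 0, +1} (~1.58 bits per weight),
    in the style of BitLinear layers. Per-tensor scaling is an assumption."""
    gamma = w.abs().mean()                        # per-tensor scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_q, gamma                             # dequantize with w_q * gamma

# Example: quantize a random FFN weight matrix (sizes are placeholders)
w = torch.randn(4096, 11008)
w_q, gamma = ternary_quantize(w)
print(w_q.unique())                               # tensor([-1., 0., 1.])
```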
Hybrid quantization and sparsification: Instead of using the same low-bit format everywhere, BitNet a4.8 mixes quantization (compressing values to fewer bits) with sparsification (dropping the smallest values to save computation). This split makes it possible to handle the outlier values that are hard to represent with very few bits.
Using 8-bit sparsification for certain layers: Some intermediate states, such as the outputs of the attention layers, contain many small values near zero. BitNet a4.8 sparsifies these states and stores the surviving values in 8 bits, keeping the important information while reducing the computational load.
Selective quantization with masks: BitNet a4.8 uses a topK mask to pick out only the most important (largest-magnitude) values of certain tensors, so quantization effort is spent on the "biggest" numbers while the smaller ones are zeroed out.
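A minimal sketch of this combination of a topK mask with 8-bit quantization of the surviving values, assuming a per-token absmax scale; keep_ratio and the tensor shapes are illustrative choices rather than the paper's exact settings:

```python
import torch

def sparsify_then_int8(x: torch.Tensor, keep_ratio: float = 0.5, eps: float = 1e-5):
    """Keep only the largest-magnitude values per token (mask the rest to zero),
    then quantize the survivors to int8 with a per-token absmax scale."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    topk = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, topk.indices, True)
    x_sparse = x * mask                                      # zero out small values
    scale = x_sparse.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 127.0
    x_q = (x_sparse / scale).round().clamp_(-128, 127).to(torch.int8)
    return x_q, scale, mask

# Example: sparsify and quantize mock attention-output activations
x = torch.randn(2, 8, 4096)
x_q, scale, mask = sparsify_then_int8(x)
```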
4-bit quantization for input layers: The inputs to the attention and FFN layers don't have as many outliers, so BitNet a4.8 uses a simpler 4-bit quantization for them. This keeps the model efficient while still representing the data accurately.
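A sketch of per-token 4-bit activation quantization into the range [-8, 7]; the absmax scale used here is an assumption for illustration, as the paper's quantizer may use a different scaling statistic:

```python
import torch

def int4_quantize(x: torch.Tensor, eps: float = 1e-5):
    """Per-token 4-bit quantization of activations into [-8, 7].
    The absmax scale is an illustrative choice."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 7.0
    x_q = (x / scale).round().clamp_(-8, 7)
    return x_q, scale                      # dequantize with x_q * scale

# Example: quantize mock inputs to an attention/FFN block
x = torch.randn(2, 8, 4096)
x_q, scale = int4_quantize(x)
print(x_q.unique().numel() <= 16)          # at most 16 distinct levels
```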
Squared ReLU for activation sparsity: Squared ReLU is used in the FFN layers, which turns many of the values into exact zeros (over 80% sparsity). Fewer nonzero values means fewer calculations.
Gated Linear Unit (GLU) for activation sparsity: For the gate projection, BitNet sees similar benefits, with over 67% of values being zero, allowing it to perform calculations only on the meaningful parts.
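A minimal sketch of a gated FFN with squared ReLU, showing how the zeros produced by the gate branch propagate through the elementwise gating. The weight names and sizes are made-up placeholders, and the sparsity measured with random weights is lower than the figures reported for the trained model:

```python
import torch
import torch.nn.functional as F

def squared_relu_glu(x, w_gate, w_up):
    """Gated FFN branch with squared ReLU on the gate. Zeros from ReLU stay
    zero after squaring and after the elementwise gate, so large parts of the
    intermediate activation need no further computation."""
    gate = F.relu(x @ w_gate) ** 2         # squared ReLU: many exact zeros
    up = x @ w_up
    return gate * up                       # gating propagates the zeros

x = torch.randn(4, 4096)
w_gate = torch.randn(4096, 11008) * 0.02   # placeholder weights
w_up = torch.randn(4096, 11008) * 0.02
h = squared_relu_glu(x, w_gate, w_up)
print("sparsity:", (h == 0).float().mean().item())  # ~50% with random weights;
# the trained model reportedly reaches over 80% in the FFN intermediate states
```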
Results:
BitNet a4.8 performs on par with BitNet b1.58 but runs faster: it activates only ~55% of its parameters and supports a 3-bit KV cache, making it even more efficient for large-scale LLM deployment.