BitNet a4.8

BitNet a4.8 is a quantized LLM that reduces compute and memory requirements by using 4-bit activations and sparsity tailored specifically for 1-bit LLMs. It supports fast, large-scale deployment and works on much smaller data sizes without a major drop in performance.

Here's how it works:

BitNet a4.8 combines techniques to balance precision and efficiency: it uses 4-bit activations for the inputs to certain model layers and sparsifies the intermediate states, which are then kept in 8-bit format.

Image credit: Original paper

BitLinear layers for weight quantization: Both the attention and feed-forward network (FFN) layers in BitNet use BitLinear, which lets them operate with very low-bit weights (1.58-bit, i.e., ternary values of -1, 0, and +1). They store less information per weight, making the model lighter and faster.
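
To make this concrete, here is a minimal sketch of ternary (1.58-bit) weight quantization in the style of BitLinear, assuming the absmean rounding scheme described in the BitNet b1.58 line of work; the function name and per-tensor scaling are illustrative rather than the paper's exact implementation.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Round weights to {-1, 0, +1} using an absmean scale (as in the
    BitNet b1.58 line of work). Dequantization is w ≈ w_ternary * scale."""
    scale = w.abs().mean().clamp(min=eps)         # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)  # snap to {-1, 0, +1}
    return w_ternary, scale

# Example: a toy weight matrix
w = torch.randn(4, 8)
w_q, s = quantize_weights_ternary(w)
print(w_q.unique())                  # tensor([-1., 0., 1.])
print((w_q * s - w).abs().mean())    # average quantization error
```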

Hybrid quantization and sparsification: Instead of using the same low-bit format everywhere, BitNet a4.8 uses a mix of quantization (compressing values to fewer bits) and sparsification (dropping small values to save computation). This helps it cope with outlier channels that carry unusually large values (a sparsify-then-quantize sketch follows two points below).

Image credit: Original paper

Using 8-bit sparsification for certain layers: Some intermediate states, such as the inputs to the attention output projection, are dominated by values near zero with a few large outliers. BitNet a4.8 applies an 8-bit sparsification method to them, keeping the important information while reducing the computational load.

Selective quantization with masks: BitNet selectively keeps the most important entries (the top values by magnitude) of certain activations. This selective quantization is like giving special attention to only the “biggest” numbers while ignoring the smaller ones.
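
As a rough illustration of this sparsify-then-quantize idea, the sketch below keeps only the largest-magnitude activations per token (a top-K mask) and stores the survivors in INT8 with a per-token absmax scale. The keep ratio and the exact scaling scheme are assumptions for the example, not the paper's precise recipe.

```python
import torch

def topk_sparsify_int8(x: torch.Tensor, keep_ratio: float = 0.5, eps: float = 1e-5):
    """Zero out all but the top keep_ratio fraction of values per token
    (by magnitude), then quantize the survivors to INT8 with a per-token
    absmax scale. Returns (int8 values, scale, boolean mask)."""
    k = max(1, int(x.shape[-1] * keep_ratio))
    # Build a mask selecting the k largest-magnitude entries per token.
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0).bool()
    x_sparse = x * mask
    # Per-token absmax scale mapping the retained values into [-127, 127].
    scale = x_sparse.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    x_int8 = (x_sparse / scale).round().clamp(-127, 127).to(torch.int8)
    return x_int8, scale, mask

x = torch.randn(2, 16)                  # (tokens, hidden)
x_q, s, m = topk_sparsify_int8(x, keep_ratio=0.25)
x_deq = x_q.float() * s                 # dequantized, sparse approximation of x
print(m.float().mean())                 # fraction of values kept (~0.25)
```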

4-bit quantization for input layers: The inputs to the attention and FFN layers have fewer outliers, so BitNet a4.8 uses a simpler 4-bit quantization method for them. This keeps the model efficient while still representing the data accurately.
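
Here is a minimal sketch of symmetric 4-bit per-token activation quantization of the kind described above, assuming an absmax scale mapped onto the integer range [-8, 7]; the paper's exact rounding and clipping details may differ.

```python
import torch

def quantize_activations_int4(x: torch.Tensor, eps: float = 1e-5):
    """Symmetric 4-bit quantization with a per-token absmax scale.
    Values are mapped to integers in [-8, 7]; dequantization is x ≈ x_int4 * scale."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 7.0
    x_int4 = (x / scale).round().clamp(-8, 7)
    return x_int4, scale

x = torch.randn(2, 16)
x_q, s = quantize_activations_int4(x)
print(x_q.unique())                   # integers in [-8, 7]
print(((x_q * s) - x).abs().mean())   # average quantization error
```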

Image credit: Original paper

ReLU usage for activation sparsity: Squared ReLU is used in the FFN layers, which turns a large share of values into exact zeros (over 80% sparsity in the intermediate states). Fewer non-zero values means fewer calculations.

Gated Linear Unit (GLU) for activation sparsity: The gate projection sees similar benefits, with over 67% of its values being zero, so computation is only needed for the meaningful parts.
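
The sketch below shows a toy gated FFN with squared ReLU on the gate path and measures how many activations become exact zeros. The layer sizes and the placement of the nonlinearity are illustrative assumptions; on random inputs the sparsity is roughly 50%, while the much higher figures quoted above come from trained models.

```python
import torch
import torch.nn.functional as F

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: relu(x) ** 2 — zeros out the negative half of the input."""
    return F.relu(x) ** 2

# Toy gated FFN block: gate and up projections, with squared ReLU on the gate.
hidden, inner = 64, 256
x = torch.randn(32, hidden)                       # a batch of token activations
w_gate = torch.randn(hidden, inner)
w_up = torch.randn(hidden, inner)

gate = squared_relu(x @ w_gate)                   # sparse gate activations
ffn_intermediate = gate * (x @ w_up)              # elementwise gating (GLU-style)

print("gate sparsity:", (gate == 0).float().mean().item())                 # ~0.5 on random data
print("intermediate sparsity:", (ffn_intermediate == 0).float().mean().item())
```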

Results:

BitNet a4.8 performs on par with BitNet b1.58 while being faster at inference. It activates only ~55% of its parameters and supports a 3-bit KV cache, making it even more efficient for large-scale LLM deployment.

Image credit: Original paper

Paper: https://arxiv.org/pdf/2411.04965

https://thegenerality.com/agi/
