BitNet a4.8
TuringPost
A quantized LLM, BitNet a4.8, cuts memory and compute requirements by using 4-bit activations and sparsification specifically tailored for 1-bit LLMs. It supports fast, large-scale deployment and works with much smaller data representations without a major drop in performance.
Here's how it works:
BitNet a4.8 balances precision and efficiency by combining two techniques: 4-bit activations for the inputs to certain model layers, and sparsification of intermediate states, whose remaining values are then stored in 8-bit format.
BitLinear layers for weight quantization: Both the attention and feed-forward network (FFN) layers in BitNet use BitLinear, which lets them operate with very low-bit (1.58-bit, effectively ternary) weights. Each weight stores far less information, making the model lighter and faster.
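As a rough illustration, here is a minimal sketch of absmean ternary (1.58-bit) weight quantization in the style of BitLinear, written in PyTorch. The function name, the per-tensor scale, and the matrix sizes are illustrative assumptions, not the exact BitNet a4.8 implementation:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean weight quantization to {-1, 0, +1} (~1.58 bits per weight),
    in the style of BitLinear layers. Per-tensor scaling is an assumption."""
    gamma = w.abs().mean()                        # per-tensor scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_q, gamma                             # dequantize with w_q * gamma

# Example: quantize a random FFN weight matrix (sizes are placeholders)
w = torch.randn(4096, 11008)
w_q, gamma = ternary_quantize(w)
print(w_q.unique())                               # tensor([-1., 0., 1.])
```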
Hybrid quantization and sparsification: Instead of using the same low-bit format everywhere, BitNet a4.8 mixes quantization (compressing values to fewer bits) with sparsification (dropping the smallest values to save computation). This split makes it possible to handle the outlier values that are hard to represent with very few bits.
Using 8-bit sparsification for certain layers: Some intermediate states, such as the outputs of the attention layers, contain many small values near zero. BitNet a4.8 sparsifies these states and stores the surviving values in 8 bits, keeping the important information while reducing the computational load.
Selective quantization with masks: BitNet a4.8 uses a topK mask to pick out only the most important (largest-magnitude) values of certain tensors, so quantization effort is spent on the "biggest" numbers while the smaller ones are zeroed out.
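A minimal sketch of this combination of a topK mask with 8-bit quantization of the surviving values, assuming a per-token absmax scale; keep_ratio and the tensor shapes are illustrative choices rather than the paper's exact settings:

```python
import torch

def sparsify_then_int8(x: torch.Tensor, keep_ratio: float = 0.5, eps: float = 1e-5):
    """Keep only the largest-magnitude values per token (mask the rest to zero),
    then quantize the survivors to int8 with a per-token absmax scale."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    topk = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, topk.indices, True)
    x_sparse = x * mask                                      # zero out small values
    scale = x_sparse.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 127.0
    x_q = (x_sparse / scale).round().clamp_(-128, 127).to(torch.int8)
    return x_q, scale, mask

# Example: sparsify and quantize mock attention-output activations
x = torch.randn(2, 8, 4096)
x_q, scale, mask = sparsify_then_int8(x)
```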
4-bit quantization for input layers: The inputs to the attention and FFN layers don't have as many outliers, so BitNet a4.8 uses a simpler 4-bit quantization for them. This keeps the model efficient while still representing the data accurately.
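A sketch of per-token 4-bit activation quantization into the range [-8, 7]; the absmax scale used here is an assumption for illustration, as the paper's quantizer may use a different scaling statistic:

```python
import torch

def int4_quantize(x: torch.Tensor, eps: float = 1e-5):
    """Per-token 4-bit quantization of activations into [-8, 7].
    The absmax scale is an illustrative choice."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 7.0
    x_q = (x / scale).round().clamp_(-8, 7)
    return x_q, scale                      # dequantize with x_q * scale

# Example: quantize mock inputs to an attention/FFN block
x = torch.randn(2, 8, 4096)
x_q, scale = int4_quantize(x)
print(x_q.unique().numel() <= 16)          # at most 16 distinct levels
```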
Squared ReLU for activation sparsity: Squared ReLU is used in the FFN layers, which turns many of the values into exact zeros (over 80% sparsity). Fewer nonzero values means fewer calculations.
Gated Linear Unit (GLU) for activation sparsity: For the gate projection, BitNet sees similar benefits, with over 67% of values being zero, allowing it to perform calculations only on the meaningful parts.
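A minimal sketch of a gated FFN with squared ReLU, showing how the zeros produced by the gate branch propagate through the elementwise gating. The weight names and sizes are made-up placeholders, and the sparsity measured with random weights is lower than the figures reported for the trained model:

```python
import torch
import torch.nn.functional as F

def squared_relu_glu(x, w_gate, w_up):
    """Gated FFN branch with squared ReLU on the gate. Zeros from ReLU stay
    zero after squaring and after the elementwise gate, so large parts of the
    intermediate activation need no further computation."""
    gate = F.relu(x @ w_gate) ** 2         # squared ReLU: many exact zeros
    up = x @ w_up
    return gate * up                       # gating propagates the zeros

x = torch.randn(4, 4096)
w_gate = torch.randn(4096, 11008) * 0.02   # placeholder weights
w_up = torch.randn(4096, 11008) * 0.02
h = squared_relu_glu(x, w_gate, w_up)
print("sparsity:", (h == 0).float().mean().item())  # ~50% with random weights;
# the trained model reportedly reaches over 80% in the FFN intermediate states
```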
Results:
BitNet a4.8 performs on par with BitNet b1.58 but runs faster: it activates only ~55% of its parameters and supports a 3-bit KV cache, making it even more efficient for large-scale LLM deployment.