Quantization Fundamentals For People In A Hurry
1. Numerical Representations

Figure 1-1

Computers use a fixed number of bits to represent numbers, so how can Python handle such big numbers without any problems?

  • When you run 2^9999 in Python, you get a result that is far bigger than anything a 64-bit integer can hold.
  • How is this possible without any problem?
  • Python uses BigNum arithmetic.
  • In Figure 1-1, we represented the number 6 using base 2. In base 10, 6 needs only a single digit, but in base 2 it needs 3 digits. So we can infer that the smaller the base, the more digits we need to represent a number.
  • Python does the inverse of the above scenario: it stores each big integer as an array of digits, where each digit is a digit of the number in base 2^30. Overall, we need fewer digits to store big numbers.

Figure 1-2


  • If we store the number 2^9999 in base 10, we need 3,010 digits. Python instead stores it as an array of base-2^30 digits, so it needs only 334 elements, all of which are zero except the most significant one, which is equal to 512.
  • We can do a sanity check: 512 × (2^30)^333 = 2^9 × 2^9990 = 2^9999.
  • This BigNum arithmetic is implemented by CPython, the Python interpreter, not by the CPU. When you compile C++ code, it is compiled into machine code for the specific hardware and runs directly on the CPU. Python code is never compiled directly to machine code: CPython compiles it to bytecode and interprets it, translating Python instructions into operations the CPU can execute.
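We can see this in action with a quick check (a minimal sketch; the exact memory size reported depends on your Python build):

```python
import sys

n = 2 ** 9999               # far beyond what a 64-bit integer can hold
print(len(str(n)))          # 3010 decimal digits
print(sys.getsizeof(n))     # grows with the number of base-2**30 digits stored

# Sanity check: the most significant base-2**30 digit is 512, at index 333
assert 512 * (2 ** 30) ** 333 == n
```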


2. Floating Point Number representations


Figure - 2

In Figure 2:

  • The first bit indicates the sign.
  • The next 8 bits are the exponent, which determines the magnitude of the number, i.e. how big the number is.
  • The next 23 bits are the fractional part (mantissa), i.e. the digits corresponding to the negative powers of two.
  • To convert this bit string into a decimal value, we use the formula shown in Figure 2: for normalized numbers, value = (-1)^sign × 2^(exponent - 127) × (1 + fraction).
  • Modern GPUs also support 16-bit floating-point numbers, which have less precision.
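A small sketch that unpacks the three bit fields of a 32-bit float and applies the formula above (normalized numbers only):

```python
import struct

def fp32_fields(x: float):
    """Decode the sign, exponent and fraction fields of an IEEE 754 float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # (-1)^sign * 2^(exponent - 127) * (1 + fraction / 2^23)
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
    return sign, exponent, fraction, value

print(fp32_fields(6.5))   # (0, 129, 5242880, 6.5)
```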


3. Introduction to Quantization


Figure - 3.1 - 1

3.1 Applying Quantization


Figure 3.1 - 2

  • Consider the above [Figure 3.1 - 2]: the first hidden layer of the neural network performs the operation X·W + B, where X is the input, W is a weight matrix and B is a bias. The goal of quantization is to quantize the input X, the weight matrix W and the bias B into integers, so that the operation becomes integer arithmetic, which is much faster than floating-point arithmetic.
  • We then take the output of this layer and dequantize it before feeding it to the next layer. The dequantization is done in such a way that the next layer should not even realize that there was quantization in the previous layer. In short, we need to perform quantization in such a way that the model's output does not change because of it; in other words, we are not willing to compromise the accuracy of the model.
  • So we need to find an appropriate mapping between floating-point numbers and integers (and vice versa) such that we do not lose the precision of the model, while at the same time reducing the space the model occupies in RAM and on disk and making these operations faster to compute.


Figure 3.1 - 3

  • In the above Figure 3.1 - 3, the first hidden layer has a 5x5 weight matrix. By applying quantization we reduce the precision of each number in the weight matrix by mapping it into a range that occupies fewer bits. For example, 2484.8 occupies 4 bytes (32 bits); we want to quantize it to use only 8 bits.
  • With 8 bits we can represent values from -127 to 127. So we perform this 8-bit quantization and then dequantize in the successive step.
  • When dequantizing, we should obtain the original matrix, but we lose some precision. For example, the second value in the dequantized matrix has changed from -323.89 to -332.61.
  • We need to minimize this loss of precision as much as possible. A minimal sketch of the round trip follows this list.
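The sketch below uses symmetric 8-bit quantization; the values are illustrative, not the exact matrix from the figure:

```python
import numpy as np

# Illustrative weights (not the exact matrix from Figure 3.1 - 3)
w = np.array([2484.8, -323.89, 150.25, -1200.5, 0.0], dtype=np.float32)

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto [-127, 127]
scale = np.abs(w).max() / 127
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize: close to the original values, but some precision is lost
w_dq = w_q.astype(np.float32) * scale
print(w_q)
print(w_dq)   # e.g. -323.89 comes back as roughly -332.6
```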


4. Types of Quantization

Figure - 4.1 - 1

  • The goal of asymmetric quantization is to map the original tensor, which in [Figure 4.1 - 1] is distributed in the range [-44.93, 43.31], to another range, [0, 255].
  • The other type of quantization is symmetric quantization. Here we map the original tensor to a symmetric range, even if the original range is not symmetric.
  • In the above figure, the original tensor values are in the range [-44.93, 43.31]; they are not symmetric with respect to zero. But we still treat the original tensor as symmetric ([-44.93, 44.93]). This has the advantage that zero in the original tensor is always mapped to zero in the quantized tensor.


4.1 Asymmetric Quantization

Figure - 4.1 - 1
Asymmetric Dequantization Formula

  • We can see that the dequantized numbers are similar to the originals but not exactly the same. A minimal sketch of this round trip follows.
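The figure above gives the asymmetric formulas; below is a sketch of the standard min-max (zero-point) version, assuming an unsigned 8-bit target range [0, 255]:

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, bits: int = 8):
    """Map [alpha, beta] = [x.min(), x.max()] onto [0, 2**bits - 1]."""
    alpha, beta = x.min(), x.max()
    scale = (beta - alpha) / (2 ** bits - 1)
    zero_point = int(round(-alpha / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** bits - 1).astype(np.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    return scale * (x_q.astype(np.float32) - zero_point)

x = np.array([-44.93, 0.0, 12.7, 43.31], dtype=np.float32)
x_q, s, z = asymmetric_quantize(x)
print(x_q, asymmetric_dequantize(x_q, s, z))   # similar to x, not identical
```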

4.2 Symmetric Quantization


Symmetric Quantization

  • Consider the top-most tensor in the above [Figure 4.2 - 1]. The quantization formula is as follows.

Symmetric Quantization Formula


  • We can see that some precision is lost in the process. Our goal is to make the dequantized tensor as similar to the original tensor as possible. One way to reduce this error is to increase the number of bits used for quantization.
  • However, we cannot just choose any number of bits. We want the matrix multiplication in the linear layer to be accelerated by the CPU, and the CPU works with a fixed number of bits; its operations are optimized for those widths.
  • For example, hardware is optimized for 8-bit, 16-bit, 32-bit and 64-bit operations. If we choose 11 bits for quantization, the CPU may not support accelerated operations on 11-bit values.

So we have to choose a good compromise between the number of bits and what the hardware actually supports. A minimal sketch of symmetric quantization follows.
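A sketch of symmetric quantization and dequantization, assuming an 8-bit signed target range:

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, bits: int = 8):
    """Map [-max|x|, +max|x|] onto [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    return x_q.astype(np.float32) * scale

x = np.array([-44.93, 0.0, 12.7, 43.31], dtype=np.float32)
x_q, s = symmetric_quantize(x)
print(x_q, symmetric_dequantize(x_q, s))   # zero stays exactly zero
```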


4.3 Applying Quantization - Floating point Case

Figure - 4.3 - 1

All the numbers in the above network are floating point.

  • The weight matrix is fixed, so we can quantize it ahead of time by calculating its α and β. We can also quantize the bias matrix, which is fixed as well.
  • But how do we quantize the input matrix?
  • One way is to use a method called dynamic quantization: for every input we receive, we calculate α and β "on the fly" and then quantize the input on the fly.
  • Now that the inputs are quantized, we can perform all the matrix multiplications as integer matrix multiplications. The output is an integer matrix Yq, which is still quantized, so we need to dequantize it back.
  • One way to dequantize the output is to use a method called calibration: we run some inputs through the network and observe the typical values of the output, from which we derive reasonable α and β (i.e. a scale and zero point) for it. Using these collected statistics, we can dequantize the output of the integer matrix multiplication back into floating-point numbers. A sketch of a dynamically quantized linear layer follows this list.
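A simplified sketch of this flow, assuming symmetric 8-bit quantization of both the fixed weights and the dynamically quantized input; in this toy symmetric case the int32 accumulator can be dequantized directly with the product of the two scales, and the bias is simply added back in floating point:

```python
import numpy as np

def sym_quant(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

# Offline: the weights are fixed, so quantize them once
w = np.random.randn(16, 8).astype(np.float32)
b = np.random.randn(8).astype(np.float32)
w_q, w_s = sym_quant(w)

def quantized_linear(x: np.ndarray) -> np.ndarray:
    # Dynamic quantization: compute the input's scale on the fly
    x_q, x_s = sym_quant(x)
    # Integer matmul with a 32-bit accumulator
    y_acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # Dequantize: the accumulator's scale is the product of the two scales
    return y_acc.astype(np.float32) * (x_s * w_s) + b

x = np.random.randn(4, 16).astype(np.float32)
print(np.abs(quantized_linear(x) - (x @ w + b)).max())   # small quantization error
```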


4.4 Applying Quantization - Integer Case

Figure 4.4 - 1




4.5 Low-precision Multiplication

Figure- 4.5 - 1

  • In the accumulator we sum all the products of the matrix multiplication. Say X1,1, W1,1, ... are all 8-bit values. When we multiply them, the result may not fit in 8 bits, and summing many such products makes it even larger. For this reason we usually use a 32-bit accumulator, and that is also why the bias term is quantized to 32 bits. A small sketch follows.
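A small sketch showing why the accumulator must be wider than the 8-bit operands:

```python
import numpy as np

a = np.array([120, -100, 90], dtype=np.int8)
b = np.array([115, -90, 127], dtype=np.int8)

# Each 8-bit x 8-bit product needs up to 16 bits, and summing many of them
# needs even more, so we accumulate in 32 bits.
acc = np.int32(0)
for x, w in zip(a, b):
    acc += np.int32(x) * np.int32(w)
print(acc)   # 34230 -- far outside the int8 range [-128, 127]
```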


4.6 Choosing alpha(α) and beta(β)


Figure - 4.6 -1

  • The strategy we used before is called the min-max strategy. There are other strategies as well.
  • The min-max strategy is sensitive to outliers: a single outlier stretches the range and causes a high quantization error for all the other values.
  • A solution is the percentile strategy: we set α and β to a percentile of the original distribution (for example, the 99th percentile) rather than the minimum and maximum. A small sketch follows.
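A sketch of choosing the range with percentiles instead of min/max (pure NumPy, illustrative values):

```python
import numpy as np

def percentile_range(x: np.ndarray, pct: float = 99.0):
    """Choose alpha/beta as percentiles of the distribution instead of min/max."""
    alpha = np.percentile(x, 100 - pct)
    beta = np.percentile(x, pct)
    return alpha, beta

x = np.concatenate([np.random.randn(10_000), [1000.0]])   # one large outlier
print(x.min(), x.max())        # the min-max range is blown up by the outlier
print(percentile_range(x))     # the percentile range ignores it
```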

Figure - 4.6 - 2


Figure 4.6 - 3

  • Cross-entropy strategy: in LLMs, the last layer is a linear layer followed by a softmax, which is used to choose a token from the vocabulary. The goal of the softmax is to produce a probability distribution, from which we pick tokens with greedy, beam-search or top-p strategies. So what we care about is not the exact values inside this distribution but the distribution itself.
  • The biggest number should remain the biggest number after quantization, and the intermediate numbers should not change the relative distribution. In this case we use the cross-entropy strategy: we choose α and β such that the cross-entropy between the dequantized values and the original values is minimized. A rough sketch follows.
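One possible way to implement this idea (a rough sketch with a simple grid search over symmetric clipping thresholds; real toolkits search the range more carefully):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))

def calibrate_logits_ce(logits, bits=8, n_candidates=200):
    """Search for a symmetric clipping range [-alpha, alpha] that minimizes the
    cross-entropy between the original and the quantize->dequantize softmax."""
    p = softmax(logits)
    qmax = 2 ** (bits - 1) - 1
    best_alpha, best_ce = None, np.inf
    for alpha in np.linspace(0.1, 1.0, n_candidates) * np.abs(logits).max():
        scale = alpha / qmax
        deq = np.clip(np.round(logits / scale), -qmax, qmax) * scale
        ce = cross_entropy(p, softmax(deq))
        if ce < best_ce:
            best_alpha, best_ce = alpha, ce
    return best_alpha

print(calibrate_logits_ce(np.random.randn(32000) * 3))
```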


4.7 Quantization Granularity


Figure 4.7 - 1

  • Convolutional layers are made up of many filters, or kernels. Each kernel is run over the image to compute specific features.
  • Each kernel is made of parameters that may be distributed differently. For example, one kernel's parameters may be distributed between -5 and +5, another's between -10 and +10, and another's between -6 and +6.
  • If we use the same α and β for all of them, we waste part of the quantization range for some of the kernels. In such cases it is better to perform channel-wise quantization: we calculate α and β for each kernel separately, so they differ per kernel. This results in better-quality quantization, hence we lose less precision. A minimal sketch of per-kernel scales follows.
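A sketch of per-channel (per-kernel) symmetric quantization, with one scale per output channel instead of one scale for the whole tensor:

```python
import numpy as np

def per_channel_symmetric_quantize(w: np.ndarray, bits: int = 8):
    """One scale per output channel (axis 0) instead of one for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w.reshape(w.shape[0], -1)).max(axis=1) / qmax
    shaped = scales.reshape(-1, *([1] * (w.ndim - 1)))   # broadcastable shape
    w_q = np.clip(np.round(w / shaped), -qmax, qmax).astype(np.int8)
    return w_q, scales

w = np.stack([np.random.uniform(-5, 5, (3, 3)),
              np.random.uniform(-10, 10, (3, 3)),
              np.random.uniform(-6, 6, (3, 3))])
w_q, scales = per_channel_symmetric_quantize(w)
print(scales)   # a different scale per kernel
```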


4.8 Post-Training Quantization

Figure - 4.8 - 1

In Figure 4.8 -1:

  • We have a pre-trained model that we want to quantize.
  • For example, say the pre-trained model classifies cats and dogs. We use pictures of cats and dogs as calibration data, which do not necessarily have to come from the training set.
  • We take the pre-trained model and attach observers that collect statistics (e.g. the minimum and maximum values) while we run inference on the model. These statistics are used to calculate the scale and zero point for each layer of the model, which we then use to quantize the model. A minimal observer sketch follows.
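A minimal sketch of such an observer, using the min-max strategy (pure NumPy; real frameworks attach observers like this to each layer automatically):

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of the values flowing through a layer."""
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x: np.ndarray):
        self.min = min(self.min, float(x.min()))
        self.max = max(self.max, float(x.max()))

    def qparams(self, bits: int = 8):
        scale = (self.max - self.min) / (2 ** bits - 1)
        zero_point = int(round(-self.min / scale))
        return scale, zero_point

obs = MinMaxObserver()
for _ in range(10):                       # "calibration" batches
    obs.observe(np.random.randn(32, 64))
print(obs.qparams())
```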


4.9 Quantization Aware Training

Figure - 4.9 - 1

  • We insert a sequence of fake quantize and dequantize operations between the layers; this is done on the fly during training. It introduces some quantization error, and we hope training will make the loss robust against this error, which usually leads to better performance of the quantized model. A sketch of a fake-quantize operation follows.
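A minimal sketch of a fake-quantize operation (symmetric, 8-bit by default): the tensor stays in floating point but carries the rounding error a truly quantized model would see.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int = 8):
    """Quantize and immediately dequantize, injecting quantization noise."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

x = np.random.randn(4, 4).astype(np.float32)
print(np.abs(fake_quantize(x) - x).max())   # the injected quantization noise
```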

4.9.1 Quantization Aware Training : Gradient

Figure - 4.9 - 2

  • In QAT we introduce observers between the layers that perform fake quantize and dequantize operations, and we do this while training. This means the backpropagation algorithm has to compute the gradient of the loss function with respect to these operations. But the quantization operation (rounding and clipping) is not differentiable.
  • How can the backpropagation algorithm calculate the gradient of the quantization operation?
  • We use an approximation called the straight-through estimator (STE): for values that fall inside the quantization range we set the gradient of the fake-quantize operation to 1 (i.e. pass the upstream gradient through unchanged), and for values outside the range we set it to 0. A sketch follows.
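A minimal PyTorch-style sketch of a symmetric fake-quantize op whose backward pass is the straight-through estimator (illustrative only, not the implementation used by any particular framework):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, qmax):
        ctx.save_for_backward(x)
        ctx.bound = scale * qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: gradient 1 inside the range, 0 outside
        mask = (x.abs() <= ctx.bound).to(grad_out.dtype)
        return grad_out * mask, None, None

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, x.abs().max().item() / 127, 127)
y.sum().backward()
print(x.grad)   # all ones here, since every value is inside the range
```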


4.9.2 Quantization Aware Training : Why it Works?


Figure - 4.9 - 3

  • When we train a model with no notion of quantization, the loss function is computed for a particular set of weights, and the goal of gradient descent is to find weights that minimize the loss; we usually end up in a local minimum.
  • The goal of QAT is to steer training towards a local minimum that is wider. Why?
  • Because the weight values will move a little after quantization. In Figure 4.9 - 3, the minimum in the plot on the right is narrow, so the loss increases a lot after quantization.
  • With QAT we end up in a wider local minimum, so if the weights move a little after training, the loss does not increase by much. This is why quantization-aware training works.


Links to the original sources:

  1. https://youtu.be/0VdNflU08yA
  2. https://github.com/hkproj/quantization-note

