Quantization Fundamentals For People In A Hurry
1. Numerical Representations

Figure 1-1

Computers use a fixed number of bits to represent numbers, so how can Python handle such big numbers without any problems?

  • When you run 2^9999 in Python, you get a result that is far bigger than anything a 64-bit integer can hold.
  • How is this possible without any problem?
  • Python uses BigNum arithmetic.
  • In Figure 1-1, we represented the number 6 using base 2. In base 10, 6 needs only a single digit, but in base 2 it needs 3 digits. So we can infer that the smaller the base, the more digits we need to represent a number.
  • Python does the inverse of the above scenario: it stores each big integer as an array of digits, where each digit is a digit of the number in base 2^30. Overall, we need fewer digits to store big numbers.

Figure 1-2


  • If we store the number 2^9999 in base 10, we need 3,010 digits. Python instead stores it as an array of base-2^30 digits, so it needs only 334 elements, all of which are zero except the most significant one, which is equal to 512.
  • We can do a sanity check: 512 × (2^30)^333 = 2^9 × 2^9990 = 2^9999.
  • This BigNum arithmetic is implemented by CPython, the Python interpreter, not by the CPU. When you compile C++ code, it is compiled into machine code for the specific hardware and runs directly on the CPU. Python code is never compiled directly to machine code: CPython compiles it to bytecode and interprets it, translating Python instructions into operations the CPU can execute.
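We can see this in action with a quick check (a minimal sketch; the exact memory size reported depends on your Python build):

```python
import sys

n = 2 ** 9999               # far beyond what a 64-bit integer can hold
print(len(str(n)))          # 3010 decimal digits
print(sys.getsizeof(n))     # grows with the number of base-2**30 digits stored

# Sanity check: the most significant base-2**30 digit is 512, at index 333
assert 512 * (2 ** 30) ** 333 == n
```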


2. Floating Point Number representations


Figure - 2

In Figure 2:

  • The first bit indicates the sign.
  • The next 8 bits are the exponent, which determines the magnitude of the number, i.e. how big the number is.
  • The next 23 bits are the fractional part (mantissa), i.e. the digits corresponding to the negative powers of two.
  • To convert this bit string into a decimal value, we use the formula shown in Figure 2: for normalized numbers, value = (-1)^sign × 2^(exponent - 127) × (1 + fraction).
  • Modern GPUs also support 16-bit floating-point numbers, which have less precision.
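A small sketch that unpacks the three bit fields of a 32-bit float and applies the formula above (normalized numbers only):

```python
import struct

def fp32_fields(x: float):
    """Decode the sign, exponent and fraction fields of an IEEE 754 float32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # (-1)^sign * 2^(exponent - 127) * (1 + fraction / 2^23)
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
    return sign, exponent, fraction, value

print(fp32_fields(6.5))   # (0, 129, 5242880, 6.5)
```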


3. Introduction to Quantization


Figure - 3.1 - 1

3.1 Applying Quantization


Figure 3.1 - 2

  • Consider the above [Figure 3.1 - 2]: the first hidden layer of the neural network performs the operation X·W + B, where X is the input, W is a weight matrix and B is a bias. The goal of quantization is to quantize the input X, the weight matrix W and the bias B into integers, so that the operation becomes integer arithmetic, which is much faster than floating-point arithmetic.
  • We then take the output of this layer and dequantize it before feeding it to the next layer. The dequantization is done in such a way that the next layer should not even realize that there was quantization in the previous layer. In short, we need to perform quantization in such a way that the model's output does not change because of it; in other words, we are not willing to compromise the accuracy of the model.
  • So we need to find an appropriate mapping between floating-point numbers and integers (and vice versa) such that we do not lose the precision of the model, while at the same time reducing the space the model occupies in RAM and on disk and making these operations faster to compute.


Figure 3.1 - 3

  • In the above Figure 3.1 - 3, the first hidden layer has a 5x5 weight matrix. By applying quantization we reduce the precision of each number in the weight matrix by mapping it into a range that occupies fewer bits. For example, 2484.8 occupies 4 bytes (32 bits); we want to quantize it to use only 8 bits.
  • With 8 bits we can represent values from -127 to 127. So we perform this 8-bit quantization and then dequantize in the successive step.
  • When dequantizing, we should obtain the original matrix, but we lose some precision. For example, the second value in the dequantized matrix has changed from -323.89 to -332.61.
  • We need to minimize this loss of precision as much as possible. A minimal sketch of the round trip follows this list.
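The sketch below uses symmetric 8-bit quantization; the values are illustrative, not the exact matrix from the figure:

```python
import numpy as np

# Illustrative weights (not the exact matrix from Figure 3.1 - 3)
w = np.array([2484.8, -323.89, 150.25, -1200.5, 0.0], dtype=np.float32)

# Symmetric 8-bit quantization: map [-max|w|, +max|w|] onto [-127, 127]
scale = np.abs(w).max() / 127
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize: close to the original values, but some precision is lost
w_dq = w_q.astype(np.float32) * scale
print(w_q)
print(w_dq)   # e.g. -323.89 comes back as roughly -332.6
```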


4. Types of Quantization

Figure - 4.1 - 1

  • The goal of asymmetric quantization is to map the original tensor, which in [Figure 4.1 - 1] is distributed in the range [-44.93, 43.31], to another range, [0, 255].
  • The other type of quantization is symmetric quantization. Here we map the original tensor to a symmetric range, even if the original range is not symmetric.
  • In the above figure, the original tensor values are in the range [-44.93, 43.31]; they are not symmetric with respect to zero. But we still treat the original tensor as symmetric ([-44.93, 44.93]). This has the advantage that zero in the original tensor is always mapped to zero in the quantized tensor.


4.1 Asymmetric Quantization

Figure - 4.1 - 1
Asymmetric Dequantization Formula

  • We can see that the dequantized numbers are similar to the originals but not exactly the same. A minimal sketch of this round trip follows.
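The figure above gives the asymmetric formulas; below is a sketch of the standard min-max (zero-point) version, assuming an unsigned 8-bit target range [0, 255]:

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, bits: int = 8):
    """Map [alpha, beta] = [x.min(), x.max()] onto [0, 2**bits - 1]."""
    alpha, beta = x.min(), x.max()
    scale = (beta - alpha) / (2 ** bits - 1)
    zero_point = int(round(-alpha / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** bits - 1).astype(np.uint8)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    return scale * (x_q.astype(np.float32) - zero_point)

x = np.array([-44.93, 0.0, 12.7, 43.31], dtype=np.float32)
x_q, s, z = asymmetric_quantize(x)
print(x_q, asymmetric_dequantize(x_q, s, z))   # similar to x, not identical
```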

4.2 Symmetric Quantization


Symmetric Quantization

  • Consider the top-most tensor in the above [Figure 4.2 - 1]. The quantization formula is as follows.

Symmetric Quantization Formula


  • We can see that some precision is lost in the process. Our goal is to make the dequantized tensor as similar to the original tensor as possible. One way to reduce this error is to increase the number of bits used for quantization.
  • However, we cannot just choose any number of bits. We want the matrix multiplication in the linear layer to be accelerated by the CPU, and the CPU works with a fixed number of bits; its operations are optimized for those widths.
  • For example, hardware is optimized for 8-bit, 16-bit, 32-bit and 64-bit operations. If we choose 11 bits for quantization, the CPU may not support accelerated operations on 11-bit values.

So we have to choose a good compromise between the number of bits and what the hardware actually supports. A minimal sketch of symmetric quantization follows.
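A sketch of symmetric quantization and dequantization, assuming an 8-bit signed target range:

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, bits: int = 8):
    """Map [-max|x|, +max|x|] onto [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = np.abs(x).max() / qmax
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def symmetric_dequantize(x_q, scale):
    return x_q.astype(np.float32) * scale

x = np.array([-44.93, 0.0, 12.7, 43.31], dtype=np.float32)
x_q, s = symmetric_quantize(x)
print(x_q, symmetric_dequantize(x_q, s))   # zero stays exactly zero
```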


4.3 Applying Quantization - Floating point Case

Figure - 4.3 - 1

All the numbers in the above network are floating point.

  • The weight matrix is fixed, so we can quantize it ahead of time by calculating its α and β. We can also quantize the bias matrix, which is fixed as well.
  • But how do we quantize the input matrix?
  • One way is to use a method called dynamic quantization: for every input we receive, we calculate α and β "on the fly" and then quantize the input on the fly.
  • Now that the inputs are quantized, we can perform all the matrix multiplications as integer matrix multiplications. The output is an integer matrix Yq, which is still quantized, so we need to dequantize it back.
  • One way to dequantize the output is to use a method called calibration: we run some inputs through the network and observe the typical values of the output, from which we derive reasonable α and β (i.e. a scale and zero point) for it. Using these collected statistics, we can dequantize the output of the integer matrix multiplication back into floating-point numbers. A sketch of a dynamically quantized linear layer follows this list.
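A simplified sketch of this flow, assuming symmetric 8-bit quantization of both the fixed weights and the dynamically quantized input; in this toy symmetric case the int32 accumulator can be dequantized directly with the product of the two scales, and the bias is simply added back in floating point:

```python
import numpy as np

def sym_quant(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

# Offline: the weights are fixed, so quantize them once
w = np.random.randn(16, 8).astype(np.float32)
b = np.random.randn(8).astype(np.float32)
w_q, w_s = sym_quant(w)

def quantized_linear(x: np.ndarray) -> np.ndarray:
    # Dynamic quantization: compute the input's scale on the fly
    x_q, x_s = sym_quant(x)
    # Integer matmul with a 32-bit accumulator
    y_acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    # Dequantize: the accumulator's scale is the product of the two scales
    return y_acc.astype(np.float32) * (x_s * w_s) + b

x = np.random.randn(4, 16).astype(np.float32)
print(np.abs(quantized_linear(x) - (x @ w + b)).max())   # small quantization error
```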


4.4 Applying Quantization - Integer Case

Figure 4.4 - 1




4.5 Low-precision Multiplication

Figure- 4.5 - 1

  • In the accumulator we sum all the products of the matrix multiplication. Say X1,1, W1,1, ... are all 8-bit values. When we multiply them, the result may not fit in 8 bits, and summing many such products makes it even larger. For this reason we usually use a 32-bit accumulator, and that is also why the bias term is quantized to 32 bits. A small sketch follows.
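A small sketch showing why the accumulator must be wider than the 8-bit operands:

```python
import numpy as np

a = np.array([120, -100, 90], dtype=np.int8)
b = np.array([115, -90, 127], dtype=np.int8)

# Each 8-bit x 8-bit product needs up to 16 bits, and summing many of them
# needs even more, so we accumulate in 32 bits.
acc = np.int32(0)
for x, w in zip(a, b):
    acc += np.int32(x) * np.int32(w)
print(acc)   # 34230 -- far outside the int8 range [-128, 127]
```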


4.6 Choosing alpha(α) and beta(β)


Figure - 4.6 -1

  • The strategy we used before is called the min-max strategy. There are other strategies as well.
  • The min-max strategy is sensitive to outliers: a single outlier stretches the range and causes a high quantization error for all the other values.
  • A solution is the percentile strategy: we set α and β to a percentile of the original distribution (for example, the 99th percentile) rather than the minimum and maximum. A small sketch follows.
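A sketch of choosing the range with percentiles instead of min/max (pure NumPy, illustrative values):

```python
import numpy as np

def percentile_range(x: np.ndarray, pct: float = 99.0):
    """Choose alpha/beta as percentiles of the distribution instead of min/max."""
    alpha = np.percentile(x, 100 - pct)
    beta = np.percentile(x, pct)
    return alpha, beta

x = np.concatenate([np.random.randn(10_000), [1000.0]])   # one large outlier
print(x.min(), x.max())        # the min-max range is blown up by the outlier
print(percentile_range(x))     # the percentile range ignores it
```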

Figure - 4.6 - 2


Figure 4.6 - 3

  • Cross-entropy strategy: in LLMs, the last layer is a linear layer followed by a softmax, which is used to choose a token from the vocabulary. The goal of the softmax is to produce a probability distribution, from which we pick tokens with greedy, beam-search or top-p strategies. So what we care about is not the exact values inside this distribution but the distribution itself.
  • The biggest number should remain the biggest number after quantization, and the intermediate numbers should not change the relative distribution. In this case we use the cross-entropy strategy: we choose α and β such that the cross-entropy between the dequantized values and the original values is minimized. A rough sketch follows.
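One possible way to implement this idea (a rough sketch with a simple grid search over symmetric clipping thresholds; real toolkits search the range more carefully):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))

def calibrate_logits_ce(logits, bits=8, n_candidates=200):
    """Search for a symmetric clipping range [-alpha, alpha] that minimizes the
    cross-entropy between the original and the quantize->dequantize softmax."""
    p = softmax(logits)
    qmax = 2 ** (bits - 1) - 1
    best_alpha, best_ce = None, np.inf
    for alpha in np.linspace(0.1, 1.0, n_candidates) * np.abs(logits).max():
        scale = alpha / qmax
        deq = np.clip(np.round(logits / scale), -qmax, qmax) * scale
        ce = cross_entropy(p, softmax(deq))
        if ce < best_ce:
            best_alpha, best_ce = alpha, ce
    return best_alpha

print(calibrate_logits_ce(np.random.randn(32000) * 3))
```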


4.7 Quantization Granularity


Figure 4.7 - 1

  • Convolutional layers are made up of many filters, or kernels. Each kernel is run over the image to compute specific features.
  • Each kernel is made of parameters that may be distributed differently. For example, one kernel's parameters may be distributed between -5 and +5, another's between -10 and +10, and another's between -6 and +6.
  • If we use the same α and β for all of them, we waste part of the quantization range for some of the kernels. In such cases it is better to perform channel-wise quantization: we calculate α and β for each kernel separately, so they differ per kernel. This results in better-quality quantization, hence we lose less precision. A minimal sketch of per-kernel scales follows.
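A sketch of per-channel (per-kernel) symmetric quantization, with one scale per output channel instead of one scale for the whole tensor:

```python
import numpy as np

def per_channel_symmetric_quantize(w: np.ndarray, bits: int = 8):
    """One scale per output channel (axis 0) instead of one for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w.reshape(w.shape[0], -1)).max(axis=1) / qmax
    shaped = scales.reshape(-1, *([1] * (w.ndim - 1)))   # broadcastable shape
    w_q = np.clip(np.round(w / shaped), -qmax, qmax).astype(np.int8)
    return w_q, scales

w = np.stack([np.random.uniform(-5, 5, (3, 3)),
              np.random.uniform(-10, 10, (3, 3)),
              np.random.uniform(-6, 6, (3, 3))])
w_q, scales = per_channel_symmetric_quantize(w)
print(scales)   # a different scale per kernel
```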


4.8 Post-Training Quantization

Figure - 4.8 - 1

In Figure 4.8 -1:

  • We have a pre-trained model that we want to quantize.
  • For example, say the pre-trained model classifies cats and dogs. We use pictures of cats and dogs as calibration data, which do not necessarily have to come from the training set.
  • We take the pre-trained model and attach observers that collect statistics (e.g. the minimum and maximum values) while we run inference on the model. These statistics are used to calculate the scale and zero point for each layer of the model, which we then use to quantize the model. A minimal observer sketch follows.
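A minimal sketch of such an observer, using the min-max strategy (pure NumPy; real frameworks attach observers like this to each layer automatically):

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of the values flowing through a layer."""
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x: np.ndarray):
        self.min = min(self.min, float(x.min()))
        self.max = max(self.max, float(x.max()))

    def qparams(self, bits: int = 8):
        scale = (self.max - self.min) / (2 ** bits - 1)
        zero_point = int(round(-self.min / scale))
        return scale, zero_point

obs = MinMaxObserver()
for _ in range(10):                       # "calibration" batches
    obs.observe(np.random.randn(32, 64))
print(obs.qparams())
```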


4.9 Quantization Aware Training

Figure - 4.9 - 1

  • We insert a sequence of fake quantize and dequantize operations between the layers; this is done on the fly during training. It introduces some quantization error, and we hope training will make the loss robust against this error, which usually leads to better performance of the quantized model. A sketch of a fake-quantize operation follows.
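A minimal sketch of a fake-quantize operation (symmetric, 8-bit by default): the tensor stays in floating point but carries the rounding error a truly quantized model would see.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int = 8):
    """Quantize and immediately dequantize, injecting quantization noise."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

x = np.random.randn(4, 4).astype(np.float32)
print(np.abs(fake_quantize(x) - x).max())   # the injected quantization noise
```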

4.9.1 Quantization Aware Training : Gradient

Figure - 4.9 - 2

  • In QAT we introduce observers between the layers that perform fake quantize and dequantize operations, and we do this while training. This means the backpropagation algorithm has to compute the gradient of the loss function with respect to these operations. But the quantization operation (rounding and clipping) is not differentiable.
  • How can the backpropagation algorithm calculate the gradient of the quantization operation?
  • We use an approximation called the straight-through estimator (STE): for values that fall inside the quantization range we set the gradient of the fake-quantize operation to 1 (i.e. pass the upstream gradient through unchanged), and for values outside the range we set it to 0. A sketch follows.
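A minimal PyTorch-style sketch of a symmetric fake-quantize op whose backward pass is the straight-through estimator (illustrative only, not the implementation used by any particular framework):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, qmax):
        ctx.save_for_backward(x)
        ctx.bound = scale * qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Straight-through estimator: gradient 1 inside the range, 0 outside
        mask = (x.abs() <= ctx.bound).to(grad_out.dtype)
        return grad_out * mask, None, None

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, x.abs().max().item() / 127, 127)
y.sum().backward()
print(x.grad)   # all ones here, since every value is inside the range
```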


4.9.2 Quantization Aware Training : Why it Works?


Figure - 4.9 - 3

  • When we train a model with no notion of quantization, the loss function is computed for a particular set of weights, and the goal of gradient descent is to find weights that minimize the loss; we usually end up in a local minimum.
  • The goal of QAT is to steer training towards a local minimum that is wider. Why?
  • Because the weight values will move a little after quantization. In Figure 4.9 - 3, the minimum in the plot on the right is narrow, so the loss increases a lot after quantization.
  • With QAT we end up in a wider local minimum, so if the weights move a little after training, the loss does not increase by much. This is why quantization-aware training works.


Links to the original sources:

  1. https://youtu.be/0VdNflU08yA
  2. https://github.com/hkproj/quantization-note

