Quantization

The primary optimization feature of the Post-training Optimization Toolkit (POT) is uniform quantization. In general, this method supports an arbitrary number of bits, greater than or equal to two, for representing weights and activations. During quantization, the method automatically inserts `FakeQuantize` operations into the model graph, based on a predefined hardware target, to produce the most hardware-friendly optimized model.

After that, different quantization algorithms can tune the `FakeQuantize` parameters or remove some of the operations in order to meet the accuracy criteria. The resulting *fakequantized* models are interpreted and transformed into real low-precision models during inference in the OpenVINO™ Inference Engine runtime, which is where the actual performance improvement comes from.

Currently, the POT provides two algorithms for 8-bit quantization, which are verified and provide stable results on a wide range of DNN models:

- **DefaultQuantization** is the default method. It provides fast and, in most cases, accurate results for 8-bit quantization. For details, see the DefaultQuantization Algorithm documentation.
- **AccuracyAwareQuantization** keeps the accuracy drop after quantization within a predefined range, at the cost of some of the performance improvement. It may require more time for quantization. For details, see the AccuracyAwareQuantization Algorithm documentation.

There is also **TunableQuantization**, which enables tuning of quantization hyperparameters. It is a variant of **MinMaxQuantization**, which is a part of the **DefaultQuantization** pipeline, and it is provided for use with a global optimizer that tunes a possible quantization scheme based on predefined accuracy drop and latency improvement criteria. TunableQuantization is usually used as a part of a pipeline with auxiliary algorithms. See the TunableQuantization Algorithm documentation.

Quantization is parametrized by the clamping range and the number of quantization levels:

\[ output = \frac{\left\lfloor (clamp(input; input\_low, input\_high)-input\_low) * s\right \rceil}{s} + input\_low \]

\[ clamp(input; input\_low, input\_high) = min(max(input, input\_low), input\_high) \]

\[ s=\frac{levels-1}{input\_high - input\_low} \]

In the formulas:

- `input_low` and `input_high` represent the quantization range
- \(\left\lfloor\cdot\right \rceil\) denotes rounding to the nearest integer
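The formulas above can be sketched in a few lines of plain Python for a single scalar value. This is only an illustration, not the POT or Inference Engine implementation: the function names `round_nearest` and `fake_quantize` are hypothetical, and the tie-breaking rule (halves rounded up) is an assumption, since the text only says "rounding to the nearest integer".

```python
import math


def round_nearest(x: float) -> int:
    # Round to the nearest integer. Rounding halves up is an assumption;
    # the formula itself does not specify how ties are broken.
    return math.floor(x + 0.5)


def fake_quantize(x: float, input_low: float, input_high: float, levels: int) -> float:
    # clamp(input; input_low, input_high)
    clamped = min(max(x, input_low), input_high)
    # s = (levels - 1) / (input_high - input_low)
    s = (levels - 1) / (input_high - input_low)
    # Snap to the nearest of the `levels` grid points, then map back
    # to the floating-point range.
    return round_nearest((clamped - input_low) * s) / s + input_low


# With 5 levels on [0, 1] the grid is 0, 0.25, 0.5, 0.75, 1.
print(fake_quantize(0.3, 0.0, 1.0, 5))   # 0.25
print(fake_quantize(2.0, 0.0, 1.0, 5))   # 1.0 (clamped to input_high)
```

Note that the output is still a floating-point value; it is merely restricted to `levels` evenly spaced points inside the clamping range.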

The POT supports symmetric and asymmetric quantization of weights and activations, which are controlled by the `preset`. The main difference between them is that in the symmetric mode the floating-point zero is mapped directly to the integer zero, while in the asymmetric mode it can be mapped to an arbitrary integer. In either mode, the floating-point zero is mapped exactly to a quantization level, without rounding error. See this tutorial for details.

Below is the detailed description of quantization formulas for both modes. These formulas are used both in the POT to quantize weights of the model and in the OpenVINO™ Inference Engine runtime when quantizing activations during the inference.

In the symmetric mode, the formula is parametrized by the `scale` parameter, which is tuned during the quantization process:

\[ input\_low=scale*\frac{level\_low}{level\_high} \]

\[ input\_high=scale \]

where `level_low` and `level_high` represent the range of the discrete signal.

For weights:

\[ level\_low=-2^{bits-1}+1 \]

\[ level\_high=2^{bits-1}-1 \]

\[ levels=255 \]

For unsigned activations:

\[ level\_low=0 \]

\[ level\_high=2^{bits}-1 \]

\[ levels=256 \]

For signed activations:

\[ level\_low=-2^{bits-1} \]

\[ level\_high=2^{bits-1}-1 \]

\[ levels=256 \]
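As a worked example of the three cases above, the sketch below computes `input_low`, `input_high`, and `levels` from `scale` and the bit width. The function name `symmetric_range` and the `kind` strings are hypothetical, chosen only for this illustration. Deriving `levels` as `level_high - level_low + 1` is also an assumption; it reproduces the values 255 and 256 listed above for `bits = 8`.

```python
def symmetric_range(scale: float, bits: int = 8, kind: str = "weights"):
    # Pick the discrete range for the requested tensor kind.
    if kind == "weights":
        level_low = -(2 ** (bits - 1)) + 1
        level_high = 2 ** (bits - 1) - 1
    elif kind == "unsigned_activations":
        level_low = 0
        level_high = 2 ** bits - 1
    elif kind == "signed_activations":
        level_low = -(2 ** (bits - 1))
        level_high = 2 ** (bits - 1) - 1
    else:
        raise ValueError(f"unknown kind: {kind}")

    # Assumption: the number of levels equals the size of the discrete range.
    levels = level_high - level_low + 1

    # input_low = scale * level_low / level_high, input_high = scale
    input_low = scale * level_low / level_high
    input_high = scale
    return input_low, input_high, levels


print(symmetric_range(1.0, 8, "weights"))               # (-1.0, 1.0, 255)
print(symmetric_range(1.0, 8, "unsigned_activations"))  # (0.0, 1.0, 256)
print(symmetric_range(127.0, 8, "signed_activations"))  # (-128.0, 127.0, 256)
```

For weights the range is symmetric around zero, which is why one of the 256 possible 8-bit codes is left unused and `levels` is 255.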

In the asymmetric mode, the quantization formula is parametrized by `input_low` and `input_range`, which are tunable parameters:

\[ input\_high=input\_low + input\_range \]

\[ levels=256 \]

For weights and activations the following quantization mode is applied:

\[ {input\_low}' = min(input\_low, 0) \]

\[ {input\_high}' = max(input\_high, 0) \]

\[ ZP= \left\lfloor \frac{-{input\_low}'*(levels-1)}{{input\_high}'-{input\_low}'} \right \rceil \]

\[ {input\_high}''=\frac{ZP-levels+1}{ZP}*{input\_low}' \]

\[ {input\_low}''=\frac{ZP}{ZP-levels+1}*{input\_high}' \]

\[ {input\_low,input\_high} = \begin{cases} {input\_low}',{input\_high}', & ZP \in \{0,levels-1\} \\ {input\_low}',{input\_high}'', & {input\_high}'' - {input\_low}' > {input\_high}' - {input\_low}'' \\ {input\_low}'',{input\_high}', & {input\_high}'' - {input\_low}' \leq {input\_high}' - {input\_low}'' \\ \end{cases} \]
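The range adjustment above can be followed step by step in the sketch below. It is a plain-Python illustration of the formulas, not the POT implementation. The function names are hypothetical, halves are assumed to round up, and `input_range` is assumed to be strictly positive so that the denominators are non-zero.

```python
import math


def round_nearest(x: float) -> int:
    # Round to the nearest integer; rounding halves up is an assumption.
    return math.floor(x + 0.5)


def adjust_asymmetric_range(input_low: float, input_range: float, levels: int = 256):
    # Assumes input_range > 0.
    input_high = input_low + input_range

    # Extend the range so that it always contains zero.
    low = min(input_low, 0.0)    # input_low'
    high = max(input_high, 0.0)  # input_high'

    # Integer position of the floating-point zero on the grid.
    zp = round_nearest(-low * (levels - 1) / (high - low))

    # Zero already sits on the first or last level: nothing to adjust
    # (this also avoids dividing by zero below).
    if zp == 0 or zp == levels - 1:
        return low, high

    # Candidate bounds that place zero exactly on level `zp`.
    high2 = (zp - levels + 1) / zp * low   # input_high''
    low2 = zp / (zp - levels + 1) * high   # input_low''

    # Keep the wider of the two candidate ranges.
    if high2 - low > high - low2:
        return low, high2
    return low2, high


lo, hi = adjust_asymmetric_range(-1.0, 3.5)
# After the adjustment, zero falls on an integer level of the grid.
print(-lo * 255 / (hi - lo))
```

The adjustment stretches one end of the range slightly, so that the floating-point zero lands exactly on a quantization level instead of between two of them.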