Quantization

The primary optimization feature of the toolkit is uniform quantization. In general, this method supports an arbitrary number of bits (>= 2) for representing weights and activations. During the quantization process, so-called FakeQuantize operations are automatically inserted into the model graph based on the predefined hardware target in order to produce the most hardware-friendly optimized model. Afterwards, different quantization algorithms can tune the FakeQuantize parameters or remove some operations in order to meet the accuracy criteria. The resulting "fakequantized" model can be interpreted and transformed into a real low-precision model at runtime, yielding a real performance improvement.

Quantization algorithms

The toolkit provides multiple quantization and auxiliary algorithms that help restore accuracy after quantizing weights and activations. The algorithms can form independent optimization pipelines that you apply to quantize a model. However, only the following quantization algorithms for 8-bit precision are verified and recommended to produce stable and reliable results for DNN model quantization:

TunableQuantization enables tuning of quantization hyperparameters. It is a tunable variant of MinMaxQuantization, which is part of the DefaultQuantization pipeline, and is intended for use with a global optimizer that searches for a quantization scheme satisfying predefined accuracy-drop and latency-improvement criteria. TunableQuantization is usually used as part of a pipeline together with auxiliary algorithms. See the TunableQuantization Algorithm documentation.
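A pipeline that uses this algorithm is normally selected through the toolkit's compression configuration. The snippet below is only a minimal sketch of such a configuration written as a Python dictionary; the parameter names (`preset`, `stat_subset_size`) and the nesting are assumptions made for illustration, not a definitive schema.

```python
# Illustrative sketch only: the key names and structure are assumed for the
# example and should be checked against the toolkit's configuration reference.
compression_config = {
    "algorithms": [
        {
            "name": "TunableQuantization",    # the algorithm described above
            "params": {
                "preset": "performance",      # assumed preset identifier
                "stat_subset_size": 300,      # assumed size of the calibration subset
            },
        }
    ]
}
```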

Quantization formula

Quantization is parametrized by clamping range and number of quantization levels:

\[ output = \frac{\left\lfloor (clamp(input; input\_low, input\_high) - input\_low) * s \right\rceil}{s} + input\_low \]

\[ clamp(input; input\_low, input\_high) = min(max(input, input\_low), input\_high) \]

\[ s=\frac{levels-1}{input\_high - input\_low} \]

input_low and input_high represent the quantization range and

\[\left\lfloor\cdot\right \rceil\]

denotes rounding to the nearest integer.
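The formula can be mirrored directly in code. The following NumPy sketch is for illustration only; the function name fake_quantize and the choice of NumPy are assumptions made for the example, not part of the toolkit API.

```python
import numpy as np

def fake_quantize(x, input_low, input_high, levels):
    """Uniform quantization as defined above: clamp to [input_low, input_high],
    scale by s = (levels - 1) / (input_high - input_low), round to the nearest
    integer level, and map the result back to the floating-point domain."""
    s = (levels - 1) / (input_high - input_low)
    clamped = np.clip(x, input_low, input_high)
    return np.round((clamped - input_low) * s) / s + input_low

# Example: 8-bit quantization (levels = 256) over the range [-1.0, 1.0].
x = np.array([-1.5, -0.333, 0.0, 0.4999, 1.2])
print(fake_quantize(x, input_low=-1.0, input_high=1.0, levels=256))
```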

The toolkit supports two quantization modes: symmetric and asymmetric. The main difference between them is that in the symmetric mode the floating-point zero is mapped directly to the integer zero, while in the asymmetric mode it can be mapped to any integer within the range; in both cases the floating-point zero is mapped to a quantization level without rounding error.
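As a small numeric illustration of this zero-mapping property (reusing the fake_quantize sketch above with a range chosen purely for the example): when the lower bound is an exact multiple of the quantization step, floating-point zero falls exactly on one of the levels.

```python
# With input_low = -64.0, input_high = 191.0 and levels = 256 the step is
# (191 - (-64)) / 255 = 1.0, so 0.0 sits exactly on level 64 and is
# reproduced without rounding error.
print(fake_quantize(0.0, input_low=-64.0, input_high=191.0, levels=256))  # -> 0.0
```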

Symmetric quantization

The formula is parametrized by the scale parameter that is tuned during the quantization process:

\[ input\_low=scale*\frac{level\_low}{level\_high} \]

\[ input\_high=scale \]

Where level_low and level_high represent the range of the discrete signal.
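Below is a minimal sketch of how the symmetric range follows from the scale, reusing fake_quantize and NumPy from the earlier example. The values level_low = -128 and level_high = 127 are assumed here for a signed 8-bit grid; the text above does not fix them.

```python
def symmetric_range(scale, level_low=-128, level_high=127):
    """Derive (input_low, input_high) from the tuned scale per the formulas above.
    level_low and level_high are assumed example values for a signed 8-bit grid."""
    input_low = scale * level_low / level_high
    input_high = scale
    return input_low, input_high

low, high = symmetric_range(scale=2.0)   # -> roughly (-2.016, 2.0)
weights = np.array([-3.0, -1.0, 0.0, 0.5, 2.5])
print(fake_quantize(weights, low, high, levels=256))
```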

Asymmetric quantization

The quantization formula is parametrized by input_low and input_range, which are tunable parameters:

\[ input\_high=input\_low + input\_range \]

\[ levels=256 \]

For both weights and activations, the quantization range is then adjusted so that the floating-point zero maps exactly to an integer level:

\[ {input\_low}' = min(input\_low, 0) \]

\[ {input\_high}' = max(input\_high, 0) \]

\[ ZP= \left\lfloor \frac{-{input\_low}'*(levels-1)}{{input\_high}'-{input\_low}'} \right \rceil \]

\[ {input\_high}''=\frac{ZP-levels+1}{ZP}*{input\_low}' \]

\[ {input\_low}''=\frac{ZP}{ZP-levels+1}*{input\_high}' \]

\[ input\_low, input\_high = \begin{cases} {input\_low}', {input\_high}', & ZP \in \{0, levels-1\} \\ {input\_low}', {input\_high}'', & {input\_high}'' - {input\_low}' > {input\_high}' - {input\_low}'' \\ {input\_low}'', {input\_high}', & {input\_high}'' - {input\_low}' \leq {input\_high}' - {input\_low}'' \end{cases} \]
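The range adjustment above can be traced step by step in code. The sketch below mirrors the formulas directly; the function name align_zero_point is an illustrative choice, and Python's built-in round stands in for the rounding operator.

```python
def align_zero_point(input_low, input_high, levels=256):
    """Adjust (input_low, input_high) so that the zero point ZP is an integer
    level, following the case analysis above."""
    low = min(input_low, 0.0)                         # input_low'
    high = max(input_high, 0.0)                       # input_high'
    zp = round(-low * (levels - 1) / (high - low))    # ZP
    if zp in (0, levels - 1):
        return low, high
    high2 = (zp - levels + 1) / zp * low              # input_high''
    low2 = zp / (zp - levels + 1) * high              # input_low''
    if high2 - low > high - low2:
        return low, high2
    return low2, high

# e.g. (-0.2, 1.9) is widened to roughly (-0.2, 1.925) so that zero lands
# (up to floating-point error) on an integer level.
print(align_zero_point(-0.2, 1.9))
```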