Low Precision Optimization Guide

Introduction

This document describes best-known methods for using the low-precision capabilities of the OpenVINO™ toolkit to transform models into a more hardware-friendly representation using methods such as quantization.

Currently, these capabilities are represented by several components:

- Low-precision runtime
- Post-training Optimization Tool (POT)
- Neural Network Compression Framework (NNCF)

The first two components are part of the OpenVINO toolkit itself, while the latter is a separate tool built on top of the PyTorch* framework and highly aligned with OpenVINO™.

This document covers the high-level aspects of the model optimization flow in OpenVINO™.

General Information

By low precision we mean the inference of Deep Learning models in a precision lower than 32-bit or 16-bit floating point (FP32 and FP16). The most popular bit-width for low-precision inference is 8-bit integer (INT8/UINT8), because it is possible to obtain accurate 8-bit models that substantially speed up inference. Such models are quantized models, i.e. models that were trained in floating-point precision and then transformed to an integer representation, with floating/fixed-point quantization operations inserted between the layers. This transformation can be done using post-training methods or with additional retraining/fine-tuning.
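
To make the transformation concrete, below is a minimal NumPy sketch of uniform 8-bit affine quantization and the corresponding dequantization. The scale and zero-point values are illustrative, not taken from a real model.

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Map float values onto the UINT8 grid and clamp to its range.
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Map the integer values back to approximate float values.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 2.0 / 255, 128          # covers roughly [-1, 1]
q = quantize(x, scale, zero_point)
print(q)                                    # integer representation
print(dequantize(q, scale, zero_point))     # close to x, up to rounding error
```

The round-trip reproduces the original values only up to the rounding error, which is exactly the error that post-training methods and fine-tuning aim to keep small.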

Starting from the OpenVINO 2020.1 release, all quantized models are represented using the so-called FakeQuantize layer, a very expressive primitive able to represent operations such as Quantize, Dequantize, and Requantize. This operation is inserted into the model during the quantization procedure and stores the quantization parameters for the layers. For more details about this operation, please refer to the FakeQuantize operation specification.
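
As an illustration, the following NumPy sketch reproduces the element-wise semantics of FakeQuantize: the input range is split into levels - 1 equal steps, values are snapped to the nearest step, and the result is rescaled to the output range. The parameter values here are illustrative.

```python
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    # Clamp to the input range, as values outside it saturate.
    x = np.clip(x, input_low, input_high)
    # Normalize to [0, 1] and snap to the (levels - 1)-step quantization grid.
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    # Rescale the grid points to the output range.
    return q / (levels - 1) * (output_high - output_low) + output_low

x = np.linspace(-1.5, 1.5, 7).astype(np.float32)
# With matching input and output ranges this models quantize-then-dequantize.
print(fake_quantize(x, input_low=-1.0, input_high=1.0,
                    output_low=-1.0, output_high=1.0))
```

With different input and output ranges the same primitive expresses requantization, which is why a single operation can cover Quantize, Dequantize, and Requantize.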

To execute such "fake-quantized" models, OpenVINO has a low-precision runtime, which is a part of the Inference Engine. It consists of a generic component that translates the model to a real integer representation and hardware-specific parts implemented in the corresponding hardware plug-ins.
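
From the user's perspective, a "fake-quantized" IR is loaded and executed exactly like a floating-point one. Below is a minimal sketch using the Inference Engine Python API (as in the 2021.x releases); the model file names and input shape are hypothetical.

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# The plug-in lowers FakeQuantize subgraphs to integer kernels where supported.
net = ie.read_network(model="model_int8.xml", weights="model_int8.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # hypothetical shape
result = exec_net.infer(inputs={input_name: data})
```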

Model Optimization Workflow

We propose a common workflow that aligns with those of other DL frameworks. It contains two main components: post-training quantization and Quantization-Aware Training (QAT). The first is the easiest way to get an optimized model, while the latter can be considered an alternative or an addition for cases where the first does not give accurate results.
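
As a rough illustration of the QAT path, the sketch below shows how a PyTorch model can be wrapped for quantization-aware fine-tuning with NNCF. It assumes a recent NNCF release; the model choice, config values, and file names are illustrative only.

```python
import torch
from torchvision.models import resnet18
from nncf import NNCFConfig
from nncf.torch import create_compressed_model

model = resnet18(pretrained=True)
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},  # insert FakeQuantize ops
})
# Wraps the model with FakeQuantize operations that are trained jointly with
# the weights. In practice, calibration data is also registered via
# nncf.torch.register_default_init_args to initialize the quantizer ranges.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# ... fine-tune compressed_model with the regular training loop ...

# Export to ONNX; the resulting model can be converted to OpenVINO IR.
compression_ctrl.export_model("quantized_model.onnx")
```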

The diagram below shows the optimization flow for a new model with OpenVINO and related tools.

Figure: Low-precision optimization flow with OpenVINO and related tools (low_precision_flow.png)