Low Precision Optimization Guide

Introduction

The purpose of this document is to provide the best-known methods for using the low-precision capabilities of the OpenVINO toolkit. Currently, these capabilities are represented by several components:

The first two components are part of the OpenVINO toolkit itself, while the latter is an OpenVINO extension that is highly aligned with the toolkit and is called OpenVINO Training Extensions (OTE).

This document covers the high-level aspects of the model optimization flow in OpenVINO.

General information

By low precision we mean the inference of deep learning models in a precision lower than the 32-bit or 16-bit floating-point formats (FP32 and FP16). For example, the most popular bit-width for low-precision inference is 8-bit integer (INT8/UINT8). Such models are referred to as quantized models, i.e. models that were trained in floating-point precision and then transformed to an integer representation with floating-/fixed-point quantization operations inserted between the layers.
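To make the idea concrete, below is a minimal NumPy sketch of uniform (affine) UINT8 quantization and dequantization. The scale and zero-point values are illustrative assumptions for this example only, not parameters taken from any OpenVINO API.

```python
import numpy as np

def quantize_uint8(x, scale, zero_point):
    # Map floating-point values onto the 256-level UINT8 grid.
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Map UINT8 codes back to approximate floating-point values.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.1, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 2.0 / 255, 128  # roughly covers the range [-1.0, 1.0]
x_hat = dequantize(quantize_uint8(x, scale, zero_point), scale, zero_point)
print(x_hat)  # equal to x up to the quantization error
```

A quantized model stores such range (scale/zero-point) parameters alongside the weights and activations so that integer arithmetic can be used between layers while the result stays close to the floating-point one.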

Starting from the OpenVINO 2020.1 release, all quantized models are represented using the so-called FakeQuantize layer, which is a very expressive primitive able to represent operations such as Quantize, Dequantize, Requantize, and more. For more details, please refer to the description of this operation.
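As an illustration, the following is a small NumPy sketch of the reference FakeQuantize behavior; the levels, input_low/input_high, and output_low/output_high names follow the operation's attributes, but the code is a simplified scalar-range approximation, not the plug-in implementation.

```python
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels):
    # Values below the input range saturate to output_low, values above it to output_high;
    # everything in between is snapped to one of `levels` evenly spaced values and
    # re-scaled to the output range.
    x = np.asarray(x, dtype=np.float32)
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    inside = q / (levels - 1) * (output_high - output_low) + output_low
    out = np.where(x <= input_low, output_low, inside)
    return np.where(x > input_high, output_high, out)

# With levels=256 the tensor stays in floating point but takes only 256 distinct values,
# which is what lets the runtime replace it with real INT8/UINT8 arithmetic later.
x = np.linspace(-1.5, 1.5, 7)
print(fake_quantize(x, input_low=-1.0, input_high=1.0,
                    output_low=-1.0, output_high=1.0, levels=256))
```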

In order to execute such "fake-quantized" models, OpenVINO has a low-precision runtime that consists of a generic part, which translates the model to a real integer representation, and a hardware-specific part implemented in the corresponding hardware plug-ins.
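From the user's perspective, a quantized IR is typically loaded and executed like any other model, and the plug-in applies the low-precision transformations internally. Below is a minimal sketch assuming the Python Inference Engine API of the 2020.x-era releases (newer releases expose an equivalent openvino.runtime API); the model paths, the input name "data", and the NCHW shape are placeholders.

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# "model.xml"/"model.bin" are placeholder paths to a quantized ("fake-quantized") IR.
net = ie.read_network(model="model.xml", weights="model.bin")
# The CPU plug-in recognizes FakeQuantize operations and selects INT8 kernels where possible.
exec_net = ie.load_network(network=net, device_name="CPU")

# "data" and the shape below are illustrative placeholders; use your model's real input.
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
result = exec_net.infer(inputs={"data": dummy})
print({name: out.shape for name, out in result.items()})
```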

Model optimization flow

The diagram below shows a common optimization flow for a new model with OpenVINO and related tools.

Figure: low_precision_flow.png (model optimization flow with OpenVINO)