Optimizing Models Post-training

Post-training model optimization applies special methods, such as post-training 8-bit quantization, without retraining or fine-tuning the model. Therefore, it requires neither a training dataset nor a training pipeline in the source DL framework. To apply post-training methods in OpenVINO, you need:

  • A floating-point precision model, FP32 or FP16, converted into the OpenVINO Intermediate Representation (IR) format, which can be run on CPU.

  • A representative calibration dataset that reflects a use case scenario, for example, 300 samples.

  • If accuracy constraints must be met, a validation dataset and accuracy metrics.

For post-training optimization, OpenVINO provides the Post-training Optimization Tool (POT), which supports the uniform integer quantization method. This method moves weights and activations from floating-point to integer precision (for example, 8-bit) for inference. It reduces the model size, memory footprint, and latency, and improves computational efficiency by using integer arithmetic. During quantization, the model is transformed: additional operations that contain quantization information are inserted into it. The actual transition to integer arithmetic happens at model inference.
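
To make this concrete, the sketch below emulates what such an inserted quantization operation computes, in the spirit of OpenVINO's FakeQuantize operation: values are clamped to a calibrated range and snapped to a uniform 256-level (8-bit) grid, while the data stays in floating point until the runtime switches to integer arithmetic. The function name and ranges here are illustrative, not part of the POT API:

    import numpy as np

    def fake_quantize(x, low, high, levels=256):
        # Clamp to the calibrated range, snap to a uniform integer grid,
        # then map back to floating point. At inference time, the runtime
        # replaces this emulation with real integer arithmetic.
        x = np.clip(x, low, high)
        scale = (high - low) / (levels - 1)
        return np.round((x - low) / scale) * scale + low

    activations = np.random.randn(4).astype(np.float32)
    print(fake_quantize(activations, low=-1.0, high=1.0))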

The figure below shows the optimization workflow with POT:

[Image: workflow_simple.png]

POT is distributed as part of the OpenVINO Development Tools package and is also available on GitHub.

Quantizing models with POT

Depending on your requirements, POT provides two main quantization methods:

  • Default Quantization: a recommended method that provides fast and accurate results in most cases. It requires only an unannotated dataset for quantization (see the sketch after this list). For more details, see the Default Quantization algorithm documentation.

  • Accuracy-aware Quantization: an advanced method that keeps accuracy within a predefined range, at the cost of a smaller performance improvement, when Default Quantization cannot guarantee it. This method requires an annotated representative dataset and may take more time for quantization. For more details, see the Accuracy-aware Quantization algorithm documentation.
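
The following minimal sketch shows how Default Quantization is typically driven through the POT Python API (openvino.tools.pot, as shipped in the 2022.x releases). The file names, the random calibration array, and the CalibrationLoader class are placeholders for your own model and data:

    import numpy as np
    from openvino.tools.pot import (
        DataLoader, IEEngine, load_model, save_model, create_pipeline)

    class CalibrationLoader(DataLoader):
        """Feeds unannotated calibration samples to POT."""
        def __init__(self, images):
            self._images = images

        def __len__(self):
            return len(self._images)

        def __getitem__(self, index):
            # Default Quantization needs data only, no annotations.
            return self._images[index]

    # ~300 preprocessed samples in the model's input layout (placeholder data).
    images = np.random.rand(300, 3, 224, 224).astype(np.float32)

    algorithms = [{
        "name": "DefaultQuantization",
        "params": {"target_device": "ANY",
                   "preset": "performance",
                   "stat_subset_size": 300},
    }]

    model = load_model({"model": "model.xml", "weights": "model.bin"})
    engine = IEEngine({"device": "CPU"}, data_loader=CalibrationLoader(images))
    pipeline = create_pipeline(algorithms, engine)
    quantized_model = pipeline.run(model)
    save_model(quantized_model, save_path="optimized_model")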

Different hardware platforms support different integer precisions and quantization parameters. For example, 8-bit is used by CPU, GPU, and VPU, and 16-bit by GNA. POT abstracts this complexity through the concept of a “target device”, which selects quantization settings specific to that device.

Note

There is a special value of the target_device, "ANY", which produces portable quantized models compatible with CPU, GPU, and VPU devices. GNA-quantized models are compatible only with CPU.
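
In the Python API sketch above, the target device is just one of the algorithm parameters, so switching between a portable and a device-specific model is a one-line change (the GNA variant is shown as a hypothetical example):

    # Portable quantized model, compatible with CPU, GPU, and VPU:
    params = {"target_device": "ANY", "stat_subset_size": 300}

    # Device-specific quantization, for example for GNA
    # (the resulting model is compatible only with CPU):
    params = {"target_device": "GNA", "stat_subset_size": 300}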

For benchmarking results collected for the models optimized with the POT tool, refer to the INT8 vs FP32 Comparison on Select Networks and Platforms.