[Deprecated] Post-training Quantization with POT

Danger

The Post-training Optimization Tool (POT) has been deprecated since OpenVINO 2023.0. Neural Network Compression Framework (NNCF) is recommended for post-training quantization instead.

For post-training optimization, OpenVINO provides the Post-training Optimization Tool (POT), which supports the uniform integer quantization method. This method converts weights and activations from floating-point precision to integer precision (for example, 8-bit) for use at inference time. It reduces model size, memory footprint, and latency, and improves computational efficiency by using integer arithmetic. During quantization, the model is transformed: additional operations that carry the quantization information are inserted into it. The actual transition to integer arithmetic happens at model inference.
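To build intuition for what uniform quantization computes, here is a minimal NumPy sketch of symmetric 8-bit quantization of a single tensor. This is a conceptual illustration only, not POT's actual implementation, which works by inserting FakeQuantize operations into the model graph; all names in the snippet are hypothetical.

    import numpy as np

    def uniform_quantize(x, num_bits=8):
        # Symmetric per-tensor quantization: map floats to signed integers.
        qmax = 2 ** (num_bits - 1) - 1         # 127 for 8 bits
        scale = np.abs(x).max() / qmax         # one scale factor for the tensor
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    x = np.random.randn(64).astype(np.float32)
    q, scale = uniform_quantize(x)
    x_approx = q.astype(np.float32) * scale    # dequantized approximation of x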

The post-training quantization algorithm takes samples from the representative dataset, feeds them through the network, and calibrates the network based on the resulting weight and activation value ranges. Once calibration is complete, values in the network are converted to 8-bit integer format.
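Conceptually, calibration amounts to collecting value ranges while the representative samples run through the model. Below is a hedged sketch of that idea, where infer_activations is a hypothetical stand-in for running the network and capturing the activations of one layer:

    import numpy as np

    def calibrate_range(infer_activations, samples):
        # Track the minimum and maximum activation values over all samples.
        low, high = np.inf, -np.inf
        for sample in samples:
            activations = infer_activations(sample)
            low = min(low, float(activations.min()))
            high = max(high, float(activations.max()))
        # The quantization scale for this layer is derived from [low, high].
        return low, high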

While post-training quantization makes your model run faster and use less memory, it may cause a slight reduction in accuracy. If your quantized model is not accurate enough, try Quantization-aware Training to increase its accuracy.

Post-Training Quantization Quick Start Examples
Try out these interactive Jupyter Notebook examples to learn the POT API and see post-training quantization in action:

Quantizing Models with POT

The figure below shows the post-training quantization workflow with POT. In a typical workflow, a pre-trained model is converted to OpenVINO IR format using Model Optimizer. Then, the model is quantized with a representative dataset using POT.

[Figure: Post-training quantization workflow with POT]
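In code, this workflow looks roughly like the sketch below, which uses the POT Python API (openvino.tools.pot) as documented for OpenVINO 2022.x, with the Default Quantization method described in the next section. The model.xml/model.bin paths and the randomly generated calibration data are placeholders; substitute a real IR model and real preprocessed inputs.

    import numpy as np
    from openvino.tools.pot import (DataLoader, IEEngine, create_pipeline,
                                    load_model, save_model)

    class CalibrationLoader(DataLoader):
        # Minimal DataLoader: __getitem__ returns (data, annotation) pairs.
        def __init__(self, items):
            self._items = items
        def __len__(self):
            return len(self._items)
        def __getitem__(self, index):
            return self._items[index], None   # no annotation needed here

    # Placeholder calibration data; replace with real preprocessed inputs.
    images = [np.random.rand(1, 3, 224, 224).astype(np.float32)
              for _ in range(300)]

    model = load_model({"model_name": "model",
                        "model": "model.xml",     # IR produced by Model Optimizer
                        "weights": "model.bin"})
    engine = IEEngine(config={"device": "CPU"},
                      data_loader=CalibrationLoader(images))
    algorithms = [{"name": "DefaultQuantization",
                   "params": {"preset": "performance",
                              "stat_subset_size": 300}}]
    pipeline = create_pipeline(algorithms, engine)
    quantized_model = pipeline.run(model)
    save_model(quantized_model, save_path="optimized_model")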

Post-training Quantization Methods

Depending on your needs and requirements, POT provides two quantization methods: Default Quantization and Accuracy-aware Quantization.

Default Quantization

Default Quantization uses an unannotated dataset to perform quantization. It uses representative dataset items to estimate the range of activation values in a network and then quantizes the network. This method is the recommended starting point because it produces a fast and accurate model in most cases. To quantize your model with Default Quantization, see the Quantizing Models page.
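As a configuration sketch, Default Quantization is selected through the algorithms list passed to create_pipeline, as in the workflow example above. The parameter values shown here are the commonly documented defaults, not requirements:

    algorithms = [{
        "name": "DefaultQuantization",
        "params": {
            "target_device": "ANY",     # or a specific device, e.g. "CPU"
            "preset": "performance",    # fully symmetric quantization; "mixed"
                                        # uses asymmetric activations instead
            "stat_subset_size": 300,    # number of samples used for calibration
        },
    }]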

Accuracy-aware Quantization

Accuracy-aware Quantization is an advanced method that keeps model accuracy within a predefined range by leaving some network layers unquantized. It trades some speed for accuracy to meet user-specified accuracy requirements. This method requires an annotated dataset and may take more time to quantize. To quantize your model with Accuracy-aware Quantization, see the Quantizing Models with Accuracy Control page.
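Because this method checks accuracy during quantization, the engine must be given a metric along with annotated data. Below is a hedged sketch following the Metric interface from the POT API; annotated_loader is a hypothetical DataLoader whose __getitem__ returns (data, label) pairs:

    from openvino.tools.pot import IEEngine, Metric

    class Accuracy(Metric):
        # Top-1 accuracy metric in the form expected by the POT engine.
        def __init__(self):
            super().__init__()
            self._name = "accuracy"
            self._matches = []

        @property
        def value(self):
            # Metric value for the last processed sample.
            return {self._name: self._matches[-1]}

        @property
        def avg_value(self):
            # Average metric value over all processed samples.
            return {self._name: sum(self._matches) / len(self._matches)}

        def update(self, output, target):
            prediction = output[0].argmax()
            self._matches.append(float(prediction == target[0]))

        def reset(self):
            self._matches = []

        def get_attributes(self):
            return {self._name: {"direction": "higher-better",
                                 "type": "accuracy"}}

    # annotated_loader is hypothetical; its items must include labels.
    engine = IEEngine(config={"device": "CPU"},
                      data_loader=annotated_loader, metric=Accuracy())

    algorithms = [{
        "name": "AccuracyAwareQuantization",
        "params": {
            "target_device": "ANY",
            "preset": "performance",
            "stat_subset_size": 300,
            "maximal_drop": 0.01,   # tolerate at most this absolute metric drop
        },
    }]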

Quantization Best Practices and FAQs

If you quantized your model and it is not accurate enough, visit the Quantization Best Practices page for tips on improving the accuracy of the quantized model. Note that older Intel CPU generations can encounter a saturation issue when running quantized models, which can reduce accuracy; learn more on the Saturation Issue Workaround page.

Do you have more questions about post-training quantization, or are you encountering errors while using POT? Visit the POT FAQ page for answers to frequently asked questions and solutions to common errors.