Post-training Quantization#

Post-training quantization is a method of reducing the size of a model to make it lighter, faster, and less resource-hungry. Importantly, this process requires no retraining, no fine-tuning, and no access to the training datasets or pipelines of the source framework. With NNCF, you can perform 8-bit quantization using mainly two flows:

  • Basic quantization: requires only a representative calibration dataset (a minimal sketch follows this list).

  • Quantization with accuracy control: ensures that the accuracy of the resulting model does not drop below a given threshold. This flow requires both calibration and validation datasets, as well as a validation function to calculate the accuracy metric.
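
To ground the first flow, here is a minimal sketch of basic quantization with NNCF. It is illustrative rather than a drop-in recipe: the model path model.xml, the input shape, and the random calibration samples are placeholders to replace with a real model and representative data.

```python
import numpy as np
import openvino as ov

import nncf

# Load the model to be quantized (placeholder path).
model = ov.Core().read_model("model.xml")

# Representative calibration data; a few hundred samples is typical.
# Random tensors stand in here for real inference inputs.
data_source = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)]

def transform_fn(data_item):
    # Map each item of the data source to the model's expected input format.
    return data_item

calibration_dataset = nncf.Dataset(data_source, transform_fn)

# Basic 8-bit post-training quantization.
quantized_model = nncf.quantize(model, calibration_dataset)
ov.save_model(quantized_model, "model_int8.xml")
```

The second flow follows the same pattern through nncf.quantize_with_accuracy_control. The sketch below reuses the names above and stubs out the validation function; in practice it would run inference over held-out data and return the accuracy metric (check the exact signature against your installed NNCF version):

```python
validation_dataset = nncf.Dataset(data_source, transform_fn)  # use held-out data in practice

def validate(model, validation_data) -> float:
    # Stub: evaluate the model on validation_data and return its accuracy.
    return 1.0

quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most this drop in the accuracy metric
)
```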

Why 8-bit post-training quantization#

8-bit quantization is just one of the available compression methods, but it is the one most often selected for:

  • significant performance gains,

  • little impact on accuracy,

  • ease of use,

  • wide hardware compatibility.

It lowers the precision of model weights and activations to 8 bits (INT8), which for an FP32 model is just a quarter of the original footprint, leading to a significant improvement in inference speed.
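
To make the quarter-footprint claim concrete, here is the back-of-envelope arithmetic for a hypothetical 100-million-parameter model, counting weights only:

```python
num_params = 100_000_000

fp32_mb = num_params * 4 / 1e6  # 4 bytes per FP32 weight -> ~400 MB
int8_mb = num_params * 1 / 1e6  # 1 byte per INT8 weight  -> ~100 MB

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB "
      f"({int8_mb / fp32_mb:.0%} of the original)")
```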

Figure: overview of the 8-bit quantization process (quantization_picture.svg)

Additional Resources#