Quantizing Models Post-training#

Post-training model optimization is the process of applying special methods that transform a model into a more hardware-friendly representation without retraining or fine-tuning it. The most widely adopted method is 8-bit post-training quantization because it:

  • is easy to use

  • does not impact accuracy much

  • provides a significant performance improvement

  • fits most hardware, since 8-bit computation is widely supported

8-bit integer quantization lowers the precision of weights and activations to 8 bits. This leads to an almost 4x reduction in the model footprint and to significant improvements in inference speed, mostly because 8-bit data requires less memory bandwidth and 8-bit arithmetic achieves higher throughput on supporting hardware. The reduction is performed before the actual inference, when the model is transformed into the quantized representation. The process does not require any training datasets or pipelines from the source DL framework.
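
For a rough sense of the footprint arithmetic, the back-of-the-envelope estimate below uses a hypothetical 100-million-weight model; the parameter count is an assumption chosen purely for illustration.

```python
# Back-of-the-envelope footprint estimate for a hypothetical model
# with 100 million weights (illustrative numbers, not a benchmark).
num_weights = 100_000_000

fp32_bytes = num_weights * 4  # FP32: 4 bytes per weight
int8_bytes = num_weights * 1  # INT8: 1 byte per weight

print(f"FP32 footprint: {fp32_bytes / 2**20:.0f} MiB")   # ~381 MiB
print(f"INT8 footprint: {int8_bytes / 2**20:.0f} MiB")   # ~95 MiB
print(f"Reduction:      {fp32_bytes / int8_bytes:.0f}x")  # 4x (weights only)
```

The saving is "almost" rather than exactly 4x because quantization parameters (scales and zero points) and any layers kept in higher precision add a small overhead.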

[Figure: overview of post-training quantization (quantization_picture.svg)]

Neural Network Compression Framework (NNCF) provides a post-training quantization API, available in Python, that aims to reuse the model training or validation code usually shipped with the model in the source framework, such as PyTorch or TensorFlow. The NNCF API is cross-framework and currently supports OpenVINO, PyTorch, TensorFlow 2.x, and ONNX. Post-training quantization for models in the OpenVINO IR format is the most mature in terms of supported methods and model coverage.

The NNCF API offers two main options to apply 8-bit post-training quantization:

  • Basic quantization - the simplest flow, which applies 8-bit integer quantization to the model. Only a representative calibration dataset is needed in this case (see the first sketch after this list).

  • Quantization with accuracy control - the most advanced flow, which applies 8-bit quantization while keeping the accuracy drop within a user-defined bound. Calibration and validation datasets, as well as a validation function to calculate the accuracy metric, are needed in this case (see the second sketch after this list).
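
A minimal sketch of the basic flow for an OpenVINO IR model, using the `nncf.quantize` and `nncf.Dataset` APIs. The model path, input shape, and the random stand-in calibration data are placeholders for this sketch; replace them with a real model and a few hundred representative samples.

```python
import numpy as np
import nncf
import openvino as ov

# Load the original FP32 model (placeholder path).
model = ov.Core().read_model("model.xml")

# Stand-in calibration data: random tensors shaped like the model input.
# In practice, use representative samples; a transform_fn can be passed
# to nncf.Dataset to convert raw data items into model inputs.
calibration_data = [
    np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)
]
calibration_dataset = nncf.Dataset(calibration_data)

# Apply basic 8-bit post-training quantization: no labels or
# validation pipeline are required.
quantized_model = nncf.quantize(model, calibration_dataset)

# Save the quantized model in OpenVINO IR format.
ov.save_model(quantized_model, "quantized_model.xml")
```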
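
A similar sketch for the accuracy-aware flow through `nncf.quantize_with_accuracy_control`. The validation function, the (input, label) item layout, and the 1% `max_drop` threshold are illustrative assumptions; the validation function receives the model and the items of the validation dataset and must return a single metric value.

```python
import numpy as np
import nncf
import openvino as ov

def validate(compiled_model: ov.CompiledModel, validation_items) -> float:
    # Top-1 accuracy over (input, label) pairs; the item layout is an
    # assumption of this sketch.
    correct = total = 0
    for inputs, label in validation_items:
        prediction = compiled_model(inputs)[0]
        correct += int(np.argmax(prediction) == label)
        total += 1
    return correct / total

model = ov.Core().read_model("model.xml")  # placeholder path

# Stand-in datasets: replace with real calibration and validation data.
calibration_data = [
    np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)
]
validation_data = [
    (np.random.rand(1, 3, 224, 224).astype(np.float32),
     np.random.randint(0, 1000))
    for _ in range(100)
]

# Quantize while keeping the accuracy drop within the given bound;
# the layers that hurt the metric most are reverted to the original
# precision.
quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=nncf.Dataset(calibration_data),
    validation_dataset=nncf.Dataset(validation_data),
    validation_fn=validate,
    max_drop=0.01,
)
```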

Additional Resources#