Post-training Quantization#
Post-training quantization is a method of reducing the size of a model to make it lighter, faster, and less resource-hungry. Importantly, this process does not require retraining, fine-tuning, or access to the training datasets and pipelines of the source framework. With NNCF, you can perform 8-bit quantization using two main flows.
Why 8-bit post-training quantization#
8-bit quantization is just one of the available compression methods, but it is often selected for:
significant performance results,
little impact on accuracy,
ease of use,
wide hardware compatibility.
It lowers model weight and activation precisions to 8 bits (INT8), which for an FP32 model is just a quarter of the original footprint, leading to a significant improvement in inference speed.
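The arithmetic behind that footprint reduction can be sketched in a few lines. The snippet below is a minimal, self-contained illustration of symmetric per-tensor INT8 quantization using NumPy, not the NNCF implementation itself: a stand-in FP32 weight tensor is mapped onto the INT8 range with a single scale factor, and the memory saving and quantization error are checked directly.

```python
import numpy as np

# Stand-in FP32 weight tensor, playing the role of one model layer.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

# Symmetric per-tensor quantization: map the FP32 range onto INT8 [-127, 127]
# with a single scale factor derived from the largest absolute value.
scale = np.abs(weights).max() / 127.0
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize back to FP32 (integer kernels would instead consume q_weights).
dequantized = q_weights.astype(np.float32) * scale

# INT8 storage is a quarter of the FP32 footprint.
print(weights.nbytes // q_weights.nbytes)

# The element-wise error is bounded by one quantization step.
print(np.abs(weights - dequantized).max() <= scale)
```

Real post-training quantization additionally calibrates activation ranges on sample data and typically uses finer-grained (for example, per-channel) scales, which is what NNCF automates.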