Model Optimization - NNCF#
Model optimization means altering the model itself to improve its performance and reduce its size. It is an optional step, typically performed at the development stage, so that the final AI application works with a pre-optimized model.
In OpenVINO, the default optimization tool is NNCF (Neural Network Compression Framework): a set of compression algorithms, organized as a Python package, that make your models smaller and faster. Note that NNCF is not part of the OpenVINO package, so it needs to be installed separately. It supports models in PyTorch, TensorFlow, ONNX, and OpenVINO IR formats, offering optimizations such as post-training quantization, quantization-aware training, weight compression, and filter pruning.
Recommended workflows#
A common approach for most cases is to:
Perform post-training quantization first, as it is the easiest option (see the sketch after this list).
For even better results, combine post-training quantization with filter pruning.
If the accuracy drop is unacceptable, use quantization-aware training instead. It will give you the same level of performance boost, with a smaller impact on accuracy.
Weight compression is intended for LLMs only; do not apply it to other model types.
For visual-multimodal use cases, the encoder/decoder split approach may be recommended.
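As a minimal sketch of the post-training quantization step, assuming a model already converted to OpenVINO IR (the model path, input shape, and random calibration data below are placeholders for your own model and dataset):

import numpy as np
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder: path to your IR model

# Placeholder calibration data: a few hundred representative samples.
# In practice, iterate over real validation data (e.g., a DataLoader).
data_source = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(300)]

def transform_fn(data_item):
    # Adapt one sample from the data source into the model's input format.
    return data_item

calibration_dataset = nncf.Dataset(data_source, transform_fn)

# Post-training quantization: no retraining required.
quantized_model = nncf.quantize(model, calibration_dataset)

ov.save_model(quantized_model, "quantized_model.xml")

For LLM weight compression, the basic call is nncf.compress_weights(model), which requires no calibration data in its simplest form.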
Installation and usage#
To learn about the full scope of the framework, its installation, and technical details, visit both the NNCF repository and NNCF API documentation.
pip install nncf
conda install -c conda-forge nncf
For more installation details, see the page on NNCF Installation.
A full requirement listing is available in the NNCF GitHub Repository.
Note that to optimize a model, you will also need to install that model's framework. Install NNCF in the same Python environment as the framework. For a list of recommended framework versions, see the framework compatibility table.
Note
Once optimized, models may be executed with the typical OpenVINO inference workflow; no additional changes to the inference code are required.
This is true for models optimized using NNCF, as well as those pre-optimized in their source frameworks, such as PyTorch, TensorFlow, and ONNX (in the Quantize/DeQuantize, or Q/DQ, format). The latter may be converted directly to the OpenVINO Intermediate Representation (IR) format.
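For example, a quantized IR model produced by NNCF compiles and runs exactly like the original (the model path and input shape are placeholders):

import numpy as np
import openvino as ov

core = ov.Core()

# An NNCF-optimized model loads like any other IR model.
compiled_model = core.compile_model("quantized_model.xml", "CPU")

input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input
result = compiled_model(input_data)[0]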
Hugging Face Optimum Intel offers OpenVINO integration with Hugging Face models and pipelines. NNCF serves as the compression backend within Optimum Intel, integrating with the widely used transformers library to enhance model performance.
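For instance, here is a minimal sketch of loading a causal language model through Optimum Intel, with NNCF 8-bit weight compression applied during export (the model ID is a placeholder for any compatible checkpoint on the Hugging Face Hub):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "your-org/your-llm"  # placeholder model ID

# export=True converts the model to OpenVINO IR;
# load_in_8bit=True applies NNCF 8-bit weight compression on the fly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))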