Model Optimization - NNCF

Model optimization means altering the model itself to improve its performance and reduce its size. It is an optional step, typically performed only during development, so that the final AI application uses a pre-optimized model.

In OpenVINO, the default optimization tool is NNCF (Neural Network Compression Framework). It is a set of compression algorithms, organized as a Python package, that make your models smaller and faster. Note that NNCF is not part of the OpenVINO package, so it must be installed separately. It supports models in PyTorch, TensorFlow, ONNX, and OpenVINO IR formats and offers the following main optimizations:

Weight Compression: an easy-to-use method for reducing the memory footprint of Large Language Models and accelerating their inference.
Post-training Quantization: optimizes deep learning models by applying 8-bit integer quantization. As the easiest way to optimize a model, it does not require retraining or fine-tuning, but it may cause a drop in accuracy. If the accuracy-performance tradeoff is not acceptable, Training-time Optimization may be a better option.
Training-time Optimization: a suite of advanced methods, such as structured or unstructured pruning and quantization-aware training. This kind of optimization requires the model’s original framework, which for NNCF is either PyTorch or TensorFlow.
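To make "8-bit integer quantization" concrete, here is a minimal pure-Python sketch of the arithmetic behind it: symmetric per-tensor quantization, where each float is divided by a scale, rounded, and clamped to the int8 range. It is a conceptual illustration under simplifying assumptions (one scale per tensor, no zero point, no calibration), not NNCF's implementation; NNCF's post-training quantization additionally uses a calibration dataset to estimate quantization parameters. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; use scale 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest quantization step and clamp to the int8 range.
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate floating-point values."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# The rounding error per element is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The same arithmetic explains both the size reduction (one byte per value instead of four for FP32) and the possible accuracy drop mentioned above (the rounding error).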

Installation and usage

To learn about the full scope of the framework, its installation, and technical details, visit both the NNCF repository and the NNCF API documentation.

Install NNCF from PyPI:

pip install nncf

or from conda-forge:

conda install -c conda-forge nncf

For more installation details, see the page on NNCF Installation.

The full list of requirements is available in the NNCF GitHub repository.

Note that to optimize a model, you will need to install this model’s framework as well. Install NNCF in the same Python environment as the framework. For a list of recommended framework versions, see the framework compatibility table.
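As a usage illustration, the following is a minimal sketch of post-training quantization and weight compression applied to an OpenVINO IR model with NNCF's Python API. It assumes NNCF and OpenVINO are installed in the same environment; `nncf.Dataset`, `nncf.quantize`, `nncf.compress_weights`, `ov.Core`, and `ov.save_model` are public API calls, while the helper names, file paths, and the dict-based calibration item format are hypothetical. The imports sit inside the functions so the file can be loaded even where those packages are missing.

```python
def transform_fn(data_item):
    """Convert one calibration item into model input.

    Hypothetical format: each item is a dict whose "input" key holds
    an array-like tensor accepted by the model.
    """
    return data_item["input"]


def quantize_ir_model(model_path, calibration_items, output_path):
    """Apply 8-bit post-training quantization to an OpenVINO IR model."""
    # Lazy imports: require nncf and openvino to be installed.
    import nncf
    import openvino as ov

    model = ov.Core().read_model(model_path)
    # nncf.Dataset wraps any iterable and applies transform_fn to each item.
    calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
    quantized_model = nncf.quantize(model, calibration_dataset)
    ov.save_model(quantized_model, output_path)
    return quantized_model


def compress_ir_weights(model_path, output_path):
    """Compress model weights (e.g. of an LLM) without calibration data."""
    import nncf
    import openvino as ov

    model = ov.Core().read_model(model_path)
    # Data-free weight compression with NNCF's default settings.
    compressed_model = nncf.compress_weights(model)
    ov.save_model(compressed_model, output_path)
    return compressed_model
```

For example, `quantize_ir_model("model.xml", items, "model_int8.xml")` would quantize a placeholder IR file using a list of calibration dicts. `nncf.quantize` also accepts PyTorch, TensorFlow, and ONNX models, in which case the model object and calibration inputs come from that framework instead of `read_model`.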

Note

Once optimized, models can be executed with the typical OpenVINO inference workflow; no additional changes to the inference code are required.

This is true both for models optimized with NNCF and for models pre-optimized in their source frameworks, such as PyTorch, TensorFlow, and ONNX (in the Q/DQ, Quantize/DeQuantize, format). The latter can be converted to the OpenVINO Intermediate Representation (IR) format right away.
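To illustrate that inference code stays the same, here is a minimal sketch that runs an IR file identically whether it holds the original or the optimized model, plus a helper that converts a Q/DQ ONNX model to IR. It assumes a recent OpenVINO release that provides `ov.convert_model` and `ov.save_model`; the function names, paths, and the "CPU" default are illustrative assumptions.

```python
def infer(model_path, inputs, device="CPU"):
    """Run one inference request; the code is the same for FP32 and INT8 IR."""
    # Lazy import: requires openvino to be installed.
    import openvino as ov

    core = ov.Core()
    compiled_model = core.compile_model(model_path, device)
    results = compiled_model(inputs)
    # Return the tensor produced by the model's first output.
    return results[compiled_model.output(0)]


def convert_quantized_onnx(onnx_path, ir_path):
    """Convert an ONNX model in Q/DQ format directly to OpenVINO IR."""
    import openvino as ov

    ov_model = ov.convert_model(onnx_path)
    ov.save_model(ov_model, ir_path)
    return ov_model
```

Swapping "model.xml" for "model_int8.xml" in a call to `infer` is the only change needed to switch to the optimized model.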

Hugging Face Optimum Intel offers OpenVINO integration with Hugging Face models and pipelines. NNCF serves as the compression backend within Optimum Intel, which integrates with the widely used transformers library to enhance model performance.
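As an example of that integration, here is a minimal sketch that exports a Hugging Face causal language model to OpenVINO IR with 8-bit weight compression performed by NNCF. It assumes a recent `optimum-intel` release with OpenVINO support; `OVModelForCausalLM` and `OVWeightQuantizationConfig` are Optimum Intel classes, but their parameters can differ between versions, and the helper name, model ID, and output directory are placeholders.

```python
def export_int8_llm(model_id, output_dir):
    """Export a Hugging Face causal LM to OpenVINO IR with 8-bit weights."""
    # Lazy import: requires optimum-intel with OpenVINO support installed.
    from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,  # convert the original checkpoint to OpenVINO IR
        quantization_config=OVWeightQuantizationConfig(bits=8),
    )
    # Save the compressed IR so it can be reloaded without re-exporting.
    model.save_pretrained(output_dir)
    return model
```

The returned model can then be used with the transformers-style API (for example, `generate`) in place of the original PyTorch model.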

Additional Resources