Model Optimization - NNCF

Model optimization means altering the model itself to improve its performance and reduce its size. It is an optional step, typically used only at the development stage, so that a pre-optimized model is used in the final AI application.

In OpenVINO, the default optimization tool is NNCF (Neural Network Compression Framework). It is a set of compression algorithms, organized as a Python package, that make your models smaller and faster. Note that NNCF is not part of the OpenVINO package, so it needs to be installed separately. It supports models in OpenVINO IR, PyTorch and ONNX formats, offering the following main optimizations:

Weight Compression:

An easy-to-use method for Large Language Model footprint reduction and inference acceleration.
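As a minimal sketch of how this might look in code, assuming an LLM that has already been converted to OpenVINO IR. The function name and file paths are hypothetical, and `nncf.compress_weights` is called with the library defaults rather than tuned settings:

```python
def compress_llm_weights(ir_path: str, out_path: str) -> None:
    """Compress the weights of an OpenVINO IR model and save the result."""
    # Imported lazily so this file still loads where NNCF/OpenVINO are absent.
    import nncf
    import openvino as ov

    model = ov.Core().read_model(ir_path)
    # Library defaults; no calibration dataset is passed in this sketch.
    compressed = nncf.compress_weights(model)
    ov.save_model(compressed, out_path)


# Hypothetical usage (paths are placeholders, not taken from the docs):
# compress_llm_weights("llm/openvino_model.xml", "llm_int8/openvino_model.xml")
```

Lower-precision modes and other parameters exist, but their choice depends on the model, so consult the NNCF API documentation before relying on anything beyond the defaults.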

Post-training Quantization:

Designed to optimize deep learning models by applying 8-bit integer quantization. It is the easiest way to optimize a model because it requires no retraining or fine-tuning, but it may result in a drop in accuracy. If the accuracy-performance tradeoff is not acceptable, Training-time Optimization may be a better option.
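To make the source of the accuracy drop concrete, here is a simplified, pure-Python illustration of 8-bit quantization arithmetic. It is not NNCF's algorithm: it assumes a single symmetric scale for the whole list, and the sample values are made up.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.5, 0.731]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error introduced by storing each value in 8 bits.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is rounded to the nearest multiple of `scale`, so the per-value error is at most half the scale. Accumulated across many layers, this rounding is what can lower model accuracy.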

Training-time Optimization:

Involves a suite of advanced methods such as Sparsity, as well as Quantization-aware Training. This kind of optimization requires the use of the PyTorch framework.

Recommended workflows

A common approach for most cases is to:
1. Perform post-training quantization first, as it is the easiest option.
2. If the accuracy drop is unacceptable, use quantization-aware training instead. It will give you the same level of performance boost, with a smaller impact on accuracy.

Weight compression works with LLMs, VLMs and other Transformer-based models.
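The two-step approach above can be sketched roughly as follows. The helper names and the `max_drop` threshold are hypothetical; only `nncf.Dataset` and `nncf.quantize` are NNCF calls, and how accuracy is measured is left to your own validation code.

```python
def quantize_post_training(model, calibration_items, transform_fn):
    """Step 1: apply 8-bit post-training quantization with NNCF."""
    # Imported lazily so this file still loads where NNCF is absent.
    import nncf

    # transform_fn turns one calibration item into a model input.
    calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
    return nncf.quantize(model, calibration_dataset)


def next_step(baseline_accuracy, quantized_accuracy, max_drop=0.01):
    """Step 2: decide whether the accuracy drop is acceptable.

    max_drop is an assumed, project-specific threshold.
    """
    drop = baseline_accuracy - quantized_accuracy
    if drop <= max_drop:
        return "keep post-training quantization"
    return "try quantization-aware training"


print(next_step(0.80, 0.795))
print(next_step(0.80, 0.70))
```

Keeping the decision separate from the quantization call makes it easy to plug in whichever metric your application actually cares about.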

[Figure: development flow diagram (../_images/DEVELOPMENT_FLOW_V3_crunch.svg)]

Installation and usage

To learn about the full scope of the framework, its installation, and technical details, visit both the NNCF repository and NNCF API documentation.

Installation

PyPI

pip install nncf

Conda

conda install -c conda-forge nncf

For more installation details, see the page on NNCF Installation.

System Requirements

The full list of requirements is available in the NNCF GitHub repository.

Note that to optimize a model, you will need to install this model’s framework as well. Install NNCF in the same Python environment as the framework. For a list of recommended framework versions, see the framework compatibility table.
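One way to confirm that NNCF and the model's framework share a Python environment is to query the installed package versions. This is a small sketch using only the standard library. The list of package names is an assumption about which frameworks you might use, so adjust it to your setup:

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumed set of packages of interest; edit to match your setup.
for name in ("nncf", "openvino", "torch", "onnx"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter you use for optimization, then compare the reported versions against the framework compatibility table.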

Note

Once optimized, models can be executed with the typical OpenVINO inference workflow; no additional changes to the inference code are required.

This is true for models optimized using NNCF, as well as for those pre-optimized in their source frameworks, such as PyTorch and ONNX (in the Q/DQ, that is Quantize/DeQuantize, format). The latter can be converted directly to the OpenVINO Intermediate Representation (IR) format.
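To illustrate that the inference code stays the same, the sketch below loads and runs a model without knowing whether it was optimized. The function name, the default device, and the single-output assumption are illustrative choices, not requirements:

```python
def run_inference(model_path, input_data, device="CPU"):
    """Compile a model with OpenVINO and return its first output."""
    # Imported lazily so this file still loads where OpenVINO is absent.
    import openvino as ov

    core = ov.Core()
    # The same call is used for original and optimized models alike.
    compiled = core.compile_model(model_path, device)
    results = compiled(input_data)
    return results[compiled.output(0)]


# Hypothetical usage (paths and inputs are placeholders):
# original = run_inference("model.xml", sample_input)
# optimized = run_inference("model_int8.xml", sample_input)
```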

Hugging Face Optimum Intel offers OpenVINO integration with Hugging Face models and pipelines. NNCF serves as the compression backend within Hugging Face Optimum Intel, integrating with the widely used transformers library to enhance model performance.
