Compressing Models During Training

Introduction

Training-time model compression improves model performance by applying optimizations (such as quantization) during training. Because the training process minimizes the loss with the lower-precision optimizations in place, the model can maintain its accuracy while its latency and memory footprint are reduced. Training-time optimization generally yields better performance and accuracy than post-training optimization, but it can require more effort to set up.

OpenVINO provides the Neural Network Compression Framework (NNCF) tool for applying compression algorithms to models to improve their performance. NNCF is a Python library that integrates into PyTorch and TensorFlow training pipelines and adds training-time compression methods to them. To apply training-time compression methods with NNCF, you need:

  • A floating-point model from the PyTorch or TensorFlow framework.

  • A training pipeline set up in the PyTorch or TensorFlow framework.

  • Training and validation datasets.

Adding compression to a training pipeline only requires a few lines of code. The compression techniques are defined through a single configuration file that specifies which algorithms to use during fine-tuning.
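As a rough illustration of what such a configuration file might contain, the sketch below builds a minimal configuration as a Python dictionary and writes it to JSON. The field names (input_info, sample_size, compression, algorithm) follow the NNCF JSON configuration format as commonly documented, but treat them as assumptions and check them against the NNCF version you use. The helper functions and the file name nncf_config.json are hypothetical.

```python
import json
import os
import tempfile


def build_nncf_style_config(sample_size, algorithm="quantization"):
    """Return a minimal NNCF-style configuration dictionary.

    Assumption: the keys mirror the NNCF JSON config layout; verify them
    against the NNCF documentation before relying on them.
    """
    return {
        # Shape of one input sample, e.g. [batch, channels, height, width].
        "input_info": {"sample_size": list(sample_size)},
        # Which compression algorithm to apply during fine-tuning.
        "compression": {"algorithm": algorithm},
    }


def save_config(config, path):
    """Write the configuration to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)


def load_config(path):
    """Read a configuration back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


config = build_nncf_style_config([1, 3, 224, 224])
config_path = os.path.join(tempfile.mkdtemp(), "nncf_config.json")
save_config(config, config_path)
print(json.dumps(load_config(config_path), indent=4))
```

In a real pipeline, a file or dictionary like this would be handed to NNCF (for PyTorch, typically through NNCFConfig and create_compressed_model) rather than just printed. The exact calls depend on the NNCF version, so consult the NNCF documentation.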

NNCF Quick Start Examples

See the following Jupyter Notebooks for step-by-step examples showing how to add model compression to a PyTorch or TensorFlow training pipeline with NNCF:

Installation

NNCF is open-sourced on GitHub and distributed as a separate package from OpenVINO. It is also available on PyPI. Install it to the same Python environment where PyTorch or TensorFlow is installed.

Install from PyPI

To install the latest released version with pip, run the following command:

pip install nncf

Note

To install with specific frameworks, use the pip install nncf[extras] command, where extras is a list of possible extras, for example, torch, tf, onnx.

To install the latest NNCF version from source, follow the instructions on GitHub.

Note

NNCF does not have OpenVINO as an installation requirement. To deploy optimized models, install OpenVINO separately.
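Because NNCF, the training framework, and OpenVINO are installed independently, it can help to confirm which of them are visible in the current Python environment. The following is a minimal sketch that uses only the standard library. The installed_packages helper is hypothetical, and the names checked are the usual import names of these packages.

```python
import importlib.util


def installed_packages(names):
    """Map each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# nncf is needed for compression, torch or tensorflow for training,
# and openvino only for deploying the optimized model.
status = installed_packages(["nncf", "openvino", "torch", "tensorflow"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'not found'}")
```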

Working with NNCF

The figure below shows a common workflow for applying training-time compression with NNCF. The NNCF optimizations are added to the TensorFlow or PyTorch training script, and then the model undergoes fine-tuning. The optimized model can then be exported to OpenVINO IR format for accelerated performance with OpenVINO Runtime.

Figure: NNCF training-time compression workflow (_images/nncf_workflow.svg)

Training-Time Compression Methods

NNCF provides several methods for improving model performance with training-time compression.

Quantization

Quantization is the process of converting the weights and activation values in a neural network from a high-precision format (such as 32-bit floating point) to a lower-precision format (such as 8-bit integer). It helps to reduce the model’s memory footprint and latency. NNCF uses quantization-aware training to quantize models.

Quantization-aware training inserts nodes into the neural network during training that simulate the effect of lower precision. This allows the training algorithm to treat quantization errors as part of the overall loss that is minimized during training. As a result, the network can retain more of its accuracy once it is quantized.
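To make the idea of simulating lower precision concrete, here is a minimal pure-Python sketch of a quantize-then-dequantize ("fake quantization") step. It assumes symmetric, per-tensor, uniform quantization with a scale derived from the largest absolute value. This is an illustrative choice, not a description of how NNCF computes its quantization parameters, and the fake_quantize function is hypothetical.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to signed integers and map them back to floats.

    Assumption: symmetric per-tensor quantization with the scale taken
    from the largest absolute value in `values`.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    # Map back to floats; the difference from the input is the quantization error.
    dequantized = [q * scale for q in quantized]
    return quantized, dequantized, scale


weights = [0.82, -1.27, 0.003, 0.5]
q, dq, scale = fake_quantize(weights)
for w, qi, d in zip(weights, q, dq):
    print(f"float={w:+.4f}  int8={qi:+4d}  simulated={d:+.4f}  error={abs(w - d):.5f}")
```

During quantization-aware training, the forward pass would use the simulated values, so the loss reflects the rounding error and the optimizer can adjust the weights to compensate.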

The officially supported method of quantization in NNCF is uniform 8-bit quantization. This means all the weights and activation values in the neural network are converted to 8-bit values. See the Quantization-aware Training guide to learn more.

Filter pruning

Filter pruning algorithms compress models by zeroing out the output filters of convolutional layers according to a filter importance criterion. During fine-tuning, the criterion is used to find redundant filters that don’t contribute significantly to the network’s output, and those filters are zeroed out. After fine-tuning, the zeroed-out filters are removed from the network. For more information, see the Filter Pruning page.
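The two stages described above (zeroing out low-importance filters, then removing them) can be sketched in pure Python. This example assumes the L2 norm of each filter's weights as the importance criterion and a fixed pruning ratio. Both are illustrative assumptions, and the prune_filters function is hypothetical rather than part of NNCF.

```python
import math


def prune_filters(filters, pruning_ratio):
    """Zero out the least important filters, then drop them.

    `filters` is a list of filters, each a flat list of weights.
    Assumption: importance is the L2 norm of the filter's weights.
    """
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    num_to_prune = int(len(filters) * pruning_ratio)
    # Indices of the filters with the smallest norms.
    by_importance = sorted(range(len(filters)), key=lambda i: norms[i])
    pruned = set(by_importance[:num_to_prune])
    # Stage 1 (during fine-tuning): pruned filters are zeroed but kept in place.
    zeroed = [[0.0] * len(f) if i in pruned else list(f) for i, f in enumerate(filters)]
    # Stage 2 (after fine-tuning): zeroed filters are removed from the layer.
    kept = [f for i, f in enumerate(zeroed) if i not in pruned]
    return zeroed, kept, sorted(pruned)


layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
zeroed, kept, pruned_idx = prune_filters(layer, pruning_ratio=0.5)
print("pruned filter indices:", pruned_idx)
print("filters remaining:", len(kept))
```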

Experimental methods

NNCF also provides state-of-the-art compression techniques that are still in the experimental stages of development and are only recommended for expert developers. These include:

  • Mixed-precision quantization

  • Sparsity

  • Binarization

To learn more about these methods, visit the NNCF repository on GitHub.