Filter Pruning of Convolutional Models

Introduction

Filter pruning is an advanced optimization method that reduces the computational complexity of a model by removing redundant or unimportant filters from its convolutional operations. The removal is done in two steps:

  1. Unimportant filters are zeroed out by the NNCF optimization with fine-tuning.

  2. Zero filters are removed from the model during the export to OpenVINO Intermediate Representation (IR).

The Filter Pruning method from NNCF can be used stand-alone, but we usually recommend stacking it with 8-bit quantization for two reasons. First, 8-bit quantization is the best method in terms of achieving the highest accuracy-performance trade-off, so stacking it with filter pruning can give even better performance results. Second, applying quantization along with filter pruning does not hurt accuracy much, since filter pruning removes noisy filters from the model, which narrows the value ranges of weights and activations and helps reduce the overall quantization error.

Note

Filter Pruning usually requires long fine-tuning or retraining of the model, which can be comparable in time to training the model from scratch; otherwise, it can cause a large accuracy degradation. Therefore, the training schedule should be adjusted accordingly when applying this method.

Below, we provide the steps that are required to apply Filter Pruning + QAT to the model:

Applying Filter Pruning with fine-tuning

Here, we show the basic steps to modify the training script for the model and use it to zero out unimportant filters:

1. Import NNCF API

In this step, NNCF-related imports are added at the beginning of the training script:

PyTorch:

import torch
import nncf  # Important - should be imported right after torch
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

TensorFlow 2:

import tensorflow as tf

from nncf import NNCFConfig
from nncf.tensorflow import create_compressed_model, create_compression_callbacks, \
                            register_default_init_args

2. Create NNCF configuration

Here, you should define the NNCF configuration, which consists of model-related parameters (the "input_info" section) and parameters of the optimization methods (the "compression" section).

PyTorch:

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 224, 224]},  # input shape required for model tracing
    "compression": [
        {
            "algorithm": "filter_pruning",
            "pruning_init": 0.1,
            "params": {
                "pruning_target": 0.4,
                "pruning_steps": 15
            }
        },
        {
            "algorithm": "quantization",  # 8-bit quantization with default settings
        },
    ]
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_loader)  # train_loader is an instance of torch.utils.data.DataLoader

TensorFlow 2:

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 224, 224]},  # input shape required for model tracing
    "compression": [
        {
            "algorithm": "filter_pruning",
            "pruning_init": 0.1,
            "params": {
                "pruning_target": 0.4,
                "pruning_steps": 15
            }
        },
        {
            "algorithm": "quantization",  # 8-bit quantization with default settings
        },
    ]
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_dataset, batch_size=1)  # train_dataset is an instance of tf.data.Dataset

Here is a brief description of the required parameters of the Filter Pruning method. For a full description, refer to the GitHub page.

  • pruning_init - initial pruning rate target. For example, the value 0.1 means that at the beginning of training, convolutions that can be pruned will have 10% of their filters set to zero.

  • pruning_target - pruning rate target at the end of the schedule. For example, the value 0.5 means that at epoch number num_init_steps + pruning_steps, convolutions that can be pruned will have 50% of their filters set to zero.

  • pruning_steps - the number of epochs during which the pruning rate target is increased from the pruning_init to the pruning_target value. We recommend keeping the highest learning rate during this period. A sketch of how the pruning rate ramps up is shown after this list.
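To make the interplay of these parameters concrete, below is a purely illustrative sketch that assumes a simple linear ramp of the pruning rate over the epochs; the actual NNCF pruning scheduler (configurable in the "params" section) may follow a different curve, such as an exponential one.

# Illustrative sketch only: assumes a linear ramp from pruning_init to
# pruning_target over pruning_steps epochs after num_init_steps warm-up epochs.
def pruning_rate_at_epoch(epoch, pruning_init=0.1, pruning_target=0.4,
                          pruning_steps=15, num_init_steps=0):
    if epoch < num_init_steps:
        return 0.0  # no pruning during the initial (warm-up) epochs
    progress = min(1.0, (epoch - num_init_steps) / pruning_steps)
    return pruning_init + (pruning_target - pruning_init) * progress

for epoch in range(20):
    print(f"epoch {epoch:2d}: target pruning rate = {pruning_rate_at_epoch(epoch):.3f}")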

3. Apply optimization methods

In the next step, the original model is wrapped by the NNCF object via the create_compressed_model() API, using the configuration defined in the previous step. This method returns a so-called compression controller and the wrapped model, which can be used the same way as the original model. It is worth noting that the optimization methods are applied at this step, so the model undergoes a set of corresponding transformations and can contain additional operations required for the optimization.

PyTorch:

model = TorchModel()  # instance of torch.nn.Module
compression_ctrl, model = create_compressed_model(model, nncf_config)

TensorFlow 2:

model = KerasModel()  # instance of tensorflow.keras.Model
compression_ctrl, model = create_compressed_model(model, nncf_config)

4. Fine-tune the model

This step assumes that you will apply fine-tuning to the model the same way as it is done for the baseline model. In the case of the Filter Pruning method, we recommend using a training schedule and learning rate similar to what was used for training the original model.

PyTorch:

... # fine-tuning preparations, e.g. dataset, loss, optimizer setup, etc.

# tune quantized model for 50 epochs as the baseline
for epoch in range(0, 50):
    compression_ctrl.scheduler.epoch_step()  # Epoch control API

    for i, data in enumerate(train_loader):
        compression_ctrl.scheduler.step()    # Training iteration control API
        ... # training loop body

TensorFlow 2:

... # fine-tuning preparations, e.g. dataset, loss, optimizer setup, etc.

# create compression callbacks to control pruning parameters and dump compression statistics
# all the settings are taken from compression_ctrl, i.e. from the NNCF config
compression_callbacks = create_compression_callbacks(compression_ctrl, log_dir="./compression_log")

# tune quantized model for 50 epochs as the baseline
model.fit(train_dataset, epochs=50, callbacks=compression_callbacks)

5. Multi-GPU distributed training

In the case of distributed multi-GPU training (not DataParallel), you should call compression_ctrl.distributed() before the fine-tuning. This informs the optimization methods to make the adjustments required to function in distributed mode.

PyTorch:

compression_ctrl.distributed()  # call it before the training loop

TensorFlow 2:

compression_ctrl.distributed()  # call it before the training

6. Export quantized model

When fine-tuning finishes, the quantized model can be exported to the corresponding format for further inference: ONNX in the case of PyTorch, and frozen graph for TensorFlow 2.

compression_ctrl.export_model("compressed_model.onnx")
compression_ctrl.export_model("compressed_model.pb") #export to Frozen Graph

These were the basic steps to apply the Filter Pruning method (stacked with QAT) from NNCF. However, in some cases it is necessary to save/load model checkpoints during training. Since NNCF wraps the original model with its own object, it provides an API for these needs.

7. (Optional) Save checkpoint

To save a model checkpoint, use the following API:

PyTorch:

checkpoint = {
    'state_dict': model.state_dict(),
    'compression_state': compression_ctrl.get_compression_state(),
    ... # the rest of the user-defined objects to save
}
torch.save(checkpoint, path_to_checkpoint)

TensorFlow 2:

from nncf.tensorflow.utils.state import TFCompressionState
from nncf.tensorflow.callbacks.checkpoint_callback import CheckpointManagerCallback

checkpoint = tf.train.Checkpoint(model=model,
                                 compression_state=TFCompressionState(compression_ctrl),
                                 ... # the rest of the user-defined objects to save
                                 )
callbacks = []
callbacks.append(CheckpointManagerCallback(checkpoint, path_to_checkpoint))
...
model.fit(..., callbacks=callbacks)

8. (Optional) Restore from checkpoint

To restore the model from a checkpoint, use the following API:

PyTorch:

resuming_checkpoint = torch.load(path_to_checkpoint)
compression_state = resuming_checkpoint['compression_state']
compression_ctrl, model = create_compressed_model(model, nncf_config, compression_state=compression_state)
state_dict = resuming_checkpoint['state_dict']
model.load_state_dict(state_dict)

TensorFlow 2:

from nncf.tensorflow.utils.state import TFCompressionStateLoader

checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
checkpoint.restore(path_to_checkpoint)
compression_state = checkpoint.compression_state.state

compression_ctrl, model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=model,
                                 ...)
checkpoint.restore(path_to_checkpoint)

For more details on saving/loading checkpoints in the NNCF, see the following documentation.

Deploying pruned model

The pruned model requires an extra step to get the performance improvement. This step involves the removal of the zero filters from the model. It is done at the model conversion step using the Model Optimizer tool, when the model is converted from the framework representation (ONNX, TensorFlow, etc.) to the OpenVINO Intermediate Representation.

  • To remove zero filters from the pruned model, add the following parameter to the model conversion command: --transform=Pruning (an example command is shown below).
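For example, assuming the ONNX model exported in step 6 (the file name is illustrative), the conversion command might look like this:

mo --input_model compressed_model.onnx --transform=Pruning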

After that the model can be deployed with OpenVINO in the same way as the baseline model. For more details about model deployment with OpenVINO, see the corresponding documentation.
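For illustration, below is a minimal inference sketch using the OpenVINO Runtime Python API. It assumes the conversion step above produced compressed_model.xml/.bin and that the input shape matches the "input_info" section of the NNCF config; the file names and device are illustrative.

# A minimal sketch, assuming the pruned IR was produced by Model Optimizer
# with --transform=Pruning; file names below are illustrative.
import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("compressed_model.xml")
compiled_model = core.compile_model(model, "CPU")

# Dummy input matching the "sample_size" from the NNCF "input_info" section
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
results = compiled_model.infer_new_request({0: input_data})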