Filter Pruning of Convolutional Models#
Introduction#
Filter pruning is an advanced optimization method that reduces the computational complexity of a model by removing redundant or unimportant filters from its convolutional operations. The removal is done in two steps:
Unimportant filters are zeroed out by the NNCF optimization with fine-tuning.
The zeroed filters are removed from the model during the export to OpenVINO Intermediate Representation (IR).
The Filter Pruning method from NNCF can be used stand-alone, but we usually recommend stacking it with 8-bit quantization, for two reasons. First, 8-bit quantization provides the best accuracy-performance trade-offs, so stacking it with filter pruning can give even better performance results. Second, applying quantization along with filter pruning does not hurt accuracy much, because filter pruning removes noisy filters from the model, which narrows the value ranges of weights and activations and helps to reduce the overall quantization error.
Note
Filter Pruning usually requires long fine-tuning or retraining of the model, which can be comparable to training the model from scratch; otherwise, it can cause a large accuracy degradation. Therefore, the training schedule should be adjusted accordingly when applying this method.
Below, we provide the steps that are required to apply Filter Pruning + QAT to the model:
Applying Filter Pruning with fine-tuning#
Here, we show the basic steps to modify the training script for the model and use it to zero out unimportant filters:
1. Import NNCF API#
In this step, NNCF-related imports are added in the beginning of the training script:
# PyTorch
import torch
import nncf  # Important - should be imported right after torch
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# TensorFlow 2
import tensorflow as tf
from nncf import NNCFConfig
from nncf.tensorflow import create_compressed_model, create_compression_callbacks, \
    register_default_init_args
2. Create NNCF configuration#
Here, you should define the NNCF configuration, which consists of model-related parameters (the “input_info” section) and the parameters of the optimization methods (the “compression” section).
# PyTorch
nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 224, 224]},  # input shape required for model tracing
    "compression": [
        {
            "algorithm": "filter_pruning",
            "pruning_init": 0.1,
            "params": {
                "pruning_target": 0.4,
                "pruning_steps": 15
            }
        },
        {
            "algorithm": "quantization",  # 8-bit quantization with default settings
        },
    ]
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_loader)  # train_loader is an instance of torch.utils.data.DataLoader
# TensorFlow 2
nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 224, 224]},  # input shape required for model tracing
    "compression": [
        {
            "algorithm": "filter_pruning",
            "pruning_init": 0.1,
            "params": {
                "pruning_target": 0.4,
                "pruning_steps": 15
            }
        },
        {
            "algorithm": "quantization",  # 8-bit quantization with default settings
        },
    ]
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_dataset, batch_size=1)  # train_dataset is an instance of tf.data.Dataset
Here is a brief description of the required parameters of the Filter Pruning method. For a full description refer to the GitHub page.
pruning_init - initial pruning rate target. For example, the value 0.1 means that at the beginning of training, convolutions that can be pruned will have 10% of their filters set to zero.
pruning_target - pruning rate target at the end of the schedule. For example, the value 0.5 means that at the epoch with the number num_init_steps + pruning_steps, convolutions that can be pruned will have 50% of their filters set to zero.
pruning_steps - the number of epochs during which the pruning rate target is increased from the pruning_init value to the pruning_target value. We recommend keeping the highest learning rate during this period, as shown in the sketch below.
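As a rough illustration of that recommendation, the sketch below keeps the base learning rate while the pruning rate is still growing and decays it only afterwards. This is a minimal sketch, not an NNCF API: the num_init_steps and pruning_steps values, the optimizer settings, and the decay factor are placeholders that should match your own configuration, and model is assumed to be the model being fine-tuned.
import torch

num_init_steps = 0   # epochs before pruning starts; 0 if the parameter is not set
pruning_steps = 15   # epochs over which the pruning rate grows, as in the configuration above

# placeholder optimizer settings
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def lr_lambda(epoch):
    # keep the highest (base) learning rate while the pruning rate is still increasing,
    # then switch to a 10x smaller rate for the rest of the fine-tuning
    return 1.0 if epoch < num_init_steps + pruning_steps else 0.1

lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)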
3. Apply optimization methods#
In the next step, the original model is wrapped by the NNCF object using the create_compressed_model() API with the configuration defined in the previous step. This method returns a so-called compression controller and the wrapped model, which can be used the same way as the original model. It is worth noting that the optimization methods are applied at this step, so the model undergoes a set of corresponding transformations and can contain additional operations required for the optimization.
# PyTorch
model = TorchModel()  # instance of torch.nn.Module
compression_ctrl, model = create_compressed_model(model, nncf_config)

# TensorFlow 2
model = KerasModel()  # instance of tensorflow.keras.Model
compression_ctrl, model = create_compressed_model(model, nncf_config)
4. Fine-tune the model#
This step assumes that you will apply fine-tuning to the model the same way as for the baseline model. In the case of the Filter Pruning method, we recommend using a training schedule and learning rate similar to those used for training the original model.
# PyTorch
...  # fine-tuning preparations, e.g. dataset, loss, optimization setup, etc.

# tune the compressed model for 50 epochs as the baseline
for epoch in range(0, 50):
    compression_ctrl.scheduler.epoch_step()  # Epoch control API
    for i, data in enumerate(train_loader):
        compression_ctrl.scheduler.step()  # Training iteration control API
        ...  # training loop body

# TensorFlow 2
...  # fine-tuning preparations, e.g. dataset, loss, optimization setup, etc.

# create compression callbacks to control pruning parameters and dump compression statistics
# all the settings are taken from compression_ctrl, i.e. from the NNCF config
compression_callbacks = create_compression_callbacks(compression_ctrl, log_dir="./compression_log")

# tune the compressed model for 50 epochs as the baseline
model.fit(train_dataset, epochs=50, callbacks=compression_callbacks)
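In the PyTorch case, you can also check that the pruning rate follows the configured schedule by printing the controller's compression statistics during fine-tuning. The snippet below is a sketch based on the usage in the NNCF samples; the exact statistics API may differ between NNCF versions.
# e.g. at the end of each epoch in the training loop above
statistics = compression_ctrl.statistics()
print(statistics.to_str())  # human-readable summary, including the current pruning level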
5. Multi-GPU distributed training#
In the case of distributed multi-GPU training (not DataParallel), you should call compression_ctrl.distributed() before fine-tuning. This informs the optimization methods to make the adjustments required to function in distributed mode.
# PyTorch
compression_ctrl.distributed()  # call it before the training loop

# TensorFlow 2
compression_ctrl.distributed()  # call it before the training
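In the PyTorch case, this call typically precedes wrapping the model with DistributedDataParallel. The minimal sketch below assumes the default process group has already been initialized by your launcher and that local_rank is a variable provided by your launch script; both are details of your own setup, not part of the NNCF API.
import torch

compression_ctrl.distributed()  # let NNCF adjust the optimization methods for distributed mode
# assumes torch.distributed.init_process_group() has been called by the launcher;
# local_rank is a hypothetical variable provided by your launch script
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
...  # regular distributed fine-tuning loop, as in step 4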
6. Export quantized model#
When fine-tuning finishes, the quantized model can be exported to the corresponding format for further inference: ONNX in the case of PyTorch, and a frozen graph in the case of TensorFlow 2.
compression_ctrl.export_model("compressed_model.onnx")
compression_ctrl.export_model("compressed_model.pb") #export to Frozen Graph
These were the basic steps to apply the Filter Pruning method (stacked with quantization) from NNCF. However, in some cases it is required to save/load model checkpoints during the training. Since NNCF wraps the original model with its own object, it provides an API for these needs.
7. (Optional) Save checkpoint#
To save a model checkpoint, use the following API:
# PyTorch
checkpoint = {
    'state_dict': model.state_dict(),
    'compression_state': compression_ctrl.get_compression_state(),
    ...  # the rest of the user-defined objects to save
}
torch.save(checkpoint, path_to_checkpoint)

# TensorFlow 2
from nncf.tensorflow.utils.state import TFCompressionState
from nncf.tensorflow.callbacks.checkpoint_callback import CheckpointManagerCallback

checkpoint = tf.train.Checkpoint(model=model,
                                 compression_state=TFCompressionState(compression_ctrl),
                                 ...  # the rest of the user-defined objects to save
                                 )
callbacks = []
callbacks.append(CheckpointManagerCallback(checkpoint, path_to_checkpoint))
...
model.fit(..., callbacks=callbacks)
8. (Optional) Restore from checkpoint#
To restore the model from a checkpoint, use the following API:
# PyTorch
resuming_checkpoint = torch.load(path_to_checkpoint)
compression_state = resuming_checkpoint['compression_state']
compression_ctrl, model = create_compressed_model(model, nncf_config, compression_state=compression_state)
state_dict = resuming_checkpoint['state_dict']
model.load_state_dict(state_dict)

# TensorFlow 2
from nncf.tensorflow.utils.state import TFCompressionStateLoader

checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
checkpoint.restore(path_to_checkpoint)
compression_state = checkpoint.compression_state.state

compression_ctrl, model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=model,
                                 ...)
checkpoint.restore(path_to_checkpoint)
For more details, see the following documentation.
Deploying pruned model#
The pruned model requires an extra step to get the performance improvement: the zeroed filters must be removed from the model. This is done at the model conversion step, using the model conversion API, when the model is converted from the framework representation (ONNX, TensorFlow, etc.) to OpenVINO Intermediate Representation.
To remove the zero filters from the pruned model, add the following parameter to the model conversion command:
transform=Pruning
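As an illustration, the conversion could also be done through the Python model conversion API, as sketched below. This assumes that convert_model accepts the same transform option as the command-line tool; the exact parameter names may differ between OpenVINO releases.
from openvino.tools.mo import convert_model
from openvino.runtime import serialize

ov_model = convert_model("compressed_model.onnx", transform="Pruning")  # zero filters are removed here
serialize(ov_model, "compressed_model.xml")  # saves the pruned IR (.xml + .bin)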
After that, the model can be deployed with OpenVINO in the same way as the baseline model. For more details about model deployment with OpenVINO, see the corresponding documentation.
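For instance, loading and compiling the pruned IR is done exactly as for the baseline model; the device name below is just an example.
from openvino.runtime import Core

core = Core()
ov_model = core.read_model("compressed_model.xml")    # pruned IR produced at the conversion step
compiled_model = core.compile_model(ov_model, "CPU")  # example target device
...  # run inference with compiled_model the same way as for the unpruned baseline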